Hi Everyone,
My name is Rob and I am Technical Director of Services at Rare. I'd like to take a moment to give some context on a couple of recent issues that have led to loss of progress for players.
When I last posted in January, we had suffered some very serious data loss over an extended period at Christmas. In the last few weeks we have had a couple of incidents that look similar from a player's perspective but are in fact a very different type of incident.
On the 25th of July and again on the 2nd of August, we received reports from players that Seasons notifications had stopped and that players were not being rewarded for handing in treasure or for the various other in-game activities that would otherwise represent progression.
The engineering team investigated and discovered that one of our services had stopped processing telemetry. That telemetry had been backing up inside one of our messaging servers for quite a while and had eventually consumed enough disk space to cause the entire messaging server to stop receiving new messages. The impact of this situation is that all new messages from existing players are lost and all connections from new players are refused.
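To make the failure mode concrete, here is a minimal sketch (not our actual stack, and all names and capacities are invented) of a broker whose consumer has stalled: nothing drains, the backlog grows until a disk watermark is hit, and from then on every new publish is refused and lost.

```python
# Toy broker with a disk-usage watermark. Once a stalled consumer lets
# the backlog exceed capacity, all new publishes are refused -- the
# failure mode described above. Purely illustrative.

class Broker:
    def __init__(self, disk_capacity_bytes: int):
        self.capacity = disk_capacity_bytes
        self.used = 0
        self.queue = []

    def publish(self, message: bytes) -> bool:
        # The consumer is stalled, so `used` only ever grows.
        if self.used + len(message) > self.capacity:
            return False  # broker refuses the message; the data is lost
        self.queue.append(message)
        self.used += len(message)
        return True

broker = Broker(disk_capacity_bytes=100)
results = [broker.publish(b"x" * 10) for _ in range(12)]
# The first 10 publishes fit; the last 2 are refused.
```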
How can this happen? How can we get into such a state and rely on the players to let us know?
The answer is that this situation has always been possible, and indeed has happened many times in the past, but we have automated alerting that notifies us when it occurs, in plenty of time for an engineer to respond and correct whatever has led to the failure. So this incident was really a failure of our automated alerting.
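The kind of alert described above can be sketched as a simple threshold rule over queue-depth samples; the thresholds, sample counts, and function names here are all invented for illustration, not our real alerting configuration.

```python
# Hedged sketch of a sustained-threshold alert: fire when a queue's
# depth stays above a limit for several consecutive samples, giving an
# engineer time to react long before the disk fills.

def should_alert(depth_samples: list, threshold: int, sustained: int = 3) -> bool:
    """True when the last `sustained` samples all exceed `threshold`."""
    recent = depth_samples[-sustained:]
    return len(recent) == sustained and all(d > threshold for d in recent)

# A queue that is steadily backing up trips the alert...
assert should_alert([10, 5_000, 6_000, 7_000], threshold=1_000)
# ...while normal fluctuation does not.
assert not should_alert([10, 20, 30, 40], threshold=1_000)
```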
For the first half of this year, our focus was on eliminating the capacity problems seen over Christmas. We were keenly aware of the upcoming Season Three release and wanted to ensure that we could satisfy any resulting increase in demand. Our solution was to partition our services architecture: each partition is tuned to serve a maximum concurrent user count, and if we need to support more players than that, we deploy another partition and distribute players across them.
This solution turned out to be both a blessing and a curse. It was a curse because it broke our monitoring in non-obvious ways, causing the issue that is the subject of this post: the telemetry that notifies us of queues building up was now inaccurate, so alerts were not firing when they should have, preventing the team from responding before players were impacted. It was a blessing because the fault occurred in a single partition, limiting the impact to active players on that partition. New players were still able to join and play, as they were routed to a healthy partition. Had this event occurred before our capacity changes, all players would have been affected and new players would have been unable to join. It would have taken many hours for us to resolve the issue, and the game would have been unavailable to any new players during that time.
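The post does not say exactly how the telemetry became inaccurate, but one plausible way partitioning breaks monitoring like this is a metric that predates the partitioning and still reads only the original partition, so a backlog elsewhere never surfaces. The sketch below illustrates that assumed failure mode with invented names and numbers.

```python
# Assumed (not confirmed) failure mode: a pre-partitioning metric that
# only reads the original partition reports "healthy" even while another
# partition's queue is backing up. Checking every partition fixes it.

def queue_depth_legacy(partitions: list) -> int:
    # Broken after partitioning: still reads only the original partition.
    return partitions[0]["queue_depth"]

def queue_depth_worst(partitions: list) -> int:
    # Fixed: the alert should consider the worst partition in the fleet.
    return max(p["queue_depth"] for p in partitions)

partitions = [{"queue_depth": 40}, {"queue_depth": 900_000}]
# The legacy metric looks healthy, so no alert fires...
legacy = queue_depth_legacy(partitions)   # 40
# ...while a per-partition view surfaces the backlog immediately.
worst = queue_depth_worst(partitions)     # 900000
```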
We are now working to resolve the issue with telemetry and to ensure that we can respond to problems before they impact players.
Sea of Thieves is a continually changing and evolving title and, from time to time, there are new opportunities for us to learn where our work has not lived up to players' expectations and to figure out how to be better.
Thank you for taking the time to read this post.
Kind Regards
Bobbles