Recent Loss of Data Incidents

  • Hi Everyone,

    My name is Rob and I am the Technical Director of Services for Rare. I'd like to take a moment to give some context on a couple of issues we have had recently that have led to loss of progress for players.

    Last time I posted, in January, we had had some very serious data loss that continued for an extended period over Christmas. In the last few weeks we have had a couple of incidents that look similar from a player's perspective but are in fact a very different type of incident.

    On the 25th of July and again on the 2nd of August, we received reports from players that Seasons notifications had stopped and that players were not being credited for handing in treasure or for the various other in-game activities that would otherwise count towards progression.

    The engineering team investigated and discovered that one of our services had stopped processing telemetry. That telemetry had been backing up inside one of the messaging servers we use for quite a while, and had in fact consumed enough disk space to cause the entire messaging server to stop accepting new messages. The impact of this situation is that all new messages from existing players are lost and all connections from new players are refused.
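
    To give a rough idea of the mechanics, here is a minimal sketch of the kind of backlog check our alerting is supposed to perform. It assumes a RabbitMQ-style management API (which we have talked about in previous posts); the host, credentials and threshold shown are placeholders for illustration, not our real configuration.

    ```python
    # Minimal sketch of a queue-backlog check, not production code.
    # Poll the message broker and page an engineer when a queue's backlog
    # grows past a threshold, well before it can fill the server's disk.
    import requests

    BROKER_API = "http://broker.example.internal:15672/api/queues"  # hypothetical host
    BACKLOG_THRESHOLD = 100_000  # unprocessed messages before we page someone


    def page_on_call(message: str) -> None:
        # Placeholder for whatever paging/alerting integration is in use.
        print("ALERT:", message)


    def check_queue_backlogs() -> None:
        queues = requests.get(BROKER_API, auth=("monitor", "secret"), timeout=10).json()
        for queue in queues:
            backlog = queue.get("messages", 0)
            if backlog > BACKLOG_THRESHOLD:
                page_on_call(f"Queue {queue['name']} has {backlog} unprocessed messages")


    if __name__ == "__main__":
        check_queue_backlogs()
    ```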

    How can this happen? How can we get into such a state and rely on the players to let us know?

    The answer is that this situation has always been possible, and indeed has happened many times in the past, but we have automated alerting that normally notifies us in plenty of time for an engineer to respond and correct whatever has led to the failure. So in reality, this incident was actually a failure of our automated alerting.

    For the first half of this year, our focus was on eliminating the capacity problems seen over Christmas. We were keenly aware of the upcoming Season Three release and wanted to ensure that we could satisfy any increase in demand as a result. Our solution was to partition our services architecture: each partition is tuned for a maximum concurrent user count, and if we need to support more than that, we deploy another partition and distribute players across them.
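
    As a simplified illustration of how this routing works in principle, the sketch below sends each new player to the least-loaded healthy partition. The partition names and capacity figures are invented for this example and are not our real values.

    ```python
    # Illustrative sketch of routing players across service partitions.
    from dataclasses import dataclass


    @dataclass
    class Partition:
        name: str
        max_players: int
        current_players: int
        healthy: bool = True


    def route_new_player(partitions: list[Partition]) -> Partition:
        """Send a new player to the least-loaded healthy partition with spare capacity."""
        candidates = [p for p in partitions
                      if p.healthy and p.current_players < p.max_players]
        if not candidates:
            raise RuntimeError("No healthy partition has spare capacity")
        chosen = min(candidates, key=lambda p: p.current_players / p.max_players)
        chosen.current_players += 1
        return chosen


    partitions = [Partition("partition-a", 50_000, 48_000),
                  Partition("partition-b", 50_000, 12_000)]
    print(route_new_player(partitions).name)  # -> partition-b
    ```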

    This solution turned out to be both a blessing and a curse. It was a curse because it broke our monitoring in non-obvious ways, causing the issue that is the subject of this post: the telemetry that notifies us about building queues was now inaccurate, so alerts were not firing when they should have, preventing the team from responding before players were impacted. It was a blessing because the fault occurred in a single partition, limiting the impact to active players on that partition. New players were still able to join and play, as they were routed to a healthy partition. Had this event occurred before our capacity changes, all players would have been affected and new players would have been unable to join; it would have taken many hours for us to resolve the issue, and the game would have been unavailable to any new players during that time.
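
    To show in the abstract how partitioning can undermine queue alerting, here is a small hypothetical example. The figures are invented and this is not a literal reproduction of our monitoring; it is just one way an alert that worked for a single deployment can stay silent once the metric spans several partitions.

    ```python
    # Hypothetical example: a fleet-wide alert stays silent while one partition drowns.
    QUEUE_DEPTH_ALERT = 100_000  # invented threshold

    partition_queue_depths = {
        "partition-a": 5_000,
        "partition-b": 4_000,
        "partition-c": 280_000,  # backed up and heading towards a full disk
    }

    # Aggregated view: the fleet-wide average never crosses the threshold.
    average_depth = sum(partition_queue_depths.values()) / len(partition_queue_depths)
    print("aggregate alert fires:", average_depth > QUEUE_DEPTH_ALERT)  # False

    # Per-partition view: the single unhealthy partition pages us straight away.
    for name, depth in partition_queue_depths.items():
        if depth > QUEUE_DEPTH_ALERT:
            print(f"per-partition alert fires for {name}: depth={depth}")
    ```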

    We are now working to resolve the issue with telemetry and ensure that we are able to respond when problems occur before they impact players.

    Sea of Thieves is a continually changing and evolving title and, from time to time, there are new opportunities for us to learn where our work has not lived up to the players' expectations and to figure out how to be better.

    Thank you for taking the time to read this post.

    Kind Regards

    Bobbles

  • yall will get the growing pains figured out

    thanks for the transparency

  • Sea of Bugs

  • Thank you for the explanation; I appreciate the insight

  • Thanks for working hard guys. Keep up the good work.

  • ty for letting the community know, we're all praying things get better, but complaining won't do anything ;)

  • Thank you for this guys. Best of luck for getting it all working intentionally!

  • ok rob thx for the explanation but I will continue to complain cuz I'm cool

  • Are you doing work on the servers right now, because I can't login. Strawberrybeard keeps telling me the services are unavailable.

  • @stumble-b-tuna Ahoy matey!

    Those issues are currently being investigated. Shouldn't be too long before it's smooth sailing again!

  • Thanks for the update, Rob. This kind of transparency is very much appreciated. Here's hoping your teams' hard work will pay off.

  • Thanks for everything you guys do! The clarification on this issue is extraordinarily helpful! Keep up the great work, this game is amazing!

  • The problem itself is fine since it has been resolved, but the fact that a lot of work has now been lost on the players' end is rather painful.

    Becoming Pirate Legends is tough enough...

  • Appreciate everything you do!

  • @stumble-b-tuna

  • @jakob22712 said in Recent Loss of Data Incidents:

    ok rob thx for the explanation but I will continue to complain cuz I'm cool

    This made me chuckle

  • This explanation makes sense and I am relieved it is not the same issue as was experienced over the winter.

    Thank you very much for the transparency.

  • Don't worry guys! :)

  • You would expect a company such as Rare to have monitoring that reports growth in disk usage, or services being stopped that shouldn't be stopped.

  • As a dedicated day 1 player and SoT streamer, all I ever ask for is explanations and transparency. This game has brought me into streaming, graphic design and video editing, and I can't thank you all enough for that. With explanations like this I can thoroughly explain these issues to my viewers and try to keep a positive outlook and mindset when it comes to Sea of Thieves. We appreciate your hard work and dedication and do understand that the covid lockdowns deffo did not help the situation. You guys and gals keep your chins up; in this case I speak for my small community and myself: thank you Rare for being so open with us! Keep up the hard work!

  • So you guys are going to change your telemetry, so that if a deployed instance isn't reporting 'everything is ok' it raises an alert, right?

    Lack of bad signals is not the same thing as a decent heartbeat...
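
    Something along these lines, purely as a sketch of the idea; the instance names and timings are made up:

    ```python
    # Dead man's switch sketch: alert when an instance stops saying "everything is ok",
    # rather than only alerting on explicit bad signals.
    import time

    HEARTBEAT_TIMEOUT_SECONDS = 300  # no healthy report for 5 minutes -> alert

    last_ok_report = {  # instance -> unix time of its last "everything is ok" heartbeat
        "telemetry-processor-1": time.time() - 60,
        "telemetry-processor-2": time.time() - 3_600,  # silent for an hour
    }

    now = time.time()
    for instance, last_seen in last_ok_report.items():
        silent_for = int(now - last_seen)
        if silent_for > HEARTBEAT_TIMEOUT_SECONDS:
            print(f"ALERT: {instance} has sent no heartbeat for {silent_for}s")
    ```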

  • Thanks for the transparency, keep up the good content!

  • Thanks for the breakdown. Your hard work does not go unnoticed.

  • rare pls fix i have 5 athenas stack on my gallyeion!!!!!!!1!

  • @seapost If you want I will keep an eye on them for you 👀

  • @seapost I’ve removed your posts as they can be perceived as spam and are against Forum Rules

    Spamming, Baiting and Trolling

    Posts and threads that are created in order to spam, cause unrest or troll the community will be locked, deleted and the users involved warned.

    These actions can include, but are not limited to:

    Creating threads, posts and content for the sole purpose of causing unrest
    Making off topic posts to derail the conversation
    Excessively using the same phrase, similar phrases, or gibberish
    Bullying and encouraging users to bully others

    Cheers!

  • Sorry, this was supposed to be an edit to my original post, but the cookies prevent me from posting to the forum most days, and I had to swap browsers several times in order to post just now. :-(


    You say the telemetry was inaccurate though, so I assume you were still getting signals from all instances, but were getting mixed signals of "queue full" and "queue fine", and something wasn't sorted out for separating the partitions?

    If true, I'm surprised (with hindsight) that this situation wasn't tested after the solution was put in place... One healthy and one unhealthy partition seems to be the normal failure case.

    I don't mean to be an armchair developer, but it's simply not enough information to understand whether this is something that should be forgiven easily or not.

    I thank you for the transparency that it wasn't the same issue as last time, and I trust that this was meant in good faith, but it feels like this is more an attempt to save face than to actually let the community know what happened (unlike previous communication with specifics about RabbitMQ etc.).


    As a player since launch, it feels like I've seen or heard of lost progress more times than I can count, whether it's specific instances of specific commendations, or it 'feeling' like lost progress due to long delays in processing, etc.

    I'm far more interested in a solution that

    1. Notifies players IN GAME when a delay is occurring, so they don't stack loot, can opt not to play, and can reduce load on failing servers.
    2. Records progress in a safe fallback, so that it can be replayed later (a rough sketch of the idea follows below).
    3. Sends players a summary once that progress has been counted.
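
    For point 2, a very rough sketch of what I mean; the event shape and function names are invented:

    ```python
    # Sketch: journal progression events to a durable fallback while the live
    # pipeline is unhealthy, then replay them in order once it recovers.
    import json

    FALLBACK_JOURNAL = "progress_fallback.jsonl"


    def send_to_live_pipeline(event: dict) -> None:
        print("processed:", event)  # placeholder for the real progression service


    def record_progress(event: dict, pipeline_healthy: bool) -> None:
        if pipeline_healthy:
            send_to_live_pipeline(event)
        else:
            # Append-only journal survives restarts and preserves ordering.
            with open(FALLBACK_JOURNAL, "a") as journal:
                journal.write(json.dumps(event) + "\n")


    def replay_fallback() -> None:
        """Replay journaled events once the live pipeline is healthy again."""
        with open(FALLBACK_JOURNAL) as journal:
            for line in journal:
                send_to_live_pipeline(json.loads(line))


    record_progress({"player": "ExamplePirate", "action": "sold_treasure", "gold": 1_200},
                    pipeline_healthy=False)
    replay_fallback()
    ```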

    At this point most regular players I know are fatigued enough from it that they no longer report to support when this happens.

  • It's really a pleasure to see posts like this. Thanks for being transparent.

  • Thanks for the update 👍🏼

  • As buggy as the game is, I respect the transparency and openness. Thank you

  • I appreciate these types of posts

  • Thx for the insight :)

    -GG

  • Thank you for the details, I trust you with my time.
