Recent Loss of Data Incidents

  • Hi everyone,

    My name is Rob and I'm Technical Director of services for Rare. I'm here because I wanted to take some time to talk to you about the issues we had over the festive period (and this past weekend), and how we're moving forward in preventing another issue of this nature.

    As I'm sure you're all aware, Sea of Thieves has gone from strength to strength in 2020. We've launched on Steam, we've continuously been releasing monthly content updates, and the seas became a place for people to connect when they couldn't meet in person. This all culminated in an extraordinarily busy Holiday period for Sea of Thieves. Our game proved extremely popular up to and throughout Christmas 2020, and the Holiday period was the most successful for the title since launch in terms of traffic.

    Alongside everything else we shipped to Sea of Thieves in the year, in 2020 we also introduced campaigns that allow us to schedule events for players to experience between updates.

    Around 8pm on the 28th December 2020, the service that is responsible for tracking campaign progression began falling behind in processing the stream of events that are used to indicate player progression. As we passed through our peak daily player count, many millions of messages were waiting to be processed where ordinarily we would process them all immediately.

    Given our popularity over this period, this was the first time that this service had experienced load at this level. As a result, it was taking longer and longer for the service to record and report completion of an event by the player. Ordinarily, we have several mitigations that we use to affect the performance of a service in response to the load applied to it. However, in this case those mitigations had little-to-no effect on the amount of events that the service processed, and because of this the queue of messages increased.

    Throughout the hours and days that followed, our engineers shipped several performance updates to the affected service in an effort to resolve the incident or at least minimise the impact - however, whilst we were managing to make improvements, we were unable to make sufficient improvement to meet the demand being placed on the service, and the problem persisted.

    As our analysis and incident response continued, it became apparent that no matter what changes we made, we kept hitting a ceiling of performance meaning that something else was actually limiting the amount of work that this service could perform. Eventually we managed to determine that an unrelated, downstream service was causing our events system to limit the amount of work that could be completed by the impacted service.

    This unrelated service is a new service, still under development, that we were auditioning behind the scenes to test it under load ahead of releasing new functionality in 2021. The purpose of auditioning the service was to validate that it would perform in retail conditions. It had been deployed in late November 2020, long before we saw any issues, and our telemetry was giving no indication that it was struggling to keep up or that it was quietly applying back pressure to upstream services.

    As the service that was causing the issue was only being auditioned and not actually in use by players yet, we disabled it and the impacted service immediately responded by clearing down the backlog and returning to normal operating performance. However, when we switched the service back on last week, we saw the same scenario unfold again despite the mitigations we had taken against it.
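As an illustration only (this is not Rare's code or architecture), the effect described above can be sketched as a small discrete-time model. It assumes a blocking hand-off through a bounded buffer between the campaign-tracking stage and a slower downstream stage; the function name `simulate` and all rates and capacities are hypothetical.

```python
def simulate(ticks, arrivals_per_tick, upstream_rate, downstream_rate,
             buffer_capacity, downstream_enabled=True, initial_backlog=0):
    """Model a two-stage pipeline with a bounded hand-off buffer.

    Returns (backlog_history, total_processed_by_upstream).
    All numbers are made up for illustration.
    """
    backlog = initial_backlog   # events waiting for the upstream service
    buffer_used = 0             # slots occupied in the hand-off buffer
    processed = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        if downstream_enabled:
            # The downstream service drains what it can this tick.
            buffer_used -= min(buffer_used, downstream_rate)
            # Blocking hand-off: upstream may only finish as many events
            # as there are free slots in the buffer (back pressure).
            free_slots = buffer_capacity - buffer_used
            done = min(backlog, upstream_rate, free_slots)
            buffer_used += done
        else:
            # With the downstream service switched off there is no hand-off,
            # so upstream is limited only by its own rate.
            done = min(backlog, upstream_rate)
        backlog -= done
        processed += done
        history.append(backlog)
    return history, processed


# Upstream could handle 150 events per tick, but the downstream stage
# only handles 60, so once the buffer fills the backlog grows every tick.
slow_history, slow_total = simulate(10, 100, 150, 60, 200)

# Disabling the downstream stage lets the same upstream clear the backlog.
fast_history, _ = simulate(6, 100, 150, 60, 200,
                           downstream_enabled=False,
                           initial_backlog=slow_history[-1])
```

In this toy model the upstream stage has spare capacity the whole time, yet its throughput settles at the downstream rate, which is why tuning the upstream service alone would hit a ceiling.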

    Here's how we are moving forward from this:

    • Carrying out a retrospective and root cause analysis of this type of incident.
    • Monitoring new services more closely and treating them with a natural suspicion during an impacting event.
    • Developing a better understanding and visibility of how services under pressure are impacting other services.
    • Looking at our architecture to break the chain of impact where one service can affect another's performance.
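One common way to break that kind of chain, shown here purely as a hedged sketch and not as what Rare runs, is to make the copy of events sent to a service under audition best-effort: the critical path never blocks on it, and dropped events are counted so the pressure is visible. The names `AuditionTap` and `process_events` are hypothetical.

```python
import queue


class AuditionTap:
    """Best-effort, bounded buffer feeding a service under audition."""

    def __init__(self, capacity):
        self._buffer = queue.Queue(maxsize=capacity)
        self.dropped = 0  # exposed so monitoring can alert on it

    def offer(self, event):
        """Try to enqueue without blocking; count the event if it is shed."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, limit):
        """Remove up to `limit` events for the auditioned service to consume."""
        items = []
        for _ in range(limit):
            try:
                items.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return items


def process_events(events, record_progress, tap):
    """Record player progress first; the audition copy can never stall it."""
    for event in events:
        record_progress(event)  # critical path
        tap.offer(event)        # non-blocking, may shed load


recorded = []
tap = AuditionTap(capacity=3)
process_events(list(range(10)), recorded.append, tap)
```

The trade-off is that the auditioned service sees an incomplete stream under load, which is acceptable only while nothing player-facing depends on it.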

    This is one of the highest impacting incidents we have had on Sea of Thieves since launch, and there's a lot to learn from what we experienced over this period. We know it wasn't a great period for Sea of Thieves players, and we're working hard to ensure the game's stability in future.

  • 66 Posts, 48.1k Views
  • Honestly, this is the reason that I stopped playing SoT, after 2+ years of sailing the seas. Over the past ~6 months, there have been more and more of these situations, where players grind hard for various activity completions, and then the progress is lost, "unrecoverable", and the make-good offering is paltry at best.

    In a game where literally the only progression is completing commendations to unlock cosmetics, and most of those commendations require us to perform the same activity over and over and over again, you guys losing track of that is just plain unacceptable.

    And then offering us gold/doubloons as a peace offering adds insult to injury. Those are easily earned back. Collecting and turning in 5 Ghost Captain skulls, defeating Flameheart 6 times, those are the types of things that we're losing out on, and when you can't recover the data, it completely eliminates any motivation that we have to actually grind out these repeated tasks.

  • You guys are on a roll!
    A sinking ship rather....

  • As a dev, I feel ya.

    Thanks for the update!

  • I lost my gilded voyage during the first incident, a total 8 hour grind, as well as a lot of Athena and reaper's rep.

    Something like this has never happened to me before and I've never felt more beaten down by a 'bug' in a game.

    Honestly, I wouldn't have been half as upset if I had been shown some compassion by the moderators, but every time I voiced my displeasure I was met only with ridicule from the mods.

  • At least the issue has been found and can be fixed. I know it's definitely not easy to fix stuff like that, just like hitreg, so hopefully it'll go smoothly for the rest of the year.

  • @Bobbles31 Thank you for being open and straightforward regarding the issue, and your approach towards fixing it.

  • I'm going to agree with Lorenz. It's sad and really freaking annoying that we lost challenge progress, but it's good that the issue is going to be fixed in the future, and that eventually this will be resolved. However, I might not give currency out as a reward, instead I'd bring back a time limited cosmetic from the Pirate Legend Hideout and make it cheap so people can get things that they might not be able to get until PL.

  • I am dumbfounded as to why the pending messages in the reward queue were not backed up prior to doing work to mitigate the issue which could be (and apparently was) destructive to the data. Horrible stewardship of the effort of this playerbase.

    Achieving things, especially commendations, in SoT is a grind. A grind which takes hours. The only reason players in my community knew to effectively stop playing in this most recent outage was our tie-in to the Official Discord, which also pings ours. Those players who were in game had no idea that there was a problem. You need to work on an in-game notification/warning system that alerts people that their progress may not count due to issues on your architecture side. The -only- sense we have of this as players in game is when crewmates start saying "wait a minute, I am not getting paid for this".

    Further, the prior outage spanning 10 days should have been announced AS SOON AS there was a known issue. You allowed players to sail for over a week without telling them there was a risk it would not count.

    In addition the compensation here for our time is inadequate. 20k Gold? You mean what I can get in about 15 mins in game?

    You are about to launch a system which is entirely reliant on recognition of completing objectives to receive rewards in Plunder Pass, and the system which supports that is 100% suspect to us all now. Grogmanay was illustrative of the problem. Challenges not counting, and having to attempt them over and over and hoping it clicks in the rewards queue, is not something that you should be pushing out until you KNOW you have this matter mitigated. Asking us to consider paying for the premium tier of the pass given these issues is just irresponsible until we have some surety.

    Finally, knowingly re-enabling the process you are auditioning, knowing it was the root cause of the first outage, in your production environment, without informing the player base? You KNEW you were potentially introducing a problem and told no one. Again, very irresponsible and disrespectful of the time the players put into the game.

    EDIT: Entirely Tone Deaf to announce Season One with the system backing it still questionable

  • Thanks for posting this. Appreciate the open and honest communication from the developers.

  • as a dev i'd love to see the kind of architecture behind the scenes running this. i've always assumed some sort of SOA approach with message queues. it makes sense.

    thanks for the transparency on what's been going on and good luck with the ongoing work. as someone who's had to deal with load related issues at work i know how rough things can get, and i can only imagine the sort of volume you need to handle here.

  • Thanks for the update. Have to agree with the above, though, that offering gold and doubloons is a paltry offering for players that have invested time to achieve grinding goals. You should perhaps offer free money to spend in the Pirate Emporium, so perhaps a "ship set" would be fair recompense to all players who logged in during this time.

  • As a Network Administrator I feel for you, thanks for the update.

    I just enjoy my pirating adventures with friends from the other side of the world mainly, not much can take that away!

  • @rcadden dude, I get it. I've played this game since launch, and it's irritating when things like this happen, but try putting yourself in their shoes. They have to constantly make sure that the servers are doing alright, fix bugs, make content, give us the content. What more can you ask from them? They basically explained that they'll try to prevent this from happening in the future, so you just need to relax. Just remember that all of us players suffered from this, and an outburst isn't going to help the situation.

  • @bobbles31 Thanks for the insight.

    Sometimes, things happen.

    Not everyone will be happy with the explanation but thank you and the team for your gesture of gold and doubloons.
    Something is always better than nothing :D

    @rcadden Me matey..... you say you’ve given up playing SoT, so how does this affect you? Lol :D

  • You are doing your best. Thanks for that.

  • @themetalman4011 Sorry you feel that way matey...

    We were just trying to inject a bit of fun into a bad situation... trying to lift the mood.

    A bit like the guys playing the instruments on the Titanic as it went down if you will :D
    (Not that this is the Titanic or sinking and I can't play the violin to save my life)

    Apologies!

  • I’m still missing ancient coins. Personally, I don’t care about normal coins or reputation; I wanted the ancient coins. Is there anything you can do?

  • @jacklc05 Ahoy matey!

    If you purchased these while the servers were having issues, I would highly suggest you put in a ticket using the link below. Player Support will be able to help you out and reimburse you if required.

    Contact Support

  • Stuff happens. Nobody wants the issues to happen, so I don't see any reason to get all huffy about it. Things will get fixed. Growing pains are common but temporary, and at least lots of people are playing.

    It's unfortunate that I lost out on the 5 shrouded ghosts that I killed and the 400 FOTDs that I finished during that time period, but it's alright.

  • @wolfmanbush 👀

  • Thanks for the transparency, Rob and team. I know it's not an easy situation to manage, especially with people working from home.

    That said, would it not have been better to do a "maintenance" when it was planned on the 1st, rather than allow it to get worse?

    From what we know, Saturday the 2nd was the worst affected day regarding lost data?

    Could this have been prevented with downtime allowing the services to catch up?

    Would the fallout from taking the servers down for half a day have outweighed the current fallout from lost progress, with confidence in the game's stability now lower than ever?

    I know hindsight is 20:20, but I am hoping lessons are learned from the incidents regarding pre-emptive measures.

    It might seem like whining but a lot of people lost time doing activities and time is not something you ever get back.

    Happy Sailing for 2021 and hope everyone on the team stays safe.

  • Here's the truth:

    -This isn't new; this has happened many times before.
    -You designed a game which requires at least 2-4+ hours to get anything meaningful done (according to commendations/travel/doing-tasks).
    -Your development strategies seem to favour new players over the veterans (many here have been with the game since Launch or earlier, because we care, listen and provide feedback). I understand it's great to get more new players into the game, but I'm saying you're not retaining those players in the long term.
    -New players from the Holiday break come into the game and experience these problems.
    -Data around this spike in players is actually lower than the peak when it first launched on Steam. And unfortunately, that trend is going down (maybe because new players are now coming round to the same way of thinking us Veterans have had all along...)

  • @lizalaroo I haven't played in a while, but I've got a PL20 pirate with a herd of pets and a load of coins, gold, doubloons just sitting there, friends who still play, and the new stuff they've added honestly looks cool.

    I'd like to sail the seas again, but I have absolutely zero confidence that anything I do would be worth the time to raise the anchor. It hit me directly last summer during one of the events - I grinded for 12 hours on a Saturday, turned in 10 Ghost Captain skulls (which would have left me within 4 of getting the sails), defeated Flameheart 2 times, and made good progress on a few other smaller commendations. That was the weekend that they lost progress. They told us multiple times over 3-4 days that it was fine, we could keep playing, they would catch up. And then they didn't, and they offered gold and doubloons, just like this. I have millions in gold and ~20k doubloons. I need none of that, I wanted the commendations I grinded for.

    With the latest new stuff they've added, I've considered jumping back in, but now 2x in the past 60 days they've just wholesale lost player progress again and done the same - "here's some gold/doubloons, sorry bout that mate".

  • I completely lost interest in both the 12 Days of Giving and Grogmanay events, as I completed several tasks several times over and they would never register. Seemed futile.

    But it was only to earn things I would never probably use anyway.

    2020 was a tough year, and the many lockdowns here in the UK couldn't have made things easy on you all at Rare. It was an irritating bug, but not the end of the world. I'm over it 😉

  • @rcadden There are plenty of us with gold and doubloons coming out our ears so why play for the grind of it? Why sail for 12 hours straight? Have you tried just playing for the fun of it? To take in whatever comes along? I used to play the way you did but in the end I stopped enjoying the game and it became a chore. Now when we sail we start with a voyage but take on whatever or whoever comes our way. It’s been so much more enjoyable. I’d give it another go. Maybe sail for a couple hours a session. Find the enjoyment for the game you once had. So what if some data was lost. At least you didn’t lose your pirate, or all your commendations you had already earned. Does it really matter? :)

  • Thanks for the insight and for working on preventing this type of incident, but I am still concerned about what is being done to prevent data loss if, unfortunately, that kind of thing happens again for different reasons.

    From what I understand, user data is saved only on the servers, and if the services are under heavy load and can't process and save all the incoming data, it is lost forever and unrecoverable. Although you can compensate players with an average amount of Gold or Doubloons, progression through commendations, reputation or achievements is not compensated. This is very concerning when things like that happen during time-limited events, especially with the Plunder Pass coming, and players are asked to do their "tasks" again because the progress they made during the outage couldn't be saved on the servers and is unrecoverable.

    I don't know much about this kind of thing, but can't players' data be stored locally on their devices and synced with the servers once they can process requests and save the data again?

    Player progression is probably more important than Gold and Doubloons, and personally I'd like to be reassured and have more visibility on whether things will be done to recover data that couldn't be saved on the servers during a potential outage like that.

  • Thanks for explaining.

    You'll forgive me for being annoyed by a compensation as small as 20k... during a Gold and Glory weekend.

    I'm fine when it's delayed, but when Support suggests we wait 24 hours... considering we did before voicing our complaint... it gets frustrating.

    Hope SEASON 1 is delayed until it's ready.

  • @rcadden
    I don't know how long you've been playing games but your response says "not very long".

    Every game is a grind. None of the first video games even had a save file, except maybe the score card at the end.

    As far as bugs go, even the original Pac-Man had bugs.

    Rare is experiencing high log-in numbers, and that caused 1 program to fail because of a cascade of information that led to the eventual corruption of data. They are working on it. The fact that this was the only failure and it can be fixed is great. They never got to stress test their programs at this magnitude before. It also means there are a lot of pirates still playing!

    And guess what!? The Sea Of Thieves Giveth and The Sea Of Thieves Taketh Away.

  • @voodoo-ic0n I understand the grind. I came to SoT from Ark, and am currently mostly playing Destiny. I get it, and I get that bugs are part of the program.

    My point is that they're acting like this is the first time it's happened, but it's not. It's happened several times over the past year, as far back as Fall 2019 when the Fort of the Damned was added, and their approach is always the same. "We'll fix it, keep playing, etc" "Oh, shoot, we can't fix it. Here's a "gift" that you could earn within 20 minutes in-game. Sorry about all the other stuff". In a game where literally the only progression is completing activities/commendations, there should be backups upon backups to ensure that progress is never lost.

    They have had, over the past what, 3 years, plenty of opportunity to stress-test their systems. Any good developer also knows that you don't build/launch the system with it right up against its limits. You build in some slack so that if you DO get a surge, you're covered for a bit to give you an allowance under which to fix things.

    If they're aware that the progression tracking isn't working, it's irresponsible and frankly pretty hostile to their players to stick to the party line of "keep playing, we'll fix it". They should take the servers offline to fix it.

  • @rcadden said in Recent Loss of Data Incidents:

    They have had, over the past what, 3 years, plenty of opportunity to stress-test their systems. Any good developer also knows that you don't build/launch the system with it right up against its limits. You build in some slack so that if you DO get a surge, you're covered for a bit to give you an allowance under which to fix things.

    I initially had the same thought, particularly when the game uses Microsoft's Azure servers that should scale to meet demand. It wouldn't be a good look for MS if that wasn't happening.

    However, going on what Rob posted, it doesn't sound like scalability is the issue (servers would have been down, if that was the case) but rather a link in the chain is broken when the data gets too great to process. Rob also says this functionality was brought in to test ahead of what I assume is the Season Pass tracking this year. This is not something they had planned from launch, so I can see how stress testing it would not have been possible to the extent that it was needed over the holidays.

    They likely thought the holiday events would be enough of a test but were overconfident that it could be handled. Clearly, it couldn't.

    As much as I don't like what happened, at least it happened now before the Season Pass comes and they can work on fixing it.

    I do think the compensation was insufficient. As someone who 100%ed the events, I would have no problem in them retroactively awarding all goals to anyone who sailed during times where tracking was down. I think that would be only fair.

  • I've lost more stuff to server related issues than players sinking me.... sadge :(

  • "Given our popularity over this period, this was the first time that this service had experienced load at this level. As a result, it was taking longer and longer for the service to record and report completion of an event by the player. Ordinarily, we have several mitigations that we use to affect the performance of a service in response to the load applied to it. However, in this case those mitigations had little-to-no effect on the amount of events that the service processed, and because of this the queue of messages increased."

    How can you say this when the exact same thing happened in 2019 with the Gold and Glory weekend right after the holidays? You say this was the first time, but it wasn't.
    You also had these problems with the system yet again last weekend, when you announced the extra Gold and Glory weekend at the last minute.
    You try to be open, yet you are not giving exact numbers for anything you mention. This reads more as a sad and sloppy excuse to please the crowd than as a proper analysis of what has been happening behind the scenes.

    Sure, I understand nobody at Rare wants to see these types of failures, but stop pretending this is something new and stop slapping your playerbase in the face with your pathetic compensations.

  • Is this really a joke?

    After 2 years, always the same bug!

    My team and another ship farmed 15 FOTDs for the whole event, maybe about 8 to 12 million gold and 7 hours lost for nothing because Rare is incapable (we lost an ancient skeleton too!)

    Shame on you! You're just making a joke of the players.

    I'm really disgusted that Rare doesn't take players into account!

  • I work in IT. Different causes can have the same effect, so I have no reason not to believe Rob if he says this is the first time this has happened.

    If the data was SQL based, large portions of it could have been held mainly in RAM on multiple servers, and you cannot always retrieve data free of corruption from log files.
