Hi Everyone,
My name is Rob and I am Technical Director of Services at Rare. I'd like to take a moment to give some context on a couple of recent issues that have led to loss of progress for players.
When I last posted in January, we had suffered some very serious data loss over an extended period at Christmas. In the last few weeks we have had a couple of incidents that look similar from a player's perspective but are in fact a very different type of incident.
On the 25th of July and again on the 2nd of August, we received reports from players that Seasons notifications had stopped and that players were not being rewarded for handing in treasure or for the various other in-game activities that would otherwise represent progression.
The engineering team investigated and discovered that one of our services had stopped processing telemetry. That telemetry had been backing up inside one of our messaging servers for quite a while and had eventually consumed enough disk space to cause the entire messaging server to stop receiving new messages. The impact of this situation is that all new messages from existing players are lost and all connections from new players are refused.
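To make the failure mode concrete, here is a minimal sketch (not our actual stack, and all names and capacities are invented) of a broker whose consumer has stalled: nothing drains, the backlog grows until a disk watermark is hit, and from then on every new publish is refused and lost.

```python
# Toy broker with a disk-usage watermark. Once a stalled consumer lets
# the backlog exceed capacity, all new publishes are refused -- the
# failure mode described above. Purely illustrative.

class Broker:
    def __init__(self, disk_capacity_bytes: int):
        self.capacity = disk_capacity_bytes
        self.used = 0
        self.queue = []

    def publish(self, message: bytes) -> bool:
        # The consumer is stalled, so `used` only ever grows.
        if self.used + len(message) > self.capacity:
            return False  # broker refuses the message; the data is lost
        self.queue.append(message)
        self.used += len(message)
        return True

broker = Broker(disk_capacity_bytes=100)
results = [broker.publish(b"x" * 10) for _ in range(12)]
# The first 10 publishes fit; the last 2 are refused.
```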
How can this happen? How can we get into such a state and rely on the players to let us know?
The answer is that this situation has always been possible, and indeed has happened many times in the past, but we have automated alerting that notifies us when it occurs, in plenty of time for an engineer to respond and correct whatever has led to the failure. So this incident was really a failure of our automated alerting.
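The kind of alert described above can be sketched as a simple threshold rule over queue-depth samples; the thresholds, sample counts, and function names here are all invented for illustration, not our real alerting configuration.

```python
# Hedged sketch of a sustained-threshold alert: fire when a queue's
# depth stays above a limit for several consecutive samples, giving an
# engineer time to react long before the disk fills.

def should_alert(depth_samples: list, threshold: int, sustained: int = 3) -> bool:
    """True when the last `sustained` samples all exceed `threshold`."""
    recent = depth_samples[-sustained:]
    return len(recent) == sustained and all(d > threshold for d in recent)

# A queue that is steadily backing up trips the alert...
assert should_alert([10, 5_000, 6_000, 7_000], threshold=1_000)
# ...while normal fluctuation does not.
assert not should_alert([10, 20, 30, 40], threshold=1_000)
```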
For the first half of this year, our focus was on eliminating the capacity problems seen over Christmas. We were keenly aware of the upcoming Season Three release and wanted to ensure that we could satisfy any resulting increase in demand. Our solution was to partition our services architecture: each partition is tuned to serve a maximum concurrent user count, and if we need to support more players than that, we deploy another partition and distribute players across them.
This solution turned out to be both a blessing and a curse. It was a curse because it broke our monitoring in non-obvious ways, causing the issue that is the subject of this post: the telemetry that notifies us of queues building up was now inaccurate, so alerts were not firing when they should have, preventing the team from responding before players were impacted. It was a blessing because the fault occurred in a single partition, limiting the impact to active players on that partition. New players were still able to join and play, as they were routed to a healthy partition. Had this event occurred before our capacity changes, all players would have been affected and new players would have been unable to join. It would have taken many hours for us to resolve the issue, and the game would have been unavailable to any new players during that time.
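The post does not say exactly how the telemetry became inaccurate, but one plausible way partitioning breaks monitoring like this is a metric that predates the partitioning and still reads only the original partition, so a backlog elsewhere never surfaces. The sketch below illustrates that assumed failure mode with invented names and numbers.

```python
# Assumed (not confirmed) failure mode: a pre-partitioning metric that
# only reads the original partition reports "healthy" even while another
# partition's queue is backing up. Checking every partition fixes it.

def queue_depth_legacy(partitions: list) -> int:
    # Broken after partitioning: still reads only the original partition.
    return partitions[0]["queue_depth"]

def queue_depth_worst(partitions: list) -> int:
    # Fixed: the alert should consider the worst partition in the fleet.
    return max(p["queue_depth"] for p in partitions)

partitions = [{"queue_depth": 40}, {"queue_depth": 900_000}]
# The legacy metric looks healthy, so no alert fires...
legacy = queue_depth_legacy(partitions)   # 40
# ...while a per-partition view surfaces the backlog immediately.
worst = queue_depth_worst(partitions)     # 900000
```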
We are now working to resolve the issue with telemetry and to ensure that we can respond to problems before they impact players.
Sea of Thieves is a continually changing and evolving title and, from time to time, there are new opportunities for us to learn where our work has not lived up to players' expectations and to figure out how to be better.
Thank you for taking the time to read this post.
Kind Regards
Bobbles