Hi, I'm Rob Stothard, Technical Director of Services at Rare. I wanted to give a little insight into what happened to Sea of Thieves on Saturday 8th December. For those that don't know, at around 8am (GMT) a proportion of players started receiving CinnamonBeard errors when trying to join Sea of Thieves, and players that were already in game at that time started to have issues when handing in chests or completing commendations.
Unfortunately, the issue did not trigger any of our automated alerting, and it continued for about an hour before we became aware of it and started to investigate.
The issue manifested as a failure to find a server during server matchmaking, resulting in a CinnamonBeard error at the client. As we started to investigate, it appeared that all the services involved in Server Matchmaking were working fine, and that the problem must be in some other part of the system. The matchmaking process in Sea of Thieves is a two-step process: first we use Xbox Live matchmaking to organise players into Crews, and then we use our own Server Matchmaking service to find a suitable server for that Crew. Having determined that our own matchmaking was operating normally, we asked the Xbox Live team whether there was an issue with Live Matchmaking, and they also reported that there were no issues.
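The two-step flow above can be sketched roughly as follows. All of the names here are hypothetical stand-ins for internal services, purely to illustrate the shape of the process:

```python
# Illustrative sketch of the two-step matchmaking flow described above.
# The real Xbox Live and Server Matchmaking APIs are internal; these
# functions and their return values are invented for illustration only.

def xbox_live_matchmake(players):
    """Step 1: group individual players into a Crew (hypothetical stand-in)."""
    return {"crew_id": "crew-1", "members": list(players)}

def server_matchmake(crew):
    """Step 2: find a suitable game server for that Crew (hypothetical stand-in)."""
    return {"crew_id": crew["crew_id"], "server": "game-server-42"}

crew = xbox_live_matchmake(["player-a", "player-b"])
placement = server_matchmake(crew)
print(placement["server"])
```

The point of the split is that Crew formation and server placement are separate concerns, which is why both halves had to be checked independently during the investigation.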
Having determined that the specific services involved were all operating fine, we started to investigate lower down in our technology stack. For Sea of Thieves, some of our inter-service communication uses a Pub/Sub architecture backed by a cluster of RabbitMQ servers acting as the Message Broker. At around 8am that morning, some of the nodes in the cluster had suffered an interruption to network connectivity, resulting in the cluster becoming split. Essentially, our RabbitMQ servers had split into three discrete clusters, each one unable to communicate with the other two: what RabbitMQ refers to as a Network Partition, or Split Brain.
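To make the "three discrete clusters" idea concrete, here is a small sketch: if each node reports which peers it can still reach, grouping nodes into connected components shows how many separate clusters the partition has produced. The node names and reachability views are invented for illustration:

```python
# Sketch of what a "split brain" means: each node can only see peers on its
# side of the network partition, so the cluster decomposes into several
# discrete groups. Node names here are hypothetical.

def components(reachable):
    """Group nodes into connected clusters from each node's reachability view."""
    seen, clusters = set(), []
    for start in reachable:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(reachable[node])
        seen |= cluster
        clusters.append(sorted(cluster))
    return sorted(clusters)

# After the partition, each node's view covers only its own side.
views = {
    "rabbit-1": {"rabbit-2"}, "rabbit-2": {"rabbit-1"},
    "rabbit-3": {"rabbit-4"}, "rabbit-4": {"rabbit-3"},
    "rabbit-5": set(),
}
print(len(components(views)))  # three discrete clusters
```

Each fragment carries on accepting publishes and serving consumers with its own view of the world, which is why the symptoms at the game-service layer were so inconsistent.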
Once we discovered the root cause we were able, fairly quickly, to bring the cluster back into a good state and then gradually bring the various affected game services back online.
There are a lot of things that we have learnt from this incident and that we will take away and correct.
RabbitMQ can be configured to self-heal when a Network Partition occurs. We have now made this change to our RabbitMQ installation, and it will make its way to our production environment in the coming weeks.
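For anyone curious, RabbitMQ's partition-handling behaviour is set via the `cluster_partition_handling` key in `rabbitmq.conf`. We haven't detailed which strategy we chose, but a typical configuration looks like this:

```ini
# rabbitmq.conf -- recover automatically after a network partition.
# "autoheal" picks a winning partition and restarts nodes on the losing
# side(s) so the cluster rejoins itself; "pause_minority" is the common
# alternative, pausing minority-side nodes for the duration of the partition.
cluster_partition_handling = autoheal
```

The default, `ignore`, leaves a partitioned cluster running split until an operator intervenes, which is the behaviour we experienced here.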
To enable us to respond more quickly, we will add additional monitoring and alerting to the title. In this particular case, the most telling symptom was the failure of players to join the game. This is a metric that we already track but do not alert on; we are now in the process of adding alerts to this metric.
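The kind of alert described above can be sketched as a simple threshold check over a window of join attempts. The function name, window, and thresholds below are illustrative, not our actual monitoring configuration:

```python
# Hypothetical sketch of an alert on the join-failure metric: fire when a
# window has enough traffic to judge and too large a share of joins failed.
# Thresholds here are invented for illustration.

def should_alert(attempts, failures, min_attempts=100, max_failure_rate=0.25):
    """Alert when enough joins were attempted and too many of them failed."""
    if attempts < min_attempts:
        return False  # too little traffic in the window to judge
    return failures / attempts > max_failure_rate

print(should_alert(attempts=500, failures=20))   # a healthy window
print(should_alert(attempts=500, failures=300))  # an incident-like window
```

Alerting on the player-facing symptom rather than any single backend component is what would have caught this incident regardless of which service was actually at fault.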
RabbitMQ itself was indicating very clearly that it was having problems; however, the status of RabbitMQ is not as readily visible to our Ops and Out Of Hours teams as it should be. Therefore we will be making additional information available to the Operations teams to allow them to diagnose issues like this much more quickly.
Unfortunately, due to the nature of the incident there were a range of ways in which players could be impacted, from not being able to get into the game to emblem progress not being recorded correctly. As a result, we have decided to award anyone that played during the window in-game compensation when they next play the game.