Some information about the issues on Saturday 8th December 2018

  • 29

    Hi, I am Rob Stothard, Technical Director of Services at Rare. I wanted to give a little insight into what happened to Sea of Thieves on Saturday 8th December. For those who don't know, at around 8am (GMT) a proportion of players started receiving CinnamonBeard errors when trying to join Sea of Thieves, and players who were in game at that time started to have issues when handing in chests or completing commendations.

    Unfortunately, the issue did not trigger any of our automated alerting and continued for about an hour before we became aware and started to investigate.

    The issue manifested as a failure to find a server during server matchmaking, resulting in a CinnamonBeard error at the client. As we started to investigate, it appeared that all the services involved in Server Matchmaking were working fine and that the problem must be in some other part of the system. Matchmaking in Sea of Thieves is a two-step process: first we use Xbox Live matchmaking to organise players into Crews, and then we use our own Server Matchmaking service to find a suitable server for that Crew. Having determined that our own matchmaking was operating normally, we asked the Xbox Live team whether there was an issue with Live Matchmaking, and they also reported that there were no issues.

    Having determined that the specific services involved were all operating fine, we started to investigate lower down in our technology stack. For Sea of Thieves, some of our inter-service communication uses a Pub/Sub architecture backed by a cluster of RabbitMQ servers as the Message Broker. At around 8am that morning, some of the Nodes in the cluster had suffered an interruption to network connectivity, resulting in the cluster becoming split. Essentially, our RabbitMQ servers had split into three discrete clusters, each one not communicating with the other two: what RabbitMQ refers to as a Network Partition or Split Brain.
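    As an illustration only (this is not Rare's code), the "three discrete clusters" situation can be modelled as finding groups of nodes that can still see each other. The sketch below assumes a hypothetical map from each node to the peers it can currently reach; the node names and the `find_partitions` helper are invented for the example.

```python
from collections import deque


def find_partitions(reachable):
    """Group nodes into partitions from a node -> set-of-reachable-peers map."""
    seen = set()
    partitions = []
    for start in sorted(reachable):
        if start in seen:
            continue
        group = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for peer in reachable.get(node, ()):
                # Count a link only if both ends can see each other.
                if peer not in seen and node in reachable.get(peer, ()):
                    seen.add(peer)
                    queue.append(peer)
        partitions.append(sorted(group))
    return partitions


# Hypothetical five-node cluster after a network interruption.
cluster_view = {
    "rabbit@a": {"rabbit@b"},
    "rabbit@b": {"rabbit@a"},
    "rabbit@c": {"rabbit@d"},
    "rabbit@d": {"rabbit@c"},
    "rabbit@e": set(),
}

partitions = find_partitions(cluster_view)
is_split = len(partitions) > 1  # more than one group means the cluster is split
```

    With this sample input the nodes fall into three groups, none of which can talk to the others, so a message published in one group never reaches consumers connected to another.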

    Once we discovered the root issue we were able, fairly quickly, to bring the cluster back into a good state and then gradually bring the various affected game services back online.

    There are a lot of things that we have learnt from this incident and that we will take away to correct.

    RabbitMQ can be configured to self-heal when a Network Partition occurs. We have now made this change to our RabbitMQ install, and it will make its way to our production environment in the coming weeks.

    To enable us to respond more quickly, we will add additional monitoring and alerting to the title. In this particular case the most telling symptom was the failure of players to join the game. This is a metric that we already track but do not alert on, so we are now in the process of adding alerts to this metric.
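    A minimal sketch of what alerting on a join-failure metric could look like, assuming a sliding time window and a failure-rate threshold. The class name, window length, threshold, and minimum sample size are all hypothetical values chosen for the example, not details from the post.

```python
from collections import deque


class JoinFailureAlert:
    """Track recent join attempts and flag when the failure rate is too high."""

    def __init__(self, window_seconds=300, threshold=0.2, min_attempts=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_attempts = min_attempts
        self.events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, timestamp, succeeded):
        """Record one join attempt and drop attempts older than the window."""
        self.events.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def failure_rate(self):
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def should_alert(self):
        # Require a minimum sample so a quiet period cannot trigger an alert.
        enough_data = len(self.events) >= self.min_attempts
        return enough_data and self.failure_rate() >= self.threshold


alert = JoinFailureAlert(window_seconds=60, threshold=0.2, min_attempts=10)
for t in range(10):
    alert.record(t, succeeded=True)
healthy_state = alert.should_alert()
for t in range(10, 20):
    alert.record(t, succeeded=False)
incident_state = alert.should_alert()
```

    The minimum-attempts guard is a design choice: without it, a single failed join at a quiet hour would produce a 100% failure rate and a false alarm.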

    RabbitMQ itself was indicating very clearly that it was having problems. However, the status of RabbitMQ is not as readily visible to our Ops and Out Of Hours teams as it should be. Therefore we will be making additional information available to the Operations teams to allow them to diagnose issues like this much more quickly.
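    One way broker status can be surfaced to an operations team is to summarise the per-node data that the RabbitMQ management plugin reports. The sketch below assumes the input is the already-parsed JSON list from the management API's nodes listing, where each node object carries `name`, `running`, and `partitions` fields; the sample payload and the `summarise_cluster_health` helper are hypothetical, and the post does not say how Rare exposes this information.

```python
import json


def summarise_cluster_health(nodes):
    """Return human-readable problems found in a list of node status objects."""
    problems = []
    for node in nodes:
        name = node.get("name", "<unknown>")
        if not node.get("running", False):
            problems.append(f"{name}: not running")
        # A non-empty list means this node believes it is cut off from those peers.
        partitioned_from = node.get("partitions") or []
        if partitioned_from:
            peers = ", ".join(sorted(partitioned_from))
            problems.append(f"{name}: partitioned from {peers}")
    return problems


# Hypothetical payload shaped like a nodes listing during a partition.
sample = json.loads("""
[
  {"name": "rabbit@a", "running": true, "partitions": ["rabbit@b"]},
  {"name": "rabbit@b", "running": true, "partitions": ["rabbit@a"]},
  {"name": "rabbit@c", "running": true, "partitions": []}
]
""")

report = summarise_cluster_health(sample)
```

    An empty report means no node is down or partitioned; anything else is worth putting on a dashboard or paging on.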

    Unfortunately, due to the nature of the incident, there was a range of ways in which players could be impacted, from not being able to get into the game to emblem progress not being recorded correctly. As a result, we have decided to award anyone who played during the window in-game compensation when they next play the game.

  • 7

    @bobbles31 Thanks Rob greatly appreciate the info and update!

  • 3

    It's really cool of you guys to elaborate a bit, and to compensate players. Keep up the great work!

  • 3

    @bobbles31 Very cool. I'm currently working at a place that is migrating from WebSphere to AWS, and I got zero Java EE experience in college. I'm hoping to get into the game industry in the future and didn't realize how much my experiences here could translate to that industry :)

  • 4

    @bobbles31 Thank you once again for being so open and honest about the issues you had. Fortunately it didn't have any impact on myself, but fair play to Rare for realising it may have to many others, and you are now in control of it. Many thanks Rob, keep up the good work ;)

  • 3

    @bobbles31 Thanks for the transparency! Always great to feel like we are kept in the loop!

  • 3

    @bobbles31 Thanks for the update. Also an interesting little insight!

    I gather this is why some gold and doubloons dropped in to my account when I logged in just now?

  • 4

    @luciansanchez82 No that drop was from me - I heard you needed a quick fix loan... remember 39.9% APR (variable) - I'll be in contact with T's & C's shortly ;)

  • 0

    While that all went way over my head aha! What’s that about rabbits? Pirate rabbits maybe? God knows.

    Either way thanks for keeping us informed :D

  • 3

    Last thing I want is to be in @j4dio's pocket!

    You won't see penny one from me!

  • 4

    @knifelife Yes RabbitsMQ (also known as: Rabbits Myxomatosis Qualified) is confirmed as pet's incoming..... and they bring Cinnamon for a Yuletide warm drink... ;)

  • 3

    @knifelife https://media.tenor.com/images/b8cc3152a343ac5ea5723a8edb0c5f45/tenor.gif

  • 7

    @knifelife Rabbits confirmed! We will need to feed them carrots on voyages

  • 4

    endless carrots for cottontail voyages..lol

  • 5

    @IceMan-0007 @J4dio @DuMy2008

    Year of the pirate rabbit confirmed!

    https://goo.gl/images/sbLJ6E

  • 4

    @j4dio said in Some information about the issues on Saturday 8th December 2018:

    @knifelife Yes RabbitsMQ (also known as: Rabbits Myxomatosis Qualified) is confirmed as pet's incoming..... and they bring Cinnamon for a Yuletide warm drink... ;)

    Naaah I'm sure it's the code name for fishing! ;)

  • 1

    Interesting insight into how the matchmaking process works. Thank you for keeping us updated.

  • 0

    @bobbles31 Could you define what the in-game compensation is? Is it a flat Gold payment or BR Doubloons, or a combination?

    Also, what then happened to all the message requests that didn't get passed through the cluster?

    Did they even get logged?

    or

    Were they just dropped? And if they were logged, did they get deleted once the clusters were recalibrated?

    Also, would it not be possible to set up a log for all requests managed by the MQ server?

  • 1

    @enf0rcer said in Some information about the issues on Saturday 8th December 2018:

    @bobbles31 Could you define what the in-game compensation is? Is it a flat Gold payment or BR Doubloons, or a combination?

    Also, what then happened to all the message requests that didn't get passed through the cluster?

    Did they even get logged?

    or

    Were they just dropped? And if they were logged, did they get deleted once the clusters were recalibrated?

    Also, would it not be possible to set up a log for all requests managed by the MQ server?

    Also thanks for the insight. As a former network professional it's nice to get a clear answer as to what happened. I do appreciate the transparency.

  • 1

    That's where the rewards came from. Couldn't figure it out last week.
    I logged in to receive:

    +1000 gold
    +5000 gold
    +10 doubloons

    Thanks :D
