Solving The Scaling Riddle

// September 27th, 2008 // amazon ec2, amazon s3, chesscube, Openfire, programming, XMPP

Monday is a very big day for ChessCube. We are launching version 3 of our online chess playing client. Along with a redesigned interface, we have made some significant changes under the hood.

Version 2 of ChessCube was suffering under the immense load created by the site's growing popularity. Under this load we were seeing messages get lost, very long login times, client interface crashes and, worst of all, the dreaded lag spikes. Lag is the time a message takes to travel from the Flash client sitting in the browser to the server. A lag spike is when the overall lag in the system jumps for every user on it. After hunting down the cause of these spikes we concluded that the system simply could not cope with the sheer volume of messages flowing between clients and the server. Moving our server software to a larger machine would, given the site's steady growth, only buy us a few more months. So the decision was taken to cluster the server software.
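To make "lag" concrete, here is a rough sketch of how a round trip through the server could be timed. Our client is Flash, so this is purely illustrative: it uses the Java Smack library instead, and the host and credentials are placeholders.

    import org.jivesoftware.smack.PacketListener;
    import org.jivesoftware.smack.XMPPConnection;
    import org.jivesoftware.smack.filter.PacketTypeFilter;
    import org.jivesoftware.smack.packet.Message;
    import org.jivesoftware.smack.packet.Packet;

    public class LagProbe {
        public static void main(String[] args) throws Exception {
            XMPPConnection conn = new XMPPConnection("chesscube.com"); // placeholder host
            conn.connect();
            conn.login("probe", "secret"); // placeholder credentials

            // Listen for our own message being routed back by the server.
            conn.addPacketListener(new PacketListener() {
                public void processPacket(Packet packet) {
                    long sentAt = Long.parseLong(((Message) packet).getBody());
                    System.out.println("round trip: "
                            + (System.currentTimeMillis() - sentAt) + " ms");
                }
            }, new PacketTypeFilter(Message.class));

            // Address the message to ourselves so the server routes it straight back.
            Message probe = new Message(conn.getUser());
            probe.setBody(Long.toString(System.currentTimeMillis()));
            conn.sendPacket(probe);
        }
    }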

The chess playing component of ChessCube, which we call Chat internally, uses a protocol called XMPP. Now before your eyes glaze over and you go back to checking your mail, let me explain very simply what XMPP is. If you've ever used Google Talk or Facebook Chat, you've already used XMPP. Put simply, it is a set of rules that lets people connected to a central server chat with one another. For ChessCube, we chose Openfire as our XMPP server because it's Java-based (the language the team predominantly uses), easily extensible thanks to its comprehensive plugin framework, and open source.
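To give a sense of that extensibility, here is a minimal plugin sketch that hooks into Openfire's packet pipeline. The Plugin and PacketInterceptor interfaces are Openfire's own extension points; the class itself is illustrative, not our actual plugin.

    import java.io.File;

    import org.jivesoftware.openfire.container.Plugin;
    import org.jivesoftware.openfire.container.PluginManager;
    import org.jivesoftware.openfire.interceptor.InterceptorManager;
    import org.jivesoftware.openfire.interceptor.PacketInterceptor;
    import org.jivesoftware.openfire.interceptor.PacketRejectedException;
    import org.jivesoftware.openfire.session.Session;
    import org.xmpp.packet.Packet;

    public class ExamplePlugin implements Plugin, PacketInterceptor {

        public void initializePlugin(PluginManager manager, File pluginDirectory) {
            // Hook into the packet pipeline so we see every stanza.
            InterceptorManager.getInstance().addInterceptor(this);
        }

        public void destroyPlugin() {
            InterceptorManager.getInstance().removeInterceptor(this);
        }

        public void interceptPacket(Packet packet, Session session,
                                    boolean incoming, boolean processed)
                throws PacketRejectedException {
            // On the wire, a chat message looks something like:
            //   <message to="bob@chesscube.com" type="chat">
            //     <body>Nf3!</body>
            //   </message>
            if (incoming && !processed) {
                System.out.println("Stanza from " + packet.getFrom());
            }
        }
    }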

Openfire is great for supporting a small chat community, but as soon as you need to scale above 5,000 simultaneously online users it becomes very slow. This is where clustering comes in. Clustering refers to a group of computers working together concurrently to spread load across them, thereby improving performance and, in some architectures, removing a single point of failure. The company that supports Openfire does offer a clustering plugin, but it's prohibitively expensive: charges are per user rather than per server.

In comes our trusty homemade architecture. Since each game of chess in Chat is played in a separate room, we could move these rooms off the main Openfire server. So now we have a main Openfire server that handles everything related to presence and chat, and smaller game servers that handle the games. Multiple game servers all keep the Openfire server informed about the status of the games being played on them, and the Openfire server in turn distributes new games evenly over the game servers.
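Stripped to its essentials, the idea looks something like the sketch below, with simple round-robin standing in for the load-based selection described next. The class and server names are illustrative.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class RoomAssigner {

        private final List<String> gameServers; // e.g. "game1.chesscube.com", ...
        private final AtomicInteger next = new AtomicInteger();

        public RoomAssigner(List<String> gameServers) {
            this.gameServers = gameServers;
        }

        // Picks the game server that will host the given room.
        public String assign(String roomId) {
            String host = gameServers.get(next.getAndIncrement() % gameServers.size());
            // Real code would also record the roomId -> host mapping.
            return host;
        }
    }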

Distributing the load across game servers is handled with Amazon S3. Each game server writes its status to S3, and the Openfire server polls S3 to see which game servers are available and how much load each one is under. The Openfire server can then send clients to whichever server is under the least load. We can also do cool tricks with routing clients to servers that are nearest to them geographically. For example, if two players from Europe want to play a game, we can put them on a server in our German data centre. Lag is minimized and everybody is happy.
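A sketch of what that selection could look like, assuming each game server writes a small, world-readable status object to S3 (the bucket name and the "load=N" format are placeholders, and in practice the requests would be authenticated):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.List;

    public class S3StatusPoller {

        private static final String BUCKET = "chesscube-status"; // placeholder bucket

        // Fetches the "load=N" status object one game server wrote to S3.
        private int loadOf(String serverName) throws Exception {
            URL url = new URL("http://" + BUCKET + ".s3.amazonaws.com/" + serverName);
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            try {
                String line = in.readLine(); // e.g. "load=42"
                return Integer.parseInt(line.substring("load=".length()));
            } finally {
                in.close();
            }
        }

        // Returns the game server reporting the lowest load.
        public String leastLoaded(List<String> servers) throws Exception {
            String best = null;
            int bestLoad = Integer.MAX_VALUE;
            for (String server : servers) {
                int load = loadOf(server);
                if (load < bestLoad) {
                    bestLoad = load;
                    best = server;
                }
            }
            return best;
        }
    }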

We have also created a customized instance image for game servers on Amazon EC2. Under extraordinary load we can bring new game servers online and running games in a matter of minutes.
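For flavour, launching one more game server with Amazon's EC2 command-line tools might look something like this from Java. The AMI id and keypair name are placeholders standing in for the customized image mentioned above.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class LaunchGameServer {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "ec2-run-instances", "ami-00000000", // placeholder image id
                    "-n", "1",                           // launch one instance
                    "-k", "gameserver-key");             // placeholder keypair
            pb.redirectErrorStream(true);
            Process p = pb.start();
            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line); // instance id and state appear here
            }
        }
    }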

This version of Chat goes live on Monday with four game servers, running on Amazon EC2. Hope to see you there.

One Response to “Solving The Scaling Riddle”

  1. Jeff Barr says:

    Looks like a great use of EC2. Drop us a note (awseditor at amazon.com) if you’d be interested in even more publicity.
