Archive for amazon ec2

Smooth As Silk – Well, Almost…

// September 30th, 2008 // amazon ec2, chesscube, deployment, Openfire, programming, system admin


Launching a new version of your site is always stressful. Launching a site that spans multiple servers can be a nightmare. With a sound plan and plenty of practice, however, it can be far less painful than it usually is.

ChessCube Version 3 was launched on Monday, and the release went relatively smoothly considering how extensive the changes were to both the interface and the underlying architecture. In the past we were a little too trigger-happy in releasing new versions and failed to plan and test the releases properly. For this release we set a target of 2 hours of downtime while we upgraded and uploaded software, changed configuration files and deployed the new website files. The changes in version 3, as I have already mentioned, were extensive. The main areas that changed were:

  1. Improved interface and user experience. We wanted this version to appear simpler yet offer more features. Instead of packing thousands of buttons and labels in everywhere, we worked hard to ensure that everything flowed correctly and that there wasn’t information overload. This is a lot more difficult than it sounds, and the best way to go about it is to forget as much of the existing interface as possible and try to come up with new and fresh ideas.
  2. A completely rebuilt and redesigned Paper. Paper is the name we have given to the component that sits alongside the board while playing or viewing a game and tracks the moves and chat between the players. You could also call it the Notation View or the Game History Panel. Due to the way Paper is implemented we had to reorganize popup windows and make sure that it resized and was positioned correctly.
  3. Multiple game server support. Clients now have to manage multiple connections to game servers and work optimally with each of them.
  4. Improved moderation & reporting tools. These changes were implemented to improve the moderators’ abilities to monitor users on the site and ensure that everything is running smoothly.

Other factors that made this release tricky were upgrading the version of Openfire we were using and moving software onto more powerful servers, not to mention coordinating the rollout of our Amazon EC2 infrastructure.

The success of the release process was partly due to a few factors:

Testing exhaustively

I can’t stress this enough. If you have allotted 10% of the project duration in your schedule for testing, double it. Get an independent tester, not a programmer or the boss. The boss can cause stress for the programmers when he finds a bug.

Simulating our live environment

We used to stage all our software on a little machine in our office and access the database on that machine using fancy GUI tools. If you live in South Africa with an average connection to the Internet, this is a mistake: a local staging setup is nothing like an internationally hosted live environment, and the turnaround time for problems on a machine hosted internationally can cause a lot of stress. We commissioned a brand new server for the new software to go live on and set up our entire live architecture using this server, a local database and two EC2 game servers, with a plan to roll out two more on launch day.

Setting up the staging environment a week before going live

This way, all the issues you encounter while setting up the infrastructure and getting it running, and their solutions, are still fresh in your system administrators’ and developers’ minds. There is nothing worse than trying to remember, under the pressure of your system being down, how you solved a problem six months before release. We also planned to convert our staging environment into the live environment at launch. This kept downtime to a minimum: all we had to do was change a few configuration files and point the server away from the staging database to the live database.

Making use of configuration files and logging wherever possible in our code

This probably goes without saying, but with a good set of external configuration files you won’t need to keep uploading new builds of your software just because you forgot to change a server URL.
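As a sketch of what this looks like in practice, assuming a simple Java properties file (the key name and class are hypothetical, not our actual code), an externalized server URL can be read at startup like this:

```java
import java.io.StringReader;
import java.util.Properties;

public class Config {
    // Settings loaded from an external properties source, so a changed
    // server URL only needs a config edit and a restart, not a new build.
    private final Properties props = new Properties();

    public Config(java.io.Reader source) throws java.io.IOException {
        props.load(source);
    }

    public String serverUrl() {
        // Fall back to a sensible default if the key is missing.
        return props.getProperty("server.url", "http://localhost");
    }

    public static void main(String[] args) throws Exception {
        // Simulated external config file; in production this would be
        // read from disk alongside the deployed build.
        Config cfg = new Config(new StringReader("server.url=http://chat.example.com"));
        System.out.println(cfg.serverUrl());
    }
}
```

Swapping the `StringReader` for a `FileReader` pointed at a file outside the build artifact is all it takes to make the URL editable on the server.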

Extensive log statements help you identify problems very quickly should your system stop functioning for whatever reason. Remember: an absence of logs means you are missing a log4j.xml somewhere (I learnt this one the hard way :) ).
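For reference, a minimal log4j.xml along these lines is enough to get logs flowing to the console (the pattern and level are just a starting point, not our production settings):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Send everything to stdout with a timestamp, level and logger name. -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c] %m%n"/>
    </layout>
  </appender>
  <root>
    <priority value="info"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>
```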

Holding on tight

You can’t plan for every eventuality, and you can’t hold off your release until you feel the software is perfect. Sometimes you need to just release and see what happens. Fifteen minutes after releasing we realized we had a few bugs serious enough to warrant emergency fixes, and we had a new version of the client software out 10 minutes later. This can be disruptive to your users, but it’s better to disrupt early and fix the problems than to handle all the bug reports and user complaints later.

Most of these tips seem obvious, but they definitely played a huge role in averting major disasters. Just remember: the hard work only really starts after you have launched, and never deploy on a Friday.

Thanks to my colleagues Dave, Tracy, James and Gideon for creating what I think is the best chess-playing site out there, and thanks to Bryan and his testing team for helping us find and swat all the bugs. Thanks to Margaret for her amazing GUI designs, and thanks to Mark and the investors for giving us the opportunity to work on such a cool project.

Solving The Scaling Riddle

// September 27th, 2008 // amazon ec2, amazon s3, chesscube, Openfire, programming, XMPP

Monday is a very big day for ChessCube. We are launching version 3 of our online chess playing client. Along with a redesigned interface, we have made some significant changes under the hood.

Version 2 of ChessCube was suffering under the immense load created by the increasing popularity of the site. Under this load we were experiencing lost messages, very long login times, client interface crashes and, worst of all, the dreaded lag spikes. Lag is the time a message takes to travel from the Flash client sitting in the browser to the server; a lag spike is when the overall lag jumps for every user on the system. After hunting down the cause of these spikes, we concluded that the system simply could not cope with the sheer volume of messages being sent between clients and the server. Putting our server software on a larger server would, in light of the site’s steady growth, just buy us a few more months. So the decision was taken to cluster the server software.

The chess-playing component of ChessCube, which we call Chat internally, uses a protocol called XMPP. Now before your eyes glaze over and you go back to checking your mail, let me explain very simply what XMPP is. If you’ve ever used Google Talk or Facebook Chat, you have used XMPP. Put simply, it is a set of rules that allows people connected to a central server to chat with one another. For ChessCube we chose Openfire as our XMPP server because it’s Java-based (the language the team predominantly uses), easily extensible thanks to its comprehensive plugin framework, and open source.
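To make that concrete, here is what a single XMPP message to a chat room looks like on the wire; the addresses (JIDs) below are made up for illustration, not ChessCube’s actual ones:

```xml
<!-- An XMPP message stanza sent to a multi-user chat room.
     The from/to JIDs are illustrative only. -->
<message from="player1@chesscube.example/flash"
         to="game42@conference.chesscube.example"
         type="groupchat">
  <body>1. e4 looks strong here!</body>
</message>
```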

Openfire is great for supporting a small chat community, but as soon as you need to scale above 5,000 simultaneously online users it becomes very slow. This is where clustering comes in. Clustering refers to a group of computers working together to spread load across them, thereby improving performance and, in some architectures, removing a single point of failure. The company that supports Openfire does offer a clustering plugin, but it’s prohibitively expensive: charges are on a per-user basis rather than a per-server basis.

In comes our trusty homemade architecture. Since each game of chess in Chat is played in a separate room, we could move these rooms off the main Openfire server. We now have a main Openfire server that handles everything related to presence and chat, and smaller game servers that handle games. Multiple game servers communicate with the Openfire server about the status of the games being played on them, and the Openfire server in turn distributes games evenly across the game servers.

Distributing the load across game servers is handled with Amazon S3. Each game server writes its status to S3, and the Openfire server polls S3 to see which game servers are available and how much load each is under. The Openfire server can then send clients to whichever server is under the least load. We can also do cool tricks such as routing clients to the servers geographically nearest to them. For example, if two players from Europe want to play a game, we can put them on a server in our German data centre. Lag is minimized and everybody is happy.
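A minimal sketch of that selection logic in Java, assuming each status entry read back from S3 carries a host, a connected-user count and a region (the fields, names and load metric are illustrative, not ChessCube’s actual format):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class GameServerPicker {
    // Status a game server might publish to S3; purely illustrative.
    record ServerStatus(String host, int connectedUsers, String region) {}

    // Pick the least-loaded server, preferring ones in the given region.
    static Optional<ServerStatus> pick(List<ServerStatus> statuses, String preferredRegion) {
        return statuses.stream()
                .sorted(Comparator
                        // Servers in the preferred region sort first (false < true)...
                        .comparing((ServerStatus s) -> !s.region().equals(preferredRegion))
                        // ...then by current load, lightest first.
                        .thenComparingInt(ServerStatus::connectedUsers))
                .findFirst();
    }

    public static void main(String[] args) {
        List<ServerStatus> statuses = List.of(
                new ServerStatus("game1.example.com", 120, "us-east"),
                new ServerStatus("game2.example.com", 45, "eu-west"),
                new ServerStatus("game3.example.com", 80, "eu-west"));
        // Two European players: prefer eu-west, then the lightest load.
        System.out.println(pick(statuses, "eu-west").get().host());
    }
}
```

In the real system the list would be refreshed on each poll of S3, but the core decision is just a sort over the published statuses.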

We have also created a customized instance image for game servers on Amazon EC2. Under extraordinary load we are able to bring new game servers online and running games in a matter of minutes.

This version of Chat goes live on Monday with four game servers, running on Amazon EC2. Hope to see you there.