
Smooth As Silk – Well, Almost…

// September 30th, 2008 // amazon ec2, chesscube, deployment, Openfire, programming, system admin

Launching a new version of your site is always stressful. Launching a site that spans multiple servers can be a nightmare. With a sound plan and some practice, however, it can be much less difficult than it usually is.

ChessCube Version 3 was launched on Monday, and the release went relatively smoothly considering how extensive the changes were to both the interface and the underlying architecture. In the past we were a little too trigger-happy in releasing new versions and failed to plan and test the release properly. For this release we had a target of 2 hours of downtime while we upgraded and uploaded software, changed configuration files and deployed the new website files.

The changes in version 3, as I have already mentioned, were extensive. The main areas that changed were:

  1. Improved interface and user experience. We wanted this version to appear simpler yet have more features. Instead of packing in thousands of buttons and labels everywhere, we worked hard to ensure that it all flowed correctly and that there wasn’t information overload. This is a lot more difficult than it sounds, and the best way to go about it is to forget as much as possible of the existing interface and try to come up with new and fresh ideas.
  2. A completely rebuilt and redesigned Paper. Paper is the name we have given to the component that sits alongside the board while playing or viewing a game and tracks the moves and chat between the players. You could also call it the Notation View or the Game History Panel. Due to the way Paper is implemented we had to reorganize popup windows and make sure that it resized and was positioned correctly.
  3. Multiple game server support. Clients now have to manage connections to multiple game servers and work efficiently with each of them.
  4. Improved moderation & reporting tools. These changes were implemented to improve the moderators’ abilities to monitor users on the site and ensure that everything is running smoothly.

Other factors that made this release tricky were upgrading the version of Openfire we were using and moving software onto more powerful servers, not to mention coordinating the rollout of our Amazon EC2 infrastructure.

The success of the release came down to a few factors:

Testing exhaustively

I can’t stress this enough. If you build in 10% of the allotted project duration into your schedule for testing – double it. Get an independent tester – not a programmer or the boss. The boss can cause stress for the programmers when he finds a bug.

Simulating our live environment

We used to stage all our software on a little machine in our office and access the database on that machine using fancy GUI tools. If you live in South Africa and you have an average connection to the Internet, this is a mistake. The turnaround time for problems on a machine hosted internationally can cause a lot of stress. We commissioned a brand new server to go live on for our new software and set up our entire live architecture using this server, a local database and two EC2 game servers, with the plan to roll out two more on launch day.

Setting up the staging environment a week before going live

This way, all the issues you encounter while setting up the infrastructure and getting it running, and their solutions, are still fresh in your system administrators’ and developers’ minds. There is nothing worse than trying to remember, under the pressure of your system being down, how you solved a problem 6 months before the release. We also planned to convert our staging environment into the live environment when the release launched. This meant downtime was kept to a minimum, as all we had to do was change a few configuration files and point the server away from the staging database to the live database.

Making use of configuration files and logging wherever possible in our code

This probably goes without saying, but if you have a good set of external configuration files you won’t need to keep uploading new builds of your software just because you forgot to change a server URL.
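As a rough illustration of the idea (not our actual code), here is a minimal Java sketch that reads settings from a properties file kept outside the build and falls back to built-in defaults. The class name `ExternalConfig`, the key names `server.url` and `database.host`, and the example values are all assumptions made up for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical helper: loads settings from a file that lives outside the build,
// so changing a URL means editing a file rather than uploading a new build.
final class ExternalConfig {
    private ExternalConfig() {}

    // Built-in fallbacks. Key names and values are illustrative assumptions.
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("server.url", "http://localhost:9090");
        d.setProperty("database.host", "staging-db.example.internal");
        return d;
    }

    // Values in the external file override the defaults.
    // A missing file simply leaves the defaults in place.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties(defaults());
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }
}
```

With something like this in place, pointing a server at a different database is a one-line change to the external file (for example `database.host=live-db.example.internal`) followed by a restart, and `ExternalConfig.load(path).getProperty("database.host")` picks up the new value.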

Extensive log statements help to identify problems very quickly should your system stop functioning for whatever reason. Remember: Absence of logs means you are missing a log4j.xml somewhere (I learnt this one the hard way:) )
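One way to avoid learning this the hard way is to check for the logging configuration at startup and fail loudly if it is missing. The sketch below assumes the logging framework reads its configuration from the classpath (as far as I know, log4j 1.x looks for `log4j.xml` there by default). The class `LoggingConfigCheck` is a hypothetical helper, and it uses only the standard library, not log4j itself.

```java
import java.net.URL;

// Hypothetical startup check: verifies that a logging configuration file
// is actually visible on the classpath before the application carries on.
final class LoggingConfigCheck {
    private LoggingConfigCheck() {}

    // Returns the location of the resource on the classpath, or null if it is absent.
    static URL find(String resourceName) {
        return LoggingConfigCheck.class.getResource("/" + resourceName);
    }

    // Fails fast instead of letting the application run silently without logs.
    static URL require(String resourceName) {
        URL url = find(resourceName);
        if (url == null) {
            throw new IllegalStateException(
                "Logging config not found on classpath: " + resourceName);
        }
        return url;
    }
}
```

Calling `LoggingConfigCheck.require("log4j.xml")` as the first thing in your startup code turns an eerily quiet server into an immediate, obvious error.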

Holding on tight

You can’t plan for every eventuality and you can’t hold off your release until you feel the software is perfect. Sometimes you need to just release and see what happens. Fifteen minutes after releasing, we realized we had a few bugs that were serious enough to warrant emergency fixes, and we had a new version of the client software out 10 minutes later. This can be disruptive to your users, but it’s better to disrupt early and fix the problems than to handle all the bug reports and user complaints later.

Most of these tips seem obvious, but they definitely played a huge role in averting major disasters. Just remember: the hard work only really starts after you have launched, and never deploy on a Friday.

Thanks to my colleagues Dave, Tracy, James and Gideon for creating what I think is the best chess playing site out there, and thanks to Bryan and his testing team for helping us find and swat all the bugs. Thanks to Margaret for her amazing GUI designs, and thanks to Mark and the investors for giving us the opportunity to work on such a cool project.