Disaster Recovery on a Shoestring

At Loughborough we're currently gearing up to roll out the Terminal Four Site Manager CMS. We're looking at quite a nice hosting environment, with multiple front-end web server VMs at separate locations and Linux Virtual Server providing load balancing and failover.

This setup is great for handling problems like localised hardware failures and operating system bugs, but what happens in the event of a catastrophic failure such as the fire that destroyed the School of Electronics and Computer Science at Southampton University?  (Ballardian picture above from Dr John Bullas)

I'll blog about our wider institutional emergency planning separately.  For this post let's consider what we could do to maintain a Web presence if circumstances conspire to cut off our Internet connection, or there's a major IT systems failure.  We live in straitened times, so I'll frame this in monetary terms!



Option 1 - Dedicated server with hosting company (~£6,000/year)

Let's take a look at some sample pricing from one of the market leaders, RackSpace, for a dedicated Linux server.  This works out at around £500/month for a fairly basic server (Quad core 2.5GHz Opteron with 2GB RAM and 2 x 250GB mirrored SATA drives, with a monthly bandwidth allocation of 1TB).  So, for some £6,000/year we could have our own DR server in the clouds.  However, our actual bandwidth use is of the order of 3.5TB/month peak outbound, so the bandwidth charges would likely push the bill higher.  I'll also note that those SATA drives might have trouble keeping up with our peak loads of some 2.5m URL requests/day.
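
To make the arithmetic explicit, here's a rough sketch of the sums involved - the excess bandwidth rate below is a placeholder assumption, not a quoted RackSpace price:

    # Rough annual cost sketch for Option 1 (dedicated hosting).
    # The overage rate is a placeholder, not a quoted price - check with the provider.
    MONTHLY_FEE_GBP = 500            # basic dedicated server
    INCLUDED_BANDWIDTH_TB = 1        # monthly allowance in the quoted package
    PEAK_OUTBOUND_TB = 3.5           # our peak outbound transfer per month
    OVERAGE_GBP_PER_TB = 100         # placeholder excess bandwidth rate

    base_annual = MONTHLY_FEE_GBP * 12
    excess_annual = max(0, PEAK_OUTBOUND_TB - INCLUDED_BANDWIDTH_TB) * OVERAGE_GBP_PER_TB * 12
    print("Base server: £%d/year" % base_annual)                          # £6,000
    print("Excess bandwidth (placeholder rate): £%d/year" % excess_annual)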

That's if we tried to replicate our full web presence - which is probably the key question here.

An emergency website could consist of a small subset of the normal presence, and key information has already been identified as part of the University's emergency planning work.  However, I contend that it would be difficult to predict exactly which additional information would be needed in an emergency, and that a minimal version of the institutional site would only be appropriate for a very short period.  We also need to keep in mind that in a disaster scenario the demand for the website is likely to be significantly greater than on a typical day.

Option 2 - Community based hosting deal (£250/year to £6,500/year)

We'll look at JANET Web Hosting here although there are other community based options, notably Eduserv Hosting.  For an annual fee of some £250 per virtual machine the JANET hosting contract with RM provides 5GB of storage on a virtualized RedHat Linux or Windows platform.  There are no bandwidth charges for this service and there is no bandwidth quota at present.

However, each additional 5GB of storage is chargeable at a rate of £200/5GB. This is where the discussion about whether to replicate the full site or just specific content becomes more significant.  It's interesting to note that Loughborough's 160GB of web content would take the bill up to around £6,500/year, comparable with a dedicated server.
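
The £6,500 figure falls out of the tiered storage pricing roughly as follows (a sketch of the arithmetic only):

    import math

    # JANET/RM virtual machine hosting: £250/year includes the first 5GB,
    # with each additional 5GB block charged at £200.
    BASE_FEE_GBP = 250
    INCLUDED_GB = 5
    BLOCK_SIZE_GB = 5
    BLOCK_FEE_GBP = 200

    content_gb = 160   # Loughborough's current web content
    extra_blocks = math.ceil(max(0, content_gb - INCLUDED_GB) / float(BLOCK_SIZE_GB))
    annual_cost = BASE_FEE_GBP + extra_blocks * BLOCK_FEE_GBP
    print("Annual cost for %dGB: £%d" % (content_gb, annual_cost))
    # -> £6,450, i.e. the ~£6,500 quoted above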

Option 3 - Reciprocal arrangement with peer institution (~£3,500/year)

At the recent Institutional Web Managers' Workshop 2010, UCL's Jeremy Speller spoke about emergency communications.  In addition to some trenchant observations about the potential of technology such as Twitter in an emergency, Jeremy noted that there would be some logic in institutions working together on a bilateral basis to host backup servers and services for each other.  There is a degree to which this already happens with infrastructure services such as DNS secondaries and NTP peers for time synchronisation.  Some organizations have already gone further.  For example, we have a long-standing agreement with several other institutions to come to each other's aid in a disaster situation.  The expectation is that this would likely include everything from technical assistance to Internet connectivity and server hosting.

At first sight, this reciprocal option might seem like the most cost-effective route to take - most of the infrastructure is already in place and paid for, after all.  However, if we are talking about a physical server then this will need to be networked, powered and cooled.  It's often observed that these ancillary costs can exceed the price of the server hardware.  For our purposes a suitably spec'd enterprise-class server (e.g. HP DL380) could be bought for around £8,000, with an expected five-year lifespan.  So, let's call the total cost £16,000 over five years, or some £3,500/year (£7,000 if you consider both institutions).  In a bilateral agreement such as this, both parties would ultimately be contributing a four-figure sum, even if this was difficult to quantify due to the vagaries of utility charging, power metering etc.
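
As a sanity check on that amortisation - the assumption doing the work here is that networking, power and cooling roughly match the cost of the hardware over its life:

    HARDWARE_GBP = 8000         # suitably spec'd enterprise server, e.g. HP DL380
    ANCILLARY_GBP = 8000        # assumption: hosting costs roughly match the hardware cost
    LIFESPAN_YEARS = 5

    per_institution = (HARDWARE_GBP + ANCILLARY_GBP) / float(LIFESPAN_YEARS)
    print("Per institution: £%d/year" % per_institution)            # ~£3,200 - call it £3,500
    print("Across both institutions: £%d/year" % (per_institution * 2))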

Of course this figure could be reduced by re-using old hardware or using cheaper hardware, but there would be a concomitant increase in the costs to both institutions of dealing with failures.  Staffing costs associated with a flaky system could easily dwarf facilities costs for server hardware and hosting.

The above notwithstanding, momentum in the industry is very much towards server virtualization.  This potentially offers much lower ongoing costs for hosting, and many institutions have already virtualized large proportions of their server estate.  However, one would expect that dedicated cloud hosting providers would enjoy the best economies of scale and be able to pass these on to their customers.  Let's see how this could work out in practice...  [And don't forget that we are only expecting to run our DR website live for a few days while normal service is restored!]

Option 4 - "Best of breed" cloud hosting (£100 to £5,500)

Cloud hosting tends to come in one of three flavours:
  • Software as a Service, such as Google Apps, Microsoft Live@edu and Salesforce.com - typically a subscription service delivered via a website
  • Platform as a Service, such as Google App Engine or Microsoft Windows Azure - giving you an API to write against for hosting your application in the cloud
  • Infrastructure as a Service, such as Amazon's Elastic Compute Cloud (EC2) or Rackspace Cloud - giving you a virtual machine from a library of operating systems and preconfigured appliances. Typically you are charged on a pay-as-you-go basis for bandwidth and CPU capacity, though you often have the option of pre-paying for anticipated usage at lower rates 
For this blog we'll assume that access to the underlying operating system is required in order to run scripts and manage the more complex aspects of the web server config.  This leads us in turn to Infrastructure as a Service.  We'll use Amazon EC2 as our example for this one.

A "small" EC2 Linux instance specification has enough storage (160GB) to be comparable with our main web server, although it may be resource constrained in other areas - 1.4GB RAM, CPU resource equivalent to a single core Xeon clocked at 1.2GHz.  This costs $0.11/hour while it's active.  It's presently free to upload material to EC2, but outgoing bandwidth at our typical usage rates would be charged for at $0.18/GB.  So, if we had uploaded all 160GB of content to an EC2 instance, and had to run off this site for DR purposes for about a week, our bill would be some $18 for CPU usage and $160 for bandwidth (assuming around a quarter of our monthly 3.5TB data transferred) - or just over £100 at current exchange rates.

Now for comparative purposes let's imagine that we wanted to run our web server off EC2 full time, 24x7x365.  It's possible to pay Amazon upfront for a "reserved instance", which dramatically reduces the cost per CPU hour.  If our peak time bandwidth requirements were maintained, the total for this would be around £5,500/year, so comparable with (if not slightly cheaper than) a more traditional hosting approach.  [All that's missing from the picture here is an academic discount rate ;-]

I'll note in passing that exchange rates could change dramatically, in our favour or against us, and that the true picture for Amazon is a little more complex - e.g. for persistence, storage would likely be done via Elastic Block Store.   The Amazon offering is also particularly interesting given the recent availability of the Amazon Virtual Private Cloud, which allows you to host institutional IP addresses in the cloud via an IPSEC tunnel between Amazon and your organization.

We'll be aiming to trial some of these options in the near future, so watch this space for further developments...

3 comments:

  1. Hi Martin - great post. Just thinking about your £100 disaster recovery solution though. Surely you would need to maintain the 160GB of data within the cloud, as it would not be reasonable to assume that you could quickly (or in fact even at all) upload the website data to Amazon. This would increase the cost of this disaster recovery solution by the amount it would cost to store 160GB permanently (and presumably bandwidth to keep it updated and a brief server instance - e.g. for 2-3 hours per day or week - to run the rsync or whatever other sync tool you'd use).
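
    Something along these lines, say - the hostname, key and paths here are just placeholders:

        import subprocess

        # Placeholder sketch of a periodic sync job - hostname, key and paths are made up.
        EC2_HOST = "ec2-user@dr-instance.example.com"
        SSH_KEY = "/home/webadmin/.ssh/dr-sync-key.pem"

        subprocess.check_call([
            "rsync", "-az", "--delete",
            "-e", "ssh -i " + SSH_KEY,
            "/var/www/",                      # local document root
            EC2_HOST + ":/var/www/",          # mirror kept on the EC2/EBS volume
        ])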

  2. That's quite right - I intentionally oversimplified the picture, and the reality would be that you would need Elastic Block Storage (EBS) in addition to an EC2 instance. The blurb from Amazon says... "Volume storage is charged by the amount you allocate until you release it, and is priced at a rate of $0.10 per allocated GB per month. Amazon EBS also charges $0.10 per 1 million I/O requests you make to your volume" - so $16/month to keep that 160GB stored, plus transaction-related costs. More at http://aws.amazon.com/ebs/

    The good news is that I've analyzed the situation a bit more and we only need to maintain 100GB of content :-) The remainder was archived material, old log files etc.

    However, my suspicion is that Elastic Load Balancing will also be necessary, as there are some dire warnings in the EC2 bumpf about not relying on persistence of EC2-hosted VMs' IP addresses. Costings for ELB are detailed here: http://aws.amazon.com/elasticloadbalancing/

    Haven't had a chance to make as much headway on this as I would have liked, as it has taken a while to get my University purchasing card sorted out - but that's done now. Yes, it's all linked back to your credit card number :-)

    On the JANET Web Hosting front I believe this is only going to be of very limited use to us, as there is no access to the underlying operating system and the Web control panel for the virtual host would not let us replicate our main Web server config. Bit of a shame, that, but my feeling is that it could still come in handy in a true disaster that knocked out both our primary and secondary sites - or if our fibre links back to the MAN were severed in a way that would take a significant period to repair.

    I'll put together another blog post once I have had a chance to experiment with the EC2 stuff...

  3. Hi, that's right - I'm oversimplifying a little.

    For a functional setup I think Elastic Block Storage and Elastic Load Balancer would be required.

    http://aws.amazon.com/ebs/:
    "Volume storage is charged by the amount you allocate until you release it, and is priced at a rate of $0.10 per allocated GB per month Amazon EBS also charges $0.10 per 1 million I/O requests you make to your volume"

    http://aws.amazon.com/ec2/faqs/:
    "By default, every instance comes with a private IP address and an internet routable public IP address. These addresses are fixed for the life of the instance. These IP addresses should be adequate for many applications where you do not need a long lived internet routable end point. Compute clusters, web crawling, and backend services are all examples of applications that typically do not require Elastic IP addresses."

    "In order to help ensure our customers are efficiently using the Elastic IP addresses, we impose the $.01/hr charge for each address when it is not mapped to an instance."

    I'll come back to this in a subsequent post!
