This weekend was a bit crazy for some AWS EC2 users. EC2’s “management software erroneously terminate[d] a small number of user’s instances” (from the AWS forum post). Some of our instances were among them, which gave us an opportunity to test the fail-safe mechanisms in WeoCEO. We received the following email:

From: Amazon Web Services
Sent: Saturday, September 29, 2007 5:46 PM
To: David Kohler
Subject: Amazon EC2 Notification of Terminated Instances

Hello,

This is just a quick note to let you know that some of your instances were erroneously terminated today. We have resolved the underlying issue, and the service is fully available.

You can find a summary of the issue here:
http://developer.amazonwebservices.com/connect/thread.jspa?messageID=68169&#68169

These are your affected instances:
i-8004e0e9
i-681ef101

We apologize for this inconvenience.

Sincerely,

The Amazon EC2 Team

If we had not prepared for this by building WeoCEO, this could have been a real issue for us. We would have needed to scramble staff at 6 AM on a Saturday morning. Fortunately, WeoCEO recovered from the failure on its own, and it was not until Monday afternoon that we noticed it had happened to a lot of other people.

From the forum post by WeoCEO’s architect, Bob Banfield:

Here is a quick shot from our WeoCEO logs. We told WeoCEO that regardless of usage we want a minimum of two instances running, so that is the initial number of instances at 6am, even though we are receiving next to no traffic. At 6:09, i-681ef101 stops responding (the first of five allowed consecutive failures). At 6:10 it still hasn’t responded, and at 6:11 both it and instance i-52907e3b have now stopped responding. Instance i-52907e3b comes back up in another 2 minutes, but instance i-681ef101 is ruled dead after 5 failures. It is automatically terminated and a new one is brought up in its place.

(SSS) Sat Sep 29 06:07:24 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(SSS) Sat Sep 29 06:08:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:09:25 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (1/5)
(SSS) Sat Sep 29 06:09:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:10:25 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (2/5)
(SSS) Sat Sep 29 06:10:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (3/5)
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: Instance i-52907e3b has not reported statistics (1/5)
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: No instances have reported statistics.
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (4/5)
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: Instance i-52907e3b has not reported statistics (2/5)
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: No instances have reported statistics.
(EEE) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (5/5)
(EEE) Sat Sep 29 06:13:26 2007 Weoceo[11310]: Terminating i-681ef101 due to lack of statistics
(SSS) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(III) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Launching 1 instance(s)
(III) Sat Sep 29 06:13:26 2007 Weoceo[11310]: Terminating 1 instance
(SSS) Sat Sep 29 06:14:28 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:15:28 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:16:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:17:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:18:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(III) Sat Sep 29 06:19:05 2007 Weoceo[11351]: Added ID=i-94ce20fd, PublicHost=ec2-67-202-13-222.z-1.compute-1.amazonaws.com, Host=domU-12-31-36-00-1D-B4.z-1.compute-1.internal, PublicIP=67.202.13.222, IP=10.253.34.66
(SSS) Sat Sep 29 06:19:32 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(SSS) Sat Sep 29 06:20:32 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2

Email warnings were delivered to me at 6am on Saturday alerting me to the problem; however, I was fast asleep, and WeoCEO correctly identified and corrected the problem.
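
For readers who want to see the shape of the recovery logic described in the forum post, here is a minimal sketch in Python. It is an illustration only, not WeoCEO's actual code: the `Pool` class, its method names, and the `pending` counter are all hypothetical. The only details taken from the log are the one-minute polling, the limit of five consecutive missed reports, and the minimum of two instances.

```python
from dataclasses import dataclass, field

MAX_MISSES = 5      # consecutive missed reports before an instance is ruled dead (from the log)
MIN_INSTANCES = 2   # minimum pool size regardless of usage (from the post)


@dataclass
class Pool:
    """Hypothetical monitor: tracks consecutive missed statistics reports per instance."""
    misses: dict = field(default_factory=dict)  # instance ID -> consecutive misses
    pending: int = 0                            # launches requested but not yet added

    def add(self, instance_id):
        """Register an instance that has come up (a requested launch, if any, is now done)."""
        self.misses[instance_id] = 0
        self.pending = max(0, self.pending - 1)

    def poll(self, reported):
        """Process one polling interval.

        `reported` is the set of instance IDs that sent statistics this interval.
        Returns (terminated_ids, number_of_instances_to_launch).
        """
        terminated = []
        for instance_id in list(self.misses):
            if instance_id in reported:
                self.misses[instance_id] = 0      # any report resets the count
                continue
            self.misses[instance_id] += 1
            if self.misses[instance_id] >= MAX_MISSES:
                terminated.append(instance_id)    # ruled dead
                del self.misses[instance_id]
        # Top the pool back up to the minimum, counting launches already in flight.
        to_launch = max(0, MIN_INSTANCES - len(self.misses) - self.pending)
        self.pending += to_launch
        return terminated, to_launch


def replay_saturday_morning():
    """Replay the 06:09 to 06:14 intervals from the log above against the sketch."""
    pool = Pool()
    pool.add("i-681ef101")
    pool.add("i-52907e3b")
    intervals = [
        {"i-52907e3b"},   # 06:09  i-681ef101 misses (1/5)
        {"i-52907e3b"},   # 06:10  (2/5)
        set(),            # 06:11  both miss
        set(),            # 06:12  both miss
        {"i-52907e3b"},   # 06:13  i-52907e3b is back, i-681ef101 hits (5/5)
        {"i-52907e3b"},   # 06:14  replacement still starting
    ]
    return pool, [pool.poll(reported) for reported in intervals]
```

Replaying those intervals, the sketch terminates i-681ef101 on the fifth miss and requests exactly one replacement. It does not request a second one while the first is still starting, which matches the single "Launching 1 instance(s)" line in the log.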

We believe in the future of scalable utility computing. Dealing with events such as these is just one of the issues with these types of systems that we’ll all have to overcome to make this future work. Our goal is to share what we are creating for WeoGeo in a way that helps others overcome such problems.

I do not wish to minimize the impact of this API outage, but it would be unrealistic to assume that this type of event will not happen in the future. We should all consider this in building our virtual computing architectures. The use of AWS means that you are outsourcing your metal infrastructure. This means that your system design must be organic and self-healing (see also slideshare link).

WeoCEO was built to help us at WeoGeo survive these types of outages. We are completing our private beta shortly and will release the latest version of WeoCEO into open beta. Contact us at WeoCEO [at] WeoGeo [dot] com if you would like to participate. The open beta will provide stable IP addressing and recovery options for one instance for free.

Our solution is simple to use and operate, but it does expect that you have some working knowledge of EC2. There are others who can help in building these types of architectures on AWS from the ground up (some of whom contributed to the AWS Forum thread above, including Thorsten at RightScale and Reuven at Enomaly).

Please be aware of the limitations of utility computing, as well as the promise. Planning for these outages will be a requirement for safely outsourcing your metal resources.
