Archive for the ‘grid computing’ Category

The Internet Is Not the Computer – Yet

Tuesday, October 6th, 2009

Has an IT professional in your organization told you that Software-as-a-Service (SaaS) just does not work for your enterprise? I am going to make their argument for them; then I’ll tear it down.

The biggest problem with SaaS for a professional engineering organization is the speed of access to data. That speed has two parts: latency (the delay before data begins to arrive) and throughput (the data transfer rate, or bit rate). These are the network counterparts of disk access time and disk transfer rate. Rapid communication between processing units and data storage pools is critical for today’s professional computing efforts. If you don’t believe me, try opening (and editing) a large GIS or CAD file over a VPN, from a network storage device in a satellite office across the country.

For example, our realized transfer rate from our facility in Portland to Amazon Web Services (AWS) in Virginia is 3.5 megabits per second (Mb/s), even though we have a 100 Mb/s dedicated service pipe to our facility. This slow rate is not Amazon’s fault. Rather, it is a consequence of latency, which is a function of the speed of light and the number of Internet “hops” (network transfer points) between WeoGeo and AWS. From Florida, we average about 7 Mb/s to the same facility. This issue is one of the fundamental reasons AWS released its Import/Export feature for S3.
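
Distance can cap throughput even on a fat pipe. A single TCP stream can have at most one receive window of unacknowledged data in flight per round trip, so its throughput is bounded by window size divided by round-trip time (RTT). The sketch below illustrates that bound. It assumes the 65,535-byte maximum window available without TCP window scaling and two hypothetical RTTs; these are illustrative values, not measurements of the Portland or Florida links.

```python
# Illustrative sketch: one TCP stream can have at most one receive window of
# unacknowledged data in flight per round trip, so its throughput is bounded
# by window_size / round_trip_time, no matter how large the pipe is.
# The window size and RTTs used below are assumptions, not measurements.


def tcp_throughput_ceiling_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one TCP stream's throughput, in megabits per second."""
    bits_in_flight = window_bytes * 8
    rtt_seconds = rtt_ms / 1000.0
    return bits_in_flight / rtt_seconds / 1_000_000


if __name__ == "__main__":
    window = 65_535  # largest window without TCP window scaling
    for rtt in (40.0, 80.0):  # hypothetical cross-country round trips
        ceiling = tcp_throughput_ceiling_mbps(window, rtt)
        print(f"RTT {rtt:.0f} ms -> at most {ceiling:.1f} Mb/s")
```

Doubling the RTT halves the ceiling, whatever the size of the pipe. This is consistent with, though not proof of, the gap between the Portland and Florida figures. Larger windows or parallel streams raise the ceiling.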

To put this into perspective, a typical 7200 rpm desktop hard drive (circa 2008) transfers data at 560 – 2400 Mb/s (70 – 300 megabytes per second). The upper figure is the 3 Gb/s SATA interface limit rather than a sustained platter rate. That is 2 – 3 orders of magnitude faster than accessing the same file via the Internet. The last time desktop computer users “suffered” through 5 Mb/s disk transfer speeds was in 1987, when IBM released the PS/2 with hard drives built on Seagate Technology’s ST-506 interface.
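
The unit conversion and the “2 – 3 orders of magnitude” claim can be checked directly from the figures in the paragraph above. The check assumes 8 bits per byte and decimal megabytes:

```python
import math

BITS_PER_BYTE = 8


def megabytes_to_megabits(mb_per_s: float) -> float:
    """Convert a rate in megabytes per second to megabits per second."""
    return mb_per_s * BITS_PER_BYTE


def orders_of_magnitude(fast: float, slow: float) -> float:
    """log10 of the speed ratio between two rates in the same units."""
    return math.log10(fast / slow)


disk_low = megabytes_to_megabits(70)    # 560 Mb/s
disk_high = megabytes_to_megabits(300)  # 2400 Mb/s
internet = 3.5                          # realized Portland -> Virginia rate, Mb/s

print(disk_low, disk_high)
print(round(orders_of_magnitude(disk_low, internet), 2))   # 2.2
print(round(orders_of_magnitude(disk_high, internet), 2))  # 2.84
```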

When will latency be reduced to the point where we might consider that “the Internet is the computer”? That is hard to say. Even with ubiquitous broadband access at more than 100 Mb/s for all businesses, we will still suffer from the “speed of light” problem: photons can only go so fast. In addition, every time data passes through an Internet junction or telecommunication switch (i.e., a hop), packet transfer times increase. A report from the National Broadband Coalition (also covered here) suggests that pipe speeds to small and medium businesses will not approach those of hard disk drives until sometime between 2015 and 2020 (see table below).
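
The speed-of-light floor is easy to estimate. This sketch rests on three assumptions: a great-circle distance of about 3,750 km between Portland, Oregon, and northern Virginia; light travelling through fiber at roughly two-thirds of its vacuum speed; and a hypothetical fixed delay per hop. Real fiber routes are longer than the great-circle path, so actual round trips are higher than this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # in vacuum
FIBER_VELOCITY_FACTOR = 2 / 3      # assumption: light in glass travels at ~2/3 c


def min_rtt_ms(distance_km: float, hops: int = 0, per_hop_ms: float = 0.0) -> float:
    """Round-trip floor: propagation both ways plus a fixed delay per hop each way."""
    one_way_propagation_ms = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_VELOCITY_FACTOR) * 1000
    return 2 * (one_way_propagation_ms + hops * per_hop_ms)


PORTLAND_TO_VIRGINIA_KM = 3_750  # assumed great-circle distance

print(f"{min_rtt_ms(PORTLAND_TO_VIRGINIA_KM):.1f} ms")  # propagation only
print(f"{min_rtt_ms(PORTLAND_TO_VIRGINIA_KM, hops=15, per_hop_ms=0.5):.1f} ms")  # hypothetical hops
```

No increase in pipe speed removes the propagation term. Fewer or faster hops can shrink only the second term.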

This suggests that the Internet as the computer, replacing your desktop, is still some time away, maybe as long as 3 – 5 years. This latency argument is what many in corporate IT departments would use to strike against SaaS, PaaS, or IaaS within your organization. (I am purposely ignoring security, but will address this issue in another post.) I would counter that those arguments are very similar to the ones once used against the IBM PS/2 as a corporate workhorse back in the day. Ultimately, I believe they are rooted more in bureaucratic inertia than in a true cost/benefit analysis for the enterprise.

Internet-based services have a place in today’s enterprise environment. Depending on the use case, they can be a more efficient way to manage your limited IT dollars. These services can also provide greater and timelier software support, which, together with the lower costs, increases the bottom-line productivity of your organization. More importantly, these managed, off-site services will have a greater place in your organization tomorrow. Data transfer speeds will increase to the point where latency issues are negligible for the services you require. As a decision maker in your organization, are you planning for greater productivity and enhanced profits tomorrow, or adding to the “mainframe” infrastructure of today?

Scaling FME Engines on WeoGeo

Friday, June 19th, 2009

I presented the movie below as part of a presentation at the Safe Software FME User Conference. We had a great time and the Safe crew put on a marvelous show.

The movie shows WeoGeo scaling up to 64 Safe Software distributed FME Engines in the production of tile caches from a world-wide elevation database. The FME Workspace script was created by Dmitri Bagh, and processed on WeoGeo’s FME Constellation built on Amazon Web Services.

The scaling occurred automatically, spinning up FME Engine AMIs, and then shutting them down when the job queue was completed. This is one of our first examples of bringing scalable processing to difficult geospatial tasks.

Examples of the tiles created by Dmitri’s script for Virtual Earth (Bing Maps for Enterprise) and Google Earth can be found here.

Panel 1 (upper left hand corner) refers to the total number of engines in the constellation processing job.

Panel 2 (upper right hand corner) shows the total constellation utilization percentage. The constellation is polled, and when utilization exceeds the preset threshold (50% in this example), the number of engines is increased (doubled here) until it reaches the preset maximum (64 here). The downward spikes occur each time a new set of engines is added.
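
The scaling rule described for Panel 2 can be sketched as a small policy object. This is not WeoGeo’s implementation, and the class and method names are hypothetical. The starting size of one engine is an assumption. The rule to shut everything down once the queue is empty follows the description of the movie above.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical doubling policy modeled on the Panel 2 description."""
    threshold: float = 0.50  # utilization that triggers a scale-up (50% in the movie)
    max_engines: int = 64    # preset ceiling (64 in the movie)

    def next_engine_count(self, engines: int, utilization: float, pending_jobs: int) -> int:
        """Engine count to run after one polling interval.

        pending_jobs counts jobs that are queued or still running.
        """
        if pending_jobs == 0:
            return 0  # queue drained: shut all engines down
        if engines == 0:
            return 1  # assumption: start with a single engine
        if utilization > self.threshold:
            return min(engines * 2, self.max_engines)  # double, but never past the cap
        return engines


policy = ScalingPolicy()
engines, history = 1, []
for _ in range(8):  # a saturated constellation, polled eight times
    engines = policy.next_engine_count(engines, utilization=1.0, pending_jobs=2000)
    history.append(engines)
print(history)  # [2, 4, 8, 16, 32, 64, 64, 64]
```

Doubling reaches the ceiling in a handful of polling intervals. Each doubling briefly halves measured utilization, which matches the downward spikes in Panel 2.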

Panel 3 (lower left hand corner) shows the average job processing time. The average processing time per job increases once the number of engines exceeds 16. This may be due to increased overhead on the FME Core or to limited bandwidth to the database.

Panel 4 (lower right hand corner) shows the total number of jobs completed. A total of 2000 jobs were submitted for this test. The job completion rate accelerates until the maximum number of engines is brought online.

Amazon Web Services EC2 Outage

Monday, October 1st, 2007

This weekend was a bit crazy for some AWS EC2 users. EC2’s “management software erroneously terminate[d] a small number of user’s instances” (from the AWS forum post). Some of our instances were among them, which gave us an opportunity to test the fail-safe mechanisms in WeoCEO. We received the following email:

From: Amazon Web Services
Sent: Saturday, September 29, 2007 5:46 PM
To: David Kohler
Subject: Amazon EC2 Notification of Terminated Instances

Hello,

This is just a quick note to let you know that some of your instances were erroneously terminated today. We have resolved the underlying issue, and the service is fully available.

You can find a summary of the issue here:

http://developer.amazonwebservices.com/connect/thread.jspa?messageID=6816

These are your affected instances:
i-8004e0e9
i-681ef101

We apologize for this inconvenience.

Sincerely,

The Amazon EC2 Team

Please be aware of the limitations of utility computing, as well as its promise. Planning for these outages will be a requirement for safely outsourcing your metal resources.

If we had not prepared for this by building WeoCEO, this could have been a real issue for us: we would have needed to scramble staff at 6 AM on a Saturday morning. Fortunately, WeoCEO recovered from the failure, and it was not until Monday afternoon that we noticed it had happened to a lot of other people.

From a forum post by WeoCEO’s architect, Bob Banfield:

Here is a quick shot from our WeoCEO logs. We told WeoCEO that regardless of usage we want a minimum of two instances running, so that is the initial number of instances at 6am in the morning, even though we are receiving next to no traffic. At 6:09, i-681ef101 stops responding (the first of five allowed consecutive failures). At 6:10 it still hasn’t responded, and at 6:11 both it and instance i-52907e3b have now stopped responding. Instance i-52907e3b comes back up in another 2 minutes, but instance i-681ef101 is ruled dead after 5 failures. It is automatically terminated and a new one is brought up in its place.

(SSS) Sat Sep 29 06:07:24 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(SSS) Sat Sep 29 06:08:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:09:25 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (1/5)
(SSS) Sat Sep 29 06:09:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:10:25 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (2/5)
(SSS) Sat Sep 29 06:10:25 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (3/5)
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: Instance i-52907e3b has not reported statistics (1/5)
(EEE) Sat Sep 29 06:11:26 2007 Weoceo[6562]: No instances have reported statistics.
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (4/5)
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: Instance i-52907e3b has not reported statistics (2/5)
(EEE) Sat Sep 29 06:12:26 2007 Weoceo[6562]: No instances have reported statistics.
(EEE) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Instance i-681ef101 has not reported statistics (5/5)
(EEE) Sat Sep 29 06:13:26 2007 Weoceo[11310]: Terminating i-681ef101 due to lack of statistics
(SSS) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(III) Sat Sep 29 06:13:26 2007 Weoceo[6562]: Launching 1 instance(s)
(III) Sat Sep 29 06:13:26 2007 Weoceo[11310]: Terminating 1 instance
(SSS) Sat Sep 29 06:14:28 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:15:28 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:16:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:17:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(SSS) Sat Sep 29 06:18:29 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 1
(III) Sat Sep 29 06:19:05 2007 Weoceo[11351]: Added ID=i-94ce20fd, PublicHost=ec2-67-202-13-222.z-1.compute-1.amazonaws.com, Host=domU-12-31-36-00-1D-B4.z-1.compute-1.internal, PublicIP=67.202.13.222, IP=10.253.34.66
(SSS) Sat Sep 29 06:19:32 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
(SSS) Sat Sep 29 06:20:32 2007 Weoceo[6562]: Overall usage = 0% NumInstances = 2
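
The recovery logic Bob describes has two parts: a floor of two instances, and a rule that an instance is dead after five consecutive missed statistics reports. It can be sketched as below. This is a hypothetical reconstruction for illustration, not WeoCEO’s code. It replays only the reporting pattern shown in the log from 06:09 to 06:13.

```python
from dataclasses import dataclass, field

MAX_MISSES = 5     # consecutive missed reports before an instance is ruled dead
MIN_INSTANCES = 2  # floor kept regardless of usage


@dataclass
class HealthMonitor:
    """Hypothetical sketch of WeoCEO-style failure detection, not its real code."""
    misses: dict[str, int] = field(default_factory=dict)  # instance id -> consecutive misses

    def poll(self, instances: list[str], reported: set[str]) -> tuple[list[str], int]:
        """Return (ids to terminate, number of replacements to launch)."""
        dead = []
        for iid in instances:
            if iid in reported:
                self.misses[iid] = 0  # any report resets the counter
            else:
                self.misses[iid] = self.misses.get(iid, 0) + 1
                if self.misses[iid] >= MAX_MISSES:
                    dead.append(iid)
        for iid in dead:
            del self.misses[iid]
        survivors = len(instances) - len(dead)
        return dead, max(0, MIN_INSTANCES - survivors)


# Which instances reported statistics at each poll, per the log above.
polls = [
    {"i-52907e3b"},  # 06:09
    {"i-52907e3b"},  # 06:10
    set(),           # 06:11
    set(),           # 06:12
    {"i-52907e3b"},  # 06:13: i-52907e3b is back, i-681ef101 reaches 5/5
]

monitor = HealthMonitor()
fleet = ["i-681ef101", "i-52907e3b"]
events = []
for reported in polls:
    dead, to_launch = monitor.poll(fleet, reported)
    fleet = [iid for iid in fleet if iid not in dead]
    if dead or to_launch:
        events.append((dead, to_launch))
print(events)  # [(['i-681ef101'], 1)]
```

Resetting the counter on any report is what lets i-52907e3b survive two missed polls. Only an unbroken run of five misses triggers termination and a replacement launch.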

Email warnings alerting me to the problem were delivered at 6 AM on Saturday. However, I was fast asleep, and WeoCEO identified and corrected the problem on its own.

We believe in the future of scalable utility computing. Events such as these are part of what we will all have to overcome to make that future work. Our goal is to share what we are creating for WeoGeo in a way that helps others overcome such problems.

I do not wish to minimize the impact of this AWS outage, but it would be unrealistic to assume that this type of event will not happen in the future. We should all consider this in building our virtual computing architectures. The use of AWS means that you are outsourcing your metal infrastructure. This means that your system design must be organic and self-healing (see also slideshare link).

Our solution is simple to use and operate, but it does assume some working knowledge of EC2. There are others who can help in building these types of architectures on AWS from the ground up. Some of them contributed to the AWS forum thread above, including Thorsten at RightScale and Reuven at Enomaly.

WeoCEO was built to help us at WeoGeo survive these types of outages. We will shortly complete our private beta and bring the latest version of WeoCEO into open beta. Contact us at WeoCEO [at] WeoGeo [dot] com if you would like to participate. The open beta will provide stable IP addressing and recovery options for one instance free of charge.

Image Processing and Delivery Using Virtual Computing on EC2

Thursday, September 6th, 2007

I posted last week about bandwidth issues associated with geospatial data and our AWS S3 solution. The deciding factor in our use of Amazon’s offerings was not necessarily the edge distribution capabilities of S3. It was the synergy from combining S3 data storage and distribution with the virtual computing capabilities of EC2. Many image-processing tasks require a great deal of memory and CPU horsepower. In both Market and Server, we offer the following basic map distribution options to our map providers:

Geo Clipping (6 zoom levels, allowing for ~125 million possible selections per data set)
Spatial Resampling (4 levels)
Layer Resampling (depends on data)
Output File types (5 – JPEG, GeoTIFF, ENVI, ESRI BIL, ERDAS IMG)
Projections (5 – UTM, Transverse Mercator, Lambert Conformal Conic, Albers Equal Area, Geographic)
Datums (3 – WGS84, NAD 83, NAD 27)
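
Treating the options above as independent choices gives a rough lower bound on the number of distinct products per data set. Layer resampling is left out because it depends on the data, and some combinations may not be valid in practice:

```python
from math import prod

# Option counts from the list above; layer resampling is data-dependent and omitted.
OPTIONS = {
    "geo_clip_selections": 125_000_000,  # "~125 million possible selections"
    "spatial_resampling": 4,
    "output_file_types": 5,
    "projections": 5,
    "datums": 3,
}

variants = prod(OPTIONS.values())
print(f"{variants:,}")  # 37,500,000,000
```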

These options result in billions of possible map variants, which rules out storing each variant for distribution. So processing power for conversion is critical. That processing power also needs to be connected to a large, web-addressable, temporary data storage array that holds the unique variant a map user has selected. For a true mapping marketplace, this infrastructure needs to support 100s to possibly 1000s of simultaneous map requests from the same base map, such as the 40 GB image in Figure 1. Doing our NeoMapping Market correctly requires enormous processing, storage, and bandwidth infrastructure.

Figure 1. 40 GB, 156-layer HyperSpectral Imagery (HSI) map listed on WeoGeo Market.

However, who could afford that infrastructure up front? Our original estimate for acquiring base computation capacity and placing it in a co-location facility was around $500K. While not a lot of money on the scale of today’s Internet operations, it was big for us. In addition, we were developing the software architecture to support the Market and Server, and those expenses were large in and of themselves. AWS provided a single, simultaneous answer to many of our immediate storage, processing, and distribution needs.

Developing our infrastructure on the scalable AWS solution allows us to say that we can support the 1000s of map requests required for a functioning digital marketplace. The user experience is vital to the service’s credibility and therefore to our success. However, there is a real (and in a number of cases unexpectedly high) cost to this decision: we traded high capital expenditures for high operating expenditures. In an upcoming post, I’ll talk about the Total Cost of Operations (TCO) on AWS, and some of the ways we are reducing these high operating expenses through stability and scaling solutions. We have turned some of these solutions into products that we provide to others (e.g., WeoCEO).
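
The capital-versus-operating trade-off can be framed as a break-even calculation. In the sketch below, only the $500K up-front estimate comes from this post. The monthly co-location and AWS figures are hypothetical placeholders. The model ignores hardware refresh, staff time, and the value of elastic scaling.

```python
from typing import Optional


def breakeven_months(capex: float, owned_monthly: float, cloud_monthly: float) -> Optional[float]:
    """Months after which buying hardware up front becomes cheaper than renting.

    Owning costs capex + owned_monthly * m; renting costs cloud_monthly * m.
    Returns None when renting is never more expensive per month.
    """
    monthly_premium = cloud_monthly - owned_monthly
    if monthly_premium <= 0:
        return None
    return capex / monthly_premium


CAPEX = 500_000  # the co-location estimate from this post
# Hypothetical monthly figures, for illustration only:
print(breakeven_months(CAPEX, owned_monthly=10_000, cloud_monthly=30_000))  # 25.0
```

Until the break-even point, renting cycles costs less in total and leaves the up-front cash available for software development. Past it, the operating premium keeps accumulating, which is why reducing operating expenses matters.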

I would be interested in hearing about the actual experience of others on AWS and whether S3 and EC2 could or could not meet their needs.

40 GB Imagery File Redux

Wednesday, August 29th, 2007

An obvious question that drops out of yesterday’s post on the right file format to use to distribute large raster files is, “How do you distribute a 40 GB file?” The distribution of a single 40 GB file would overwhelm the bandwidth of many small businesses. That was one of the reasons we originally developed the WeoGeo Server.
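
A back-of-the-envelope estimate shows why. The rates below (3.5, 7, and 100 Mb/s) are borrowed from the October 2009 post on this page as illustrative values. The model assumes the link is shared evenly among concurrent downloads and ignores protocol overhead.

```python
def transfer_hours(size_gb: float, rate_mbps: float, concurrent: int = 1) -> float:
    """Hours to move one file when `concurrent` downloads share the link evenly."""
    bits = size_gb * 1e9 * 8  # decimal gigabytes -> bits
    per_download_bps = rate_mbps * 1e6 / concurrent
    return bits / per_download_bps / 3600


for rate in (3.5, 7.0, 100.0):
    print(f"40 GB at {rate:>5} Mb/s: {transfer_hours(40, rate):5.1f} h")
print(f"10 simultaneous 40 GB orders at 100 Mb/s: {transfer_hours(40, 100, concurrent=10):.1f} h each")
```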

Figure 1. WeoGeo Server

The Server allows the mapping organization to distribute customized, customer-defined products that reduce the required file size, and thus the bandwidth needed to satisfy customer demand. However, there is still the use case where the customer wants the whole file.

Since FERI is a small business, we couldn’t have our daily research activities impacted by an imagery request. So the first (obvious) step was to develop a customization and distribution system that processes a data request in an asynchronous manner, i.e. the order is taken during business hours, but it is processed and delivered after business hours. This allowed us to optimize our bandwidth in our labs and still reasonably satisfy customer demands (assuming they did not need instantaneous data delivery). We also tweaked the system to allow some small files and all of our own requests to be processed immediately, while larger ones for external users were processed in the evenings.
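
The scheduling rule described above might look like the sketch below: our own requests and small files are processed immediately, while large external orders wait until after business hours. The business hours and the 500 MB “small file” cutoff are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time

BUSINESS_START = time(8, 0)       # hypothetical business hours
BUSINESS_END = time(18, 0)
SMALL_FILE_BYTES = 500 * 1024**2  # hypothetical "small file" cutoff


@dataclass
class Order:
    size_bytes: int
    internal: bool  # True for our own requests


def process_now(order: Order, now: datetime) -> bool:
    """Decide whether to process an order immediately or defer it."""
    if order.internal or order.size_bytes <= SMALL_FILE_BYTES:
        return True  # own requests and small files go straight through
    in_business_hours = BUSINESS_START <= now.time() < BUSINESS_END
    return not in_business_hours  # large external orders wait for the evening


big_external = Order(size_bytes=40 * 10**9, internal=False)
print(process_now(big_external, datetime(2007, 8, 29, 14, 0)))  # False: deferred
print(process_now(big_external, datetime(2007, 8, 29, 20, 0)))  # True: after hours
```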

The asynchronous data delivery is also a fundamental difference between our technology and online GIS servers. We optimized for discovery, customization, and ordering in a way that allows the customer to receive near-instant gratification on the discovery and ordering, while (possibly) delaying gratification on the delivery.

While the customization of product selection and the asynchronous processing and delivery bought us some additional help in terms of distributing large geospatial content files, it still did not help us with the problem of what to do with multiple requests for 40 GB image files. This is where some of my earlier posts, where I described our use of Amazon Web Services, begin to make some sense (and maybe why Jinesh digs what we are doing).

However, I am late for dinner, so I’ll pick up this theme on a later post…

Commodity Computing Cycles (C3) and ETech

Saturday, March 24th, 2007

I was preparing for ETech and ran across Jeff Barr’s recent post on the AWS blog. He points to a number of interesting links, including WeoCEO’s new website (thanks Jeff!).

One of the links he points to is David Berlind’s video, “Is it time to throw away your servers?” It was a highly entertaining video, but more importantly it clearly laid out the business case for why cluster and grid computing is going to revolutionize this business. We must be channeling the same psychic hotline, because it mirrors the case I laid out in the Cycles in the Sky post earlier this week. (However, David’s is far more entertaining, with real numbers.)

Commodity Computing Cycles (C3) is a paradigm shift in business computing.  It is coming, and to be honest, I have no way to predict the impact of the change on efficiency and productivity in the business computer arena.  I do know that in order for it to achieve its potential, those of us focusing on cluster and grid computing have to deliver some sort of Service Level Agreement (SLA).  While David points to the cost advantages, what he did not point out is the lack of an SLA from Amazon.  Someone running an ecommerce site may willingly pay the additional money shown in David’s video for a traditional data center operation, if they can be assured of up-time and bandwidth.  Without these assurances, the dollar savings obtained by using a C3 solution may be given back in poor user experience or web client customer service.
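
To make “assurances of up-time” concrete, note that an availability percentage in an SLA translates directly into permitted downtime. The percentages below are common illustrative tiers, not figures from any Amazon agreement:

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be down and still meet the stated availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% up-time allows {allowed_downtime_hours(sla):.2f} h of downtime per year")
```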

That being said, I think that we (the greater community of Amazon Web Services and EC2 users) are working towards reasonable service levels upon which we can build ecommerce solutions. We developed our WeoCEO ISO because it was required in order to host our WeoGeo geospatial exchange on EC2. There are other service issues, such as large file ingestion (imagine trying to push a terabyte-size file up to S3!), but we are confident that these too can be overcome and solutions delivered to the community. I truly believe that the revolution is here. Like any other paradigm shift, it will create a tremendous opportunity for those willing to place their stakes in the ground to deliver solutions to those who follow.
 

On other notes, I will be taking my soapbox to ETech next week.  Find me if you would like to chat about such things as revolutions and paradigm shifts in cluster and grid computing, as well as geospatial technologies.

Cycles in the Sky

Wednesday, March 21st, 2007

There is a revolution happening quietly in the development of web computing infrastructure. To those who have been involved in the development of large-scale distributed computing, i.e., cluster and grid computing, the concepts and applications of the revolution are decades old. In the computational science community, these types of efforts have been at the forefront of supercomputing technology development. That community includes weather forecasters, climate change scientists, numerical ecologists, artificial intelligence experts, bomb developers, etc. I have been involved with various types of cluster computers for oceanic ecological modeling, and some of our collaborators at Rutgers University are experts in distributed computing for coupled atmospheric and oceanic modeling. However, for the average person, terms like cluster or grid computing have little or no tangible meaning. A few might connect them to the SETI@home grid computing project, but even those few might not understand the implications at the average business or consumer level.

One of the favorite terms today for large scale distributed computing at the business and consumer level is “Web-Scale Computing”. You see it in the sessions for a couple of the O’Reilly Conferences (ETech and Web 2.0 Expo) mainly discussing Amazon Web Services’ (AWS) EC2 and S3 services. AWS is one of the first mainstream applications that put the power of cluster computing into the hands of commercial web application developers. With these services, and those that will surely follow, we as a society/culture/business community move one step closer to the concept of on-demand purchasing of computer cycles, and the development of markets in these cycles.

These services, which I will refer to as Commodity Computing Cycles (C3), are different from commodity computing. In commodity computing you are still responsible for assembling the components and the network of processors into a cluster for your distributed processing application. You still have to pay for power, cooling, and maintenance, as well as for the personnel involved in the development, care, and security of the systems. These expenses are up front and continue throughout the life of the business, regardless of total computational use. With C3, you buy the FLOPS or the storage space your application needs, on demand.

With C3, your business can focus on developing better applications and services for your customers, rather than on building the in-house infrastructure to rack, cool, and take care of your computers. If outsourcing FLOPS and storage makes business sense (which I truly believe it does), we should expect demand for C3 services to increase. That demand should lead to more C3 infrastructure, feeding virtuously into the creation of ever more efficient web applications and services. If the revolution takes root and spreads across the business and consumer communities, it will affect us all.

To bring this discussion back home, we at WeoGeo are trying to change the dynamics of quantitative mapping. Our own maps are terabytes in size, and require petaFLOPS of processing. We developed our web exchange and server application on EC2 and S3 for many reasons, including the costs associated with the growth in our computing needs and the requirement to automatically scale as a function of computing cycle demand. In addition, by developing on a fully scalable C3 model (see below), we could pass the infrastructure savings directly to our user community. This should help enable them to develop new markets for their mapping products and hopefully lead them into a new model for generating revenue in the field of geospatial maps, services, and technologies.

The EC2 version of C3 marks the beginning of the widespread commercial use of on-demand distributed computing. I believe it is a harbinger of things to come. For our purposes, EC2 was not quite ready for prime time and we had to overlay additional intelligent management software to provide stability and optimized scaling to take full advantage of the C3 potential offered by AWS (see this WeoCEO blog post, as well as this AWS forum post by Robert Banfield). I am sure that our solution is but one of the first of many to come. The important thing to recognize is that the delivery of scalable, fully optimized, Commodity Computing Cycles is happening right now and will only get better, easier, and cheaper with time. I believe that the next phase of productivity enhancement in the business and consumer markets begins now, and it is truly exciting to be a part of this wave.