Posts Tagged ‘WeoGeo Server’

Follow-Up to Directions Magazine’s Podcast on WeoGeo

Tuesday, October 2nd, 2007

Adena Schutzberg did a podcast with me last week on the business model for WeoGeo. It was my first podcast and I hope that I made sense to people (I welcome comments and/or critiques in the comments section here). I would like to thank Adena for giving us the opportunity to tell our story.

However, I am not sure I was as clear as I could have been about our history and the importance of that history in the development of WeoGeo. I could not quite put my finger on what was missing until after the AWS StartUp Event – Boston (see here as well for my comments) when someone asked how many man-years of effort went into developing the site.

My first response was to take the number of years that FERI was in operation times the number of people involved at FERI. Kind of silly, I know. But when I think of why we built WeoGeo, this response seems relevant. Their response, of course, was, “no really, how much technical development time?” I understood the question; the person was trying to ascertain how difficult it would be to recreate what we are doing.

Our technical development on this project started back around 2001 with a project called Hyperspectral Data Repository On-line (HyDRO). This was our first distribution system, developed to help alleviate the problems associated with delivering HSI data to our customers. This concept and technology eventually evolved into the WeoGeo Server (see post here as well). Between 2001 and 2005 we had 4 PhDs and masters-trained personnel spending a portion of their time on HyDRO because it was a critical element of our research programs. In the last couple of years, we increased the number of people working on WeoGeo Market/Server to more than 12 currently, if you include outside contractors. For the most part, they are highly trained GIS and MIS/CIS/CS personnel.

The technology is hot, no question about it. I am amazed on a daily basis by what our group of people has developed for mapping on both commodity computers and utility computing systems. Yet, here is the rub with this type of man-years calculation. I really believe that the reasons for WeoGeo, and its associated development time, stem from our history at FERI, which makes such calculations difficult. The “technical development time” is not just time spent coding; it includes the needs assessment and the development of the system architecture to address critical problems and/or pain. What we have developed at WeoGeo is a direct function of two critical needs of our operations as a research and imagery services organization.

These two critical needs were (and still are):
1) Delivery of our survey grade, high volume mapping content;
2) Finding and acquiring other survey grade mapping content to fuse with ours to create value-added geocontent for our clients.

WeoGeo was built to solve these two critical problems (there are others, but not nearly as critical to our organization as these). If you have never been faced with these problems, then you might not appreciate the depth of the solutions we have built to service these needs (and their potential). But if you have, then you have felt our pain – and I hope you value our solution.

Image Processing and Delivery Using Virtual Computing on EC2

Thursday, September 6th, 2007

I posted last week about bandwidth issues associated with geospatial data and our AWS S3 solution. The deciding factor for us to use Amazon’s offerings was not necessarily the edge distribution capabilities of S3, but the synergy from combining S3 data storage and distribution with the virtual computing capabilities of EC2. There are multiple issues in image processing that require a ton of memory space and CPU horsepower. In both Market and Server, we offer the following basic map distribution options to our map providers:

Geo Clipping (6 zoom levels, allowing for ~125 million possible selections per data set)
Spatial Resampling (4 levels)
Layer Resampling (depends on data)
Output File types (5 – JPEG, GeoTIFF, ENVI, ESRI BIL, ERDAS IMG)
Projections (5 – UTM, Transverse Mercator, Lambert Conic, Albers Equal, Geographic)
Datums (3 – WGS84, NAD 83, NAD 27)
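
As a rough illustration of how quickly these options multiply, the sketch below counts the combinations of the fixed-size options listed above. It is only an assumption-laden tally, not how Market or Server enumerate variants: the `options` dictionary and the numeric resampling labels are hypothetical, and geo clipping and layer resampling are left out because their counts depend on the data set.

```python
from itertools import product
from math import prod

# Option values copied from the list above; the dictionary layout and the
# resampling labels (1-4) are hypothetical, chosen only for illustration.
options = {
    "spatial_resampling": [1, 2, 3, 4],
    "file_type": ["JPEG", "GeoTIFF", "ENVI", "ESRI BIL", "ERDAS IMG"],
    "projection": ["UTM", "Transverse Mercator", "Lambert Conic",
                   "Albers Equal", "Geographic"],
    "datum": ["WGS84", "NAD 83", "NAD 27"],
}

# Number of distinct output configurations for one clipped area.
combos_per_clip = prod(len(values) for values in options.values())
print(combos_per_clip)  # 4 * 5 * 5 * 3 = 300

# Enumerate the configurations lazily instead of storing a file for each one.
for resample, file_type, projection, datum in product(*options.values()):
    pass  # a real system would build the requested variant on demand here
```

Every clip selection can therefore be delivered in 300 different ways before layer resampling is even considered, which is why the variants have to be generated on request rather than stored.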

These options result in millions of possible map variants, which precludes the storage of each variant for distribution. So processing power for conversion is critical, and this processing power needs to be connected to a large, web-addressable, temporary data storage array to house the unique variant that a map user has selected. Now for a true mapping marketplace, this infrastructure needs to support hundreds to possibly thousands of simultaneous map requests from the same base map, like the 40 GB image in Figure 1. Doing our NeoMapping Market correctly requires the creation of enormous processing, storage, and bandwidth infrastructure.

Figure 1. 40 GB, 156 layer HyperSpectral Imagery (HSI) map listed on WeoGeo Market. (Click on image to go to the listing in the Market).

However, who could afford that infrastructure upfront? Our original estimates for acquiring base computation needs and placing them into a co-location facility were around $500K. While not a lot of money on the scale of today’s internet operations, it was big for us. In addition, we were trying to develop the software architecture to support the Market and Server, and these expenses were large in and of themselves. AWS provided a unique and simultaneous answer to many of our immediate storage, processing, and distribution needs.

Developing our infrastructure on the scalable AWS solution allows us to say we can support the thousands of map requests required for a functioning digital marketplace. The user experience is vital to the service’s credibility and therefore our success. However, there is a true (and in a number of cases unexpectedly high) cost in this decision. We traded high capital expenditures for high operating expenditures. In an upcoming post, I’ll talk about the Total Cost of Operations (TCO) on AWS, and some of the ways we are moving to reduce these high operating expenses through stability and scaling solutions. Some of these solutions we have turned into products that we provide to others (e.g., WeoCEO).

I would be interested in hearing about the actual experience of others on AWS and whether S3 and EC2 could or could not meet their needs.

How Do You Deliver 100 40GB Imagery Files?

Thursday, August 30th, 2007

This is a bit tougher than the solution discussed in this earlier post. When we (FERI) first started developing HSI sensors and flying them for others, the distribution of imagery data was mainly through DVDs. As the research groups got larger, we started getting more and more requests for data. This eventually led to the WeoGeo Server solution, which allows for customization and asynchronous delivery.

However, 100 40GB files that look like Figure 2 in my HSI post mean 4TB of data through our lab’s pipe in a relatively short period of time. Our bandwidth at the time we were trying to develop these solutions was a dedicated T1, or about 1.5 Mbit per second. To transfer 4 TB of imagery files with full use of our pipe would require roughly 259 days.
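
For anyone who wants to check that transfer-time arithmetic, here is a minimal sketch. It assumes binary units (1 TB = 2**40 bytes, 1 Mbit = 2**20 bits), which is the reading that reproduces the 259-day figure; the function name `transfer_days` is just an illustrative choice, and protocol overhead is ignored.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(size_bytes: float, rate_bits_per_s: float) -> float:
    """Days needed to move size_bytes over a link running flat out."""
    seconds = size_bytes * 8 / rate_bits_per_s
    return seconds / SECONDS_PER_DAY

# Assumption: binary units for both the 4 TB payload and the 1.5 Mbit/s link.
binary_days = transfer_days(4 * 2**40, 1.5 * 2**20)
print(round(binary_days, 1))   # 258.9

# With decimal units (4e12 bytes, 1.5e6 bit/s) the answer is a bit lower.
decimal_days = transfer_days(4e12, 1.5e6)
print(round(decimal_days, 1))  # 246.9
```

Either way the conclusion is the same: the better part of a year with the line fully saturated and nothing else using it.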

Clearly there are some solutions these days that would have helped this type of large file distribution effort. Akamai, Limelight Networks, or some BitTorrent solution would provide capabilities to deliver large files over distributed networks. However, we were also providing search and customization solutions, which required modification of the data before delivery. This meant that we had a scalability problem in processing as well as delivery. Edge distribution solutions would solve one part of our problem, but not necessarily the processing part.

We began to explore co-location solutions, but these seemed to require a lot of upfront costs, as well as travel and maintenance expenses. As a small business, those capital expenditures were more than we could absorb. It was at this point that we were introduced to Amazon Web Services by a former co-worker who had been recruited by Amazon. AWS allowed us to build a distribution system for large data files on top of a very large pipe via S3. (I’ll discuss the processing using EC2 later.) It provided us scalable distribution at reasonable cost for those 100 40GB files.

To be honest, there are some devils in the details in using S3 for our operations. But (to date), the service has been more valuable than costly. The rapid ingestion of large files into S3 is a current problem that we are trying to solve. Moving forward we hope to build on the expansion of S3 as Amazon develops more physical data storage locations. This will provide us with some of the edge distribution advantages of the above solutions, while keeping us connected with our virtual computing solutions on EC2.

I’m also curious to see how others are using S3 in geospatial solutions; if you have a unique one, please let me know.

40 GB Imagery File Redux

Wednesday, August 29th, 2007

An obvious question that drops out of yesterday’s post on the right file format to use to distribute large raster files is, “How do you distribute a 40 GB file?” The distribution of a single 40 GB file would overwhelm the bandwidth of many small businesses. That was one of the reasons we originally developed the WeoGeo Server.

Figure 1. WeoGeo Server (click on the image to see more information)

The Server allows the mapping organization to distribute customer-defined customized products that would reduce the required file size, and thus bandwidth, to satisfy their customers’ demand. However, there is still the use case where the customer wants the whole file.

Since FERI is a small business, we couldn’t have our daily research activities impacted by an imagery request. So the first (obvious) step was to develop a customization and distribution system that processes a data request in an asynchronous manner, i.e. the order is taken during business hours, but it is processed and delivered after business hours. This allowed us to optimize our bandwidth in our labs and still reasonably satisfy customer demands (assuming they did not need instantaneous data delivery). We also tweaked the system to allow some small files and all of our own requests to be processed immediately, while larger ones for external users were processed in the evenings.
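
The deferral rule described above can be sketched in a few lines. This is not the WeoGeo Server code, just a minimal illustration under stated assumptions: the business hours (08:00 to 18:00), the 100 MB "small file" cutoff, and the names `Order` and `process_at` are all hypothetical, and weekends are ignored.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumptions for illustration only; the real hours and cutoff are not stated.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)
SMALL_FILE_BYTES = 100 * 2**20  # hypothetical "small file" threshold

@dataclass
class Order:
    size_bytes: int
    internal: bool       # True for our own requests
    placed_at: datetime

def process_at(order: Order) -> datetime:
    """Return the time at which an order should be processed."""
    # Internal requests and small files go through right away.
    if order.internal or order.size_bytes <= SMALL_FILE_BYTES:
        return order.placed_at
    # Large external orders placed during business hours wait until closing.
    if BUSINESS_START <= order.placed_at.time() < BUSINESS_END:
        return datetime.combine(order.placed_at.date(), BUSINESS_END)
    # Outside business hours the pipe is free, so process immediately.
    return order.placed_at

big = Order(40 * 2**30, internal=False, placed_at=datetime(2007, 8, 29, 10, 30))
print(process_at(big))  # 2007-08-29 18:00:00
```

The order itself is still accepted the moment it is placed; only the heavy processing and delivery are pushed to the evening.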

The asynchronous data delivery is also a fundamental difference between our technology and online GIS servers. We optimized for discovery, customization, and ordering in a way that allows the customer to receive near-instant gratification on the discovery and ordering, while (possibly) delaying gratification on the delivery.

While the customization of product selection and the asynchronous processing and delivery bought us some additional help in terms of distributing large geospatial content files, it still did not help us with the problem of what to do with multiple requests for 40 GB image files. This is where some of my earlier posts, where I described our use of Amazon Web Services, begin to make some sense (and maybe why Jinesh digs what we are doing).

However, I am late for dinner, so I’ll pick up this theme in a later post…