Archive for the 'Storage' Category

Storage, Background, Remote Sensing, Hyperspectral, Amazon, WeoGeo, geospatial, grid computing, WeoCEO, mapping, WeoGeo Server

Image Processing and Delivery using Virtual Computing on EC2

I posted last week about bandwidth issues associated with geospatial data and our AWS S3 solution. The deciding factor for us in choosing Amazon’s offerings was not necessarily the edge distribution capabilities of S3, but the synergy from combining S3 data storage and distribution with the virtual computing capabilities of EC2. Many image processing tasks require a great deal of memory and CPU horsepower. In both Market and Server, we offer the following basic map distribution options to our map providers:

Geo Clipping (6 zoom levels, allowing for ~125 million possible selections per data set)
Spatial Resampling (4 levels)
Layer Resampling (depends on data)
Output File types (5 - JPEG, GeoTIFF, ENVI, ESRI BIL, ERDAS IMG)
Projections (5 - UTM, Transverse Mercator, Lambert Conic, Albers Equal, Geographic)
Datums (3 - WGS84, NAD 83, NAD 27)

These options result in millions of possible map variants, which precludes storing each variant for distribution. So processing power for conversion is critical, and this processing power needs to be connected to a large, web-addressable, temporary data storage array to house the unique variant that a map user has selected. For a true mapping marketplace, this infrastructure also needs to support hundreds to possibly thousands of simultaneous map requests against the same base map, like the 40 GB image in Figure 1. Doing our NeoMapping Market correctly requires the creation of enormous processing, storage, and bandwidth infrastructure.
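
To put a rough number on the variant count, the fixed option counts in the list above can simply be multiplied together. This is only a back-of-the-envelope sketch in Python: it assumes the options combine independently, it leaves out layer resampling (which depends on the data set), and the variable names are purely illustrative.

```python
from math import prod

# Option counts taken from the list above; layer resampling is omitted
# because it varies by data set.
options = {
    "spatial_resampling": 4,
    "output_file_types": 5,
    "projections": 5,
    "datums": 3,
}

# Assumption: every option can be combined freely with every other.
variants_per_selection = prod(options.values())

# Approximate number of geo-clip selections per data set (from the list).
clip_selections = 125_000_000

total_variants = variants_per_selection * clip_selections
print(f"{variants_per_selection} variants per clipped selection")
print(f"{total_variants:,} variants per data set")
```

Under those assumptions there are 300 output variants for every clipped selection, so the total per data set runs well past the millions and pre-generating everything is clearly not practical.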

Figure 1. 40 GB, 156 layer HyperSpectral Imagery (HSI) map listed on WeoGeo Market. (Click on image to go to the listing in the Market).

However, who could afford that infrastructure upfront? Our original estimates for acquiring our base computing needs and placing them into a co-location facility were around $500K. While not a lot of money on the scale of today’s internet operations, it was big for us. In addition, we were trying to develop the software architecture to support the Market and Server, and these expenses were large in and of themselves. AWS provided a unique and simultaneous answer to many of our immediate storage, processing, and distribution needs.

Developing our infrastructure on the scalable AWS solution allows us to say we can support the thousands of map requests required for a functioning digital marketplace. The user experience is vital to the service’s credibility and therefore our success. However, there is a true (and in a number of cases unexpectedly high) cost in this decision. We traded high capital expenditures for high operating expenditures. In an upcoming post, I’ll talk about the Total Cost of Operations (TCO) on AWS, and some of the ways we are moving to reduce these high operating expenses through stability and scaling solutions. Some of these solutions we have turned into products that we provide to others (e.g., WeoCEO).

I would be interested in hearing about the actual experience of others on AWS and whether S3 and EC2 could or could not meet their needs.

Storage, Background, Amazon, FERI, mapping, WeoGeo Server

How do you deliver 100 40GB imagery files?

This is a bit tougher than the solution discussed in this earlier post. When we (FERI) first started developing HSI sensors and flying them for others, the distribution of imagery data was mainly through DVDs. As the research groups got larger, we started getting more and more requests for data. This eventually led to the WeoGeo Server solution, which allows for customization and asynchronous delivery.

However, 100 40 GB files that look like Figure 2 in my HSI post means 4 TB of data through our lab’s pipe in a relatively short period of time. Our bandwidth at the time we were trying to develop these solutions was a dedicated T1, or about 1.5 Mbit per second. Transferring 4 TB of imagery files with full use of our pipe would require roughly 259 days.
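
As a rough check on that estimate, the arithmetic can be sketched in Python. The 259-day figure falls out when both quantities are read in binary units (4 TB as 4 × 2^40 bytes and the T1 as 1.5 × 2^20 bits per second). That reading is an assumption, and the function name is just for illustration.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(size_bytes, rate_bits_per_second):
    """Days needed to move size_bytes over a fully dedicated link."""
    return size_bytes * 8 / rate_bits_per_second / SECONDS_PER_DAY

# Assumption: binary units, with 100 files of 40 GB rounded to 4 TB.
total_bytes = 4 * 2**40
t1_rate = 1.5 * 2**20  # bits per second

days = transfer_days(total_bytes, t1_rate)
print(f"{days:.0f} days")
```

With decimal terabytes and the nominal T1 line rate of 1.544 Mbit/s, the same function gives about 240 days, so the conclusion does not change: the better part of a year with the pipe doing nothing else.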

Clearly there are some solutions these days that would have helped with this type of large file distribution effort. Akamai, Limelight Networks, or some BitTorrent solution would provide capabilities to deliver large files over distributed networks. However, we were also providing search and customization solutions, which required modification of the data before delivery. This meant that we had a scalability problem in processing as well as delivery. Edge distribution solutions would solve one part of our problem, but not necessarily the processing part.

We began to explore co-location solutions, but these seemed to require a lot of upfront costs, as well as travel and maintenance expenses. As a small business, we could not absorb those capital expenditures. It was at this point that we were introduced to Amazon Web Services by a former co-worker who had been recruited by Amazon. AWS allowed us to build distribution of large data files on top of a very large pipe via S3. (I’ll discuss the processing using EC2 later.) It provided us scalable distribution at reasonable cost for those 100 40 GB files.

To be honest, there are some devils in the details of using S3 for our operations. But (to date) the service has been more valuable than costly. The rapid ingestion of large files into S3 is a current problem that we are trying to solve. Moving forward, we hope to build on the expansion of S3 as Amazon develops more physical data storage locations. This will provide us with some of the edge distribution advantages of the above solutions, while keeping us connected with our virtual computing solutions on EC2.

I’m also curious to see how others are using S3 in geospatial solutions; if you have a unique one, please let me know.

Storage, Background, Hyperspectral, FERI

We need more space.

We have to buy more disks today. Actually, not just disks, but most likely another RAID cabinet. I work for the Florida Environmental Research Institute, which specializes in making maps from aircraft and satellites. We primarily create maps using a special form of automated feature extraction to interpret remote sensing data, or more specifically HyperSpectral Imaging (HSI). More on this subject in a later post, but simply put, we quantitatively separate features on the ground based on their color. “Quantitatively” is important because it means we can do it on a computer, routinely and automatically, using algorithms and computer code. It’s very cool stuff; we think it will change the world. Suffice it to say, this is cutting-edge mapping technology.

The problem we have is that our imaging sensors produce about 1 TB of flight operations data per day. Already massive, these data then have to be calibrated and processed to accurately map the pixel data in the image to exact locations on the ground. Finally, we stitch all of the frames or individual lines of data together into mosaics of entire areas, which then allows us to create thematic maps for our clients. At each step of this processing, we at least double the data. For every day we fly to collect data, we could end up needing at least 5 TB online for continued imagery R&D. Ouch.

We’re usually processing multiple projects from multiple flight days in our lab. As you can imagine, our data storage requirements, barely manageable for even a large organization with an infinite budget, are nearly impossible for our small non-profit. We buy disks when we need them and, unfortunately, often after we need them. With this “absolute necessity” approach, we conserve money by ensuring we always use the latest, most cost-effective technology.

Today, I walked back to one of our image engineers and asked if I could have the early versions of a processed data set for St. Joseph Bay, Florida (see here for a Google Earth KML link for one of the flight segments). The final processed image data for this segment was only ~70 GB, but the interim processed data was ~2 TB. My colleague told me he’d dumped the interim data because he needed the space to work on one of our other missions. It would take him several days to restore and recreate the previous products, and he could not do it without interrupting his current work. This was definitely not what I wanted to hear.

The immediate solution is to buy more storage space (which of course requires money we have yet to procure), but what about next time? Do we always just throw money at the problem? Clearly disk space will get cheaper, but increases in productivity and efficiency, which lead to better business opportunities and greater margins, are built on creatively using today’s infrastructure for tomorrow’s solutions. Waiting until tomorrow’s technology arrives, e.g. cheaper disks, is not going to create a better business model for this field. Mapping, particularly quantitative mapping — the kind that forms the basis for financial and resource management decisions — is still prohibitively cumbersome and expensive. In order to deal with the fact that it’s plain hard and expensive to collect, manage, and distribute mapping products, this field needs to get creative.