Expressions of Intent for International Polar Year 2007-2008 Activities

Expression of Intent Details


PROPOSAL INFORMATION

(ID No: 169)

Management and Preservation Techniques for Disk-Based Archival Storage to Support Polar Exploration  (Cold Storage)

Outline
Data gathered by International Polar Year projects must be maintained for decades, but doing so presents a major challenge. For example, data gathered in 1957 would have been written on reel-to-reel tapes and would be unreadable today if it were still on the original media; however, constantly moving data to new media is expensive and labor-intensive. Moreover, actually finding a particular data set from 1957 would also be difficult.
The Cold Storage project will address these issues by using intelligent "bricks" to store data for the long term. By adding a low-cost CPU and network interface to commodity disk drives, we can build a very low-cost replicable unit from which we can construct an inexpensive large-scale data archive. Most of the disks will be powered off most of the time, saving power and reducing cooling costs. This system will react to disk failure by replicating the data onto other drives already in the system; the only storage management necessary will be the connection of new drives when the available capacity becomes too low for the volume of data being stored. Protecting data against disk failure and distributing and locating data in the system are the primary challenges for such a long-term archive; these challenges are made more difficult by the need to keep most of the disks powered off most of the time.
The second area addressed by Cold Storage is the need to find useful information in a multi-terabyte (or even petabyte-scale) archive. Current indexing structures are not well-suited for such large data systems, and typically require all of the storage devices to be active for searching. Cold Storage will explore rich metadata and links between data sets, along with extensive indexing, to allow users to navigate stored data. While this approach is similar to that used on the Web, it will benefit both from the rich metadata associated with files and links and from locality-based searching, in which the system can give more weight to files that are related via shorter sets of links. Building such a locality-capable index, and ensuring that it is usable, scalable, and functional in the face of individual component failures, is the second component of the Cold Storage project.
The result of this project will be technology that allows the construction of inexpensive, easy-to-maintain long-term archives using commodity components. Using this technology, results from IPY science can be more easily managed and disseminated, furthering IPY goals.
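As an illustration only, the locality-based searching mentioned in the outline could be sketched as follows. This is a minimal sketch under stated assumptions, not the project's design: the data set names, the `LINKS` and `METADATA` tables, the function names, and the `decay` weighting are all hypothetical. It finds the fewest links between a starting data set and every other reachable data set, then gives matching data sets a weight that shrinks as the link path grows.

```python
from collections import deque

# Hypothetical link graph: each data set lists the data sets it links to.
LINKS = {
    "ice_core_2007": ["temp_log_2007", "site_survey"],
    "temp_log_2007": ["temp_log_1957"],
    "site_survey": [],
    "temp_log_1957": [],
}

# Hypothetical metadata: a set of keywords per data set.
METADATA = {
    "ice_core_2007": {"ice", "antarctic"},
    "temp_log_2007": {"temperature", "antarctic"},
    "site_survey": {"antarctic"},
    "temp_log_1957": {"temperature", "antarctic"},
}


def link_distances(links, start):
    """Breadth-first search: fewest links from start to each reachable data set."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def locality_search(links, metadata, start, term, decay=0.5):
    """Return (name, weight) pairs for data sets whose metadata contains term.

    The weight is decay ** (number of links from start), an assumed
    weighting chosen only to show shorter link paths ranking higher.
    """
    results = []
    for name, hops in link_distances(links, start).items():
        if term in metadata.get(name, ()):
            results.append((name, decay ** hops))
    # Highest weight first; ties broken by name for a stable order.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


if __name__ == "__main__":
    for name, weight in locality_search(LINKS, METADATA, "ice_core_2007", "temperature"):
        print(name, weight)
```

In this toy example a search for "temperature" starting from the 2007 ice core ranks the 2007 temperature log (one link away) above the 1957 log (two links away). A real index would also have to cope with most disks being powered off, which this sketch does not model.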

Theme(s)   Major Target
  Data Management

What significant advance(s) in relation to the IPY themes and targets can be anticipated from this project?
This project will advance the technology needed to maintain IPY data for the next several decades, improving functionality while reducing cost. The use of commodity components for long-term archiving, together with preservation and indexing techniques that require minimal human supervision, will reduce costs and make it possible for a long-term archive to provide IPY data for decades to come.

What international collaboration is involved in this project?
There is no explicit need for international collaboration on this project. However, it is expected that researchers of any nationality will be able to use the systems developed to store and retrieve data.


FIELD ACTIVITY DETAILS

Geographical location(s) for the proposed field activities:
Since the focus of this project is data management, field activities may not be necessary. Rather, data management can be done at non-polar locations with networks moving data to and from the field.

Approximate timeframe(s) for proposed field activities:
Arctic: n/a
Antarctic: n/a

Significant facilities will be required for this project:
This project will require a large number of hard drives and associated CPUs, along with the network infrastructure to allow communication among the devices and between the devices and the sensors / researchers. Much of the communication infrastructure will be in place for other uses.

Will the project leave a legacy of infrastructure?
The goal of the project is to leave a legacy storage system that will allow for simple, cheap maintenance of data for decades after the IPY completes. This infrastructure may be located at any convenient area.

How is it envisaged that the required logistic support will be secured?
National agency

Has the project been "endorsed" at a national or international level?
The project has not been endorsed at a national or international level at this time.


PROJECT MANAGEMENT AND STRUCTURE

Is the project a short-term expansion (over the IPY 2007-2008 timeframe) of an existing plan, programme or initiative or is it a new autonomous proposal?
New

How will the project be organised and managed?
The project will be managed by the Storage Systems Research Center (SSRC) at the University of California, Santa Cruz. This research group consists of about ten faculty and post-doctoral fellows and 20-25 graduate students. The SSRC has experience managing large projects; its current annual budget exceeds $600,000. A project manager will be hired if one proves necessary.

What are the initial plans of the project for addressing the education, outreach and communication issues outlined in the Framework document?
The project will be central to the communication and education missions of the IPY. By making data more easily and more cheaply available, Cold Storage will help make the IPY more useful to the scientific community. In addition, it will bridge the computer science and geophysics communities.

What are the initial plans of the project to address data management issues (as outlined in the Framework document)?
The focus of the project is data management; please see the overall description for details.

How is it proposed to fund the project?
Funding may come from several sources: US government funding agencies (NSF and others) as well as potential industrial collaborators.

Is there additional information you wish to provide?
None


PROPOSER DETAILS

Prof. Ethan Miller
Computer Science Department
University of California / 1156 High Street, MS SOE3
Santa Cruz, CA
95064
USA

Tel: +1 831 459-1222
Mobile: no
Fax: +1 831-459-4829
Email:

Other project members and their affiliation

Name   Affiliation
Darrell Long   University of California, Santa Cruz
Scott Brandt   University of California, Santa Cruz
Ahmed Amer   University of Pittsburgh