Large-scale Data Replication for LIGO

The Laser Interferometer Gravitational Wave Observatory (LIGO) is a multi-site national research facility whose objective is the detection of gravitational waves. LIGO consists of two facilities, one in Livingston, LA, and one in Richland, WA, operated jointly by the California Institute of Technology (CalTech) and the Massachusetts Institute of Technology (MIT). The detectors at these two sites work together to detect gravitational waves: tiny distortions of space and time caused when very large masses, such as stars, move suddenly. They are also used in cooperation with detectors in other countries to look for coincident signals that may indicate gravitational waves.

The Challenge

The LIGO facility includes three interferometers at the two sites and records thousands of channels at several sampling rates. This results in approximately one terabyte (1 TB = 1,024 GB = 1,048,576 MB) of data every day during a detection run. During the commissioning phase of the LIGO interferometers, these runs typically lasted for 2-8 weeks. LIGO expects to begin a one-year detection run late in 2005. Because the LIGO detector sites are remote, the data was originally stored on tapes and shipped to CalTech, where it was made available online to scientists. More recently, upgraded network connections have allowed replication to other data centers directly from the detector sites.
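
To put that figure in perspective, the following back-of-the-envelope calculation shows how thousands of channels at mixed sampling rates add up to roughly a terabyte per day. The channel counts, sampling rates, and 2-byte sample size are illustrative assumptions, not LIGO's actual channel configuration.

    # Back-of-the-envelope estimate of a daily detector data volume.
    # All channel counts, sampling rates, and the sample size below are
    # illustrative assumptions, not the real LIGO channel configuration.

    BYTES_PER_SAMPLE = 2        # assume 16-bit samples
    SECONDS_PER_DAY = 86_400

    # (number of channels, sampling rate in Hz) -- an assumed mix
    channel_groups = [
        (100, 16_384),      # fast channels
        (1_000, 2_048),     # medium-rate channels
        (10_000, 256),      # slow housekeeping channels
    ]

    bytes_per_day = sum(
        count * rate * BYTES_PER_SAMPLE * SECONDS_PER_DAY
        for count, rate in channel_groups
    )

    print(f"~{bytes_per_day / 2**40:.2f} TiB/day")  # ~0.98 TiB/day

With these assumed numbers, about 11,000 channels produce just under one tebibyte per day, on the order of the 1 TB/day quoted above.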

The data generated by a LIGO run must be scientifically analyzed for it to be of any value. The analysis is computationally intensive (some classes of astrophysical searches can require hundreds of teraflops), and the data volume itself is quite large. Nine sites within the LIGO collaboration (each operated independently) currently provide computing facilities based on commodity cluster computing. The scientists who have the expertise to perform this analysis are spread across 41 institutions on several continents, and this community continues to grow. The key challenge for LIGO is to get the data from the LIGO detectors to the sites where analysis happens and to make those sites accessible to the participating scientists.

The data management challenge faced by LIGO is therefore to replicate approximately 1 TB/day of data to multiple sites securely, efficiently, robustly, and automatically; to keep track of where replicas have been made for each piece of the data; and to use the data in a multitude of independent analysis runs. Each of the nine sites uses a mass storage system, but different sites use different systems. Scientists and analysts need a coherent mechanism to learn which data items are currently available, where they are, and how to access them. More specific requirements include the following.

  • When high-bandwidth network links (10+ Gb/s) are available, they should be used to capacity: no bandwidth should go unused while a transfer is in progress.
  • Network links should never sit idle between transfers: whenever data is waiting to be replicated, a transfer should be under way. (A transfer sketch follows this list.)
  • Scientists should be able to locate and understand data items using application-level descriptive terms (metadata).
  • Scientists should be able to locate replicas (copies) of any data item using database queries.
  • Data transfer endpoints should be authenticated using "strong" security, and data transfers should preserve data integrity.
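
To make the efficiency and security requirements concrete, here is a minimal sketch of how a single tuned, GSI-authenticated GridFTP transfer might be driven from Python using the standard globus-url-copy client. The host names, paths, and tuning values are assumptions for illustration; only the client and its flags are real.

    # A minimal sketch of a tuned, GSI-authenticated GridFTP transfer,
    # driven from Python via the standard globus-url-copy client.
    # Host names, paths, and tuning values are illustrative assumptions.
    import subprocess

    source = "gsiftp://ldas.example.edu/frames/H-R-752579443-16.gwf"
    dest = "gsiftp://cluster.example.org/data/frames/H-R-752579443-16.gwf"

    subprocess.run(
        [
            "globus-url-copy",
            "-p", "8",             # 8 parallel TCP streams to fill the pipe
            "-tcp-bs", "2097152",  # 2 MB TCP buffer for high bandwidth-delay links
            "-vb",                 # report transfer performance
            source,
            dest,
        ],
        check=True,  # raise if the transfer fails so it can be retried
    )

The gsiftp:// URLs imply GSI authentication of both endpoints, while the parallelism and TCP buffer settings address the bandwidth-utilization requirements above.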

The Solution

To meet this challenge, the LIGO Scientific Collaboration (a partnership of physicists, computer scientists, and information technology experts) developed the Lightweight Data Replicator (LDR), an integrated solution that combines several basic Grid components with other tools to provide an end-to-end system for managing LIGO's data. LDR's features make it well suited to replicating data sets to multiple sites within a joint project.

  • LDR is intended to be the simplest, easiest-to-maintain software that meets the LIGO requirements. It is based on existing open source Grid components, adding control logic to orchestrate the replication tasks.
  • LDR uses network links efficiently because it uses parallel data streams, tunable TCP windows, and tunable read/write buffers. LDR also manages continuous data transfers between sites, keeping the network busy whenever there is data waiting to move.
  • LDR tracks where copies of each file can be found, since a file may be replicated at several sites. User applications can query this information so that local data is used whenever it is available. Each site maintains a catalog of the data it holds, which can be queried directly, and the catalogs update one another, so a query at any member site can locate a particular file anywhere in the collaboration. (A lookup sketch follows this list.)
  • LDR also stores descriptive information (metadata) in a database. Hence, site administrators can select groups of files for replication based on descriptive fields rather than having to specify the name of every file.
  • LDR uses the Grid Security Infrastructure (GSI) to authenticate clients and services.
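
The division of labor between the metadata catalog and the replica catalogs can be sketched as a two-step lookup: a metadata query yields logical file names, and a replica query maps each logical name to the URLs of its physical copies. The catalogs below are toy in-memory stand-ins with an assumed schema, not the real LDR interfaces.

    # Conceptual two-step lookup: metadata -> logical file names -> replica URLs.
    # The dictionaries are toy stand-ins for the metadata catalog and the
    # replica catalog; the attribute schema and URLs are assumptions.

    METADATA = {  # logical file name -> descriptive attributes
        "H-R-752579443-16.gwf": {"site": "H", "frame-type": "R"},
        "L-R-752579443-16.gwf": {"site": "L", "frame-type": "R"},
    }

    REPLICAS = {  # logical file name -> physical URLs where copies live
        "H-R-752579443-16.gwf": [
            "gsiftp://ldas.example.edu/frames/H-R-752579443-16.gwf",
            "gsiftp://cluster.example.org/data/H-R-752579443-16.gwf",
        ],
    }

    def query_metadata(constraints):
        """Return logical names whose attributes match every constraint."""
        return [
            lfn for lfn, attrs in METADATA.items()
            if all(attrs.get(k) == v for k, v in constraints.items())
        ]

    def query_replicas(lfn):
        """Return the registered physical URLs for a logical name."""
        return REPLICAS.get(lfn, [])

    # Select files by description rather than by name, then locate copies,
    # preferring one on the local cluster when it exists.
    for lfn in query_metadata({"site": "H", "frame-type": "R"}):
        urls = query_replicas(lfn)
        local = [u for u in urls if "cluster.example.org" in u]
        print(lfn, "->", local[0] if local else urls)

Selecting by descriptive attributes is what lets site administrators request "all frames of a given type from a given run" without naming every file.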

LDR combines several existing Grid components (GSI, GridFTP, the Globus Replica Location Service or RLS, a metadata catalog service, and PyGlobus) with customized Python daemon code to provide a solution that meets the requirements above.
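
A rough sense of what that daemon code does can be given in miniature: periodically compare what the metadata catalog says should exist at a site against what the replica catalog says is already there, pull anything missing, and register the new copy. This loop structure is an assumption about LDR's design for illustration, not its actual code.

    # The control logic of a pull-based replication daemon in miniature.
    # The loop structure is an assumed simplification of LDR's design;
    # the catalogs are toy in-memory stand-ins as in the earlier sketch.
    import time

    METADATA = {"H-R-752579443-16.gwf": {"site": "H", "frame-type": "R"}}
    REPLICAS = {"H-R-752579443-16.gwf":
                ["gsiftp://ldas.example.edu/frames/H-R-752579443-16.gwf"]}

    WANTED = {"site": "H", "frame-type": "R"}   # this site's subscription
    LOCAL_PREFIX = "gsiftp://cluster.example.org/data/"

    def wanted_lfns():
        """Logical names whose metadata matches the site's subscription."""
        return [lfn for lfn, attrs in METADATA.items()
                if all(attrs.get(k) == v for k, v in WANTED.items())]

    def transfer(src, dst):
        """Stand-in for a tuned GridFTP pull (see the transfer sketch above)."""
        print(f"pulling {src} -> {dst}")

    def replicate_once():
        for lfn in wanted_lfns():                      # 1. what belongs here?
            urls = REPLICAS.get(lfn, [])
            if not urls or any(u.startswith(LOCAL_PREFIX) for u in urls):
                continue                               # 2. nothing to do yet
            dst = LOCAL_PREFIX + lfn
            transfer(urls[0], dst)                     # 3. pull a remote copy
            REPLICAS[lfn].append(dst)                  # 4. register the replica

    for _ in range(3):        # a real daemon would loop indefinitely
        replicate_once()
        time.sleep(1)         # polling interval is an assumed detail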

Results

The LIGO detectors went online and began producing data in 2002. By April 2005, the LIGO team had used LDR to replicate well over 50 terabytes of data to sites including CalTech, MIT, Penn State University, and the University of Wisconsin-Milwaukee, along with European sites including the Albert Einstein Institute in Golm, Germany; Cardiff University; and the University of Birmingham.

LDR was developed by Scott Koranda, Brian Moe, and Kevin Flasch at the University of Wisconsin-Milwaukee in cooperation with the NSF-funded GriPhyN and iVDGL projects and the DOE-funded SciDAC Data Grid Middleware project. Many members of the LIGO Scientific Collaboration have contributed to deploying and testing LDR.

Having learned from the LIGO project's success with LDR, members of the Globus Alliance have designed and implemented a Data Replication Service (DRS) that provides a pull-based replication capability similar to that of the LIGO LDR system. DRS is implemented as a Web service that complies with the Web Services Resource Framework (WSRF) specifications. DRS is available as a technology preview in Globus Toolkit 4.0, and the LDR team plans to test it as a potential replacement for LDR's data publishing component. The long-term goal is for the Globus Toolkit to provide reusable services in this area (like DRS) so that applications like LDR can be created with little or no new code.

Detailed Information

The following links provide more detail about the Lightweight Data Replicator (LDR).