Data Pipeline

1. The operator creates an initial (“sailing orders”) database record with the cruise id, ports/dates, mission summary, survey targets, science party, and award numbers. This record may be created online via an authenticated Web form at rvdata.us, or submitted by email as an XML record for automatic ingest.
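As an illustration of the XML submission route in step 1, the following minimal sketch builds a sailing orders record using only the Python standard library. The element and attribute names (cruise, port, missionSummary, and so on) and all sample values are hypothetical placeholders; the actual R2R schema is not specified here.

```python
import xml.etree.ElementTree as ET


def build_sailing_orders(cruise_id, ports, mission, targets, party, awards):
    """Return a sailing-orders record as an XML string.

    All element names are illustrative assumptions, not an R2R schema.
    `ports` maps "depart"/"arrive" to a (port name, ISO date) pair.
    """
    root = ET.Element("cruise", attrib={"id": cruise_id})
    for role in ("depart", "arrive"):
        port_name, date = ports[role]
        port = ET.SubElement(root, "port", attrib={"role": role, "date": date})
        port.text = port_name
    ET.SubElement(root, "missionSummary").text = mission
    # Repeated fields are grouped under a plural container element.
    for container, item, values in (
        ("surveyTargets", "target", targets),
        ("scienceParty", "person", party),
        ("awards", "award", awards),
    ):
        parent = ET.SubElement(root, container)
        for value in values:
            ET.SubElement(parent, item).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values only; none of these identify a real cruise.
    print(build_sailing_orders(
        "XX0001",
        {"depart": ("Port A", "2010-01-01"), "arrive": ("Port B", "2010-01-15")},
        "Example mission summary",
        ["example survey target"],
        ["A. Scientist"],
        ["AWARD-0000000"],
    ))
```

A record of this kind could be attached to an email for automatic ingest, with the receiving side parsing it back using the same library.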

2. Using an R2R-supplied application, or (optionally) software provided by the ship operator, the operator will initialize the R2R Event Log for the cruise when the vessel leaves port. The R2R application will be available as a shipboard webapp so that scientific staff designated by the Chief Scientist can create event log entries. A web API will also be available for event log entry via science software if the Chief Scientist wishes. Best practice documentation provided by R2R will be made available to the cruise participants responsible for entering events and maintaining the event log during the cruise. At the end of the cruise, the operator will close the event log, generating a final copy to be included in the cruise data distribution.

3. At the end of a cruise, the operator creates an “end-of-cruise” data distribution including:

    • Underway data sets organized in the operator’s standard directory structure;
    • Science party files captured in a “science dropbox” directory;
    • Event logs from the R2R Event Log application;
    • Any/all supplemental documentation (instrument manuals, calibration sheets, Operator and National Facility operator reports, etc.).

Copies of the data distribution are delivered to the science party and R2R repository, by whatever mechanism is convenient (tape/drive/disc by mail, or ftp/rsync network transfer).

4. When a data distribution is received from an operator, the R2R Data Manager creates a file-level inventory with the operator id, vessel id, shipment id, cruise id, path/name, size, date, and checksum for verification. (This step can be performed routinely and reliably using the “md5deep” open-source application.) The file inventory is stored in the relational database, and posted for public access on the rvdata.us Web portal. The operator and Chief Scientist are automatically notified (by RSS feed and email, respectively) and invited to review the file-level inventory to flag missing data.
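The inventory in step 4 can be sketched in a few lines of Python. This is a stand-in written with the standard library for illustration, not the md5deep tool itself; the function names and the dictionary keys are assumptions chosen to mirror the fields listed above.

```python
import hashlib
import os
from datetime import datetime, timezone


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(root_dir, operator_id, vessel_id, shipment_id, cruise_id):
    """Walk a data distribution and return one inventory row per file."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                "operator_id": operator_id,
                "vessel_id": vessel_id,
                "shipment_id": shipment_id,
                "cruise_id": cruise_id,
                # Paths are stored relative to the distribution root.
                "path": os.path.relpath(full_path, root_dir).replace(os.sep, "/"),
                "size": info.st_size,
                "date": modified.isoformat(),
                "md5": md5_of_file(full_path),
            })
    return rows


def find_missing(inventory, expected_paths):
    """Return expected relative paths that are absent from the inventory."""
    present = {row["path"] for row in inventory}
    return sorted(set(expected_paths) - present)
```

Rows in this shape map directly onto a relational table, and a helper like find_missing shows how a reviewer's list of expected files might be checked against the posted inventory.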

5. A copy of the data distribution is sent to deep storage at the SDSC Triton facility during the proprietary hold period, from which data may be recovered on request from the operator or Chief Scientist.

6. The R2R Data Manager, in collaboration with the operator, creates a “distribution profile” that maps the class (“multibeam”, “CTD”, etc.), vendor, and model of each routine underway instrument system to the standard directory where its data files are found. This profile is then applied to a cruise’s file-level inventory to automatically “classify” the files into data sets. We will engage the operators in an effort to create a fleet-standard template (directory structure and file naming convention) that all vessels may eventually adopt.
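One way to picture the distribution profile in step 6 is as a small lookup table applied to the inventory paths. The sketch below assumes a profile represented as a list of dictionaries and matches files by directory prefix; the directory names, vendor and model strings, and the "unclassified" bucket are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical profile: one entry per routine underway instrument system.
EXAMPLE_PROFILE = [
    {"directory": "multibeam/raw", "class": "multibeam",
     "vendor": "vendor-a", "model": "model-x"},
    {"directory": "ctd", "class": "CTD",
     "vendor": "vendor-b", "model": "model-y"},
]


def classify(paths, profile):
    """Group relative file paths into data sets keyed by instrument class.

    A file belongs to the profile entry whose directory is the longest
    prefix of the file's path; files matching no entry are collected
    under "unclassified".
    """
    data_sets = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        best = None
        best_depth = 0
        for entry in profile:
            dir_parts = PurePosixPath(entry["directory"]).parts
            depth = len(dir_parts)
            # The file must sit strictly below the profile directory.
            if len(parts) > depth and parts[:depth] == dir_parts:
                if depth > best_depth:
                    best, best_depth = entry, depth
        key = best["class"] if best else "unclassified"
        data_sets.setdefault(key, []).append(path)
    return data_sets
```

Matching on whole path components rather than raw string prefixes avoids a directory such as "ctd" accidentally capturing files under "ctd_backup".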

7. The R2R Data Manager creates a clean/final navigation track for the cruise, calculates a control point geometry (minimized trackline), and loads it into the relational database as a spatial object. This control point geometry will be published as both an OGC Web Feature Service (WFS) and a Web Map Service (WMS), so that it may be displayed on a track map in the rvdata.us Web portal as well as syndicated to data centers and remote applications.
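The text does not say how the control point geometry in step 7 is calculated. As one plausible approach, the sketch below applies the Douglas-Peucker line simplification algorithm, which keeps only the points needed to stay within a tolerance of the original track. It treats longitude/latitude as planar coordinates, which is an assumption that ignores Earth curvature and the antimeridian; the function names and the WKT helper are illustrative.

```python
import math


def _point_segment_distance(p, a, b):
    """Planar distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_track(points, tolerance):
    """Douglas-Peucker simplification of a list of (x, y) points."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]  # explicit stack avoids deep recursion
    while stack:
        first, last = stack.pop()
        worst_dist, worst_idx = 0.0, None
        for i in range(first + 1, last):
            d = _point_segment_distance(points[i], points[first], points[last])
            if d > worst_dist:
                worst_dist, worst_idx = d, i
        if worst_idx is not None and worst_dist > tolerance:
            keep[worst_idx] = True
            stack.append((first, worst_idx))
            stack.append((worst_idx, last))
    return [p for p, kept in zip(points, keep) if kept]


def to_wkt_linestring(points):
    """Format points as a WKT LINESTRING for loading into a spatial database."""
    return "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in points) + ")"
```

The tolerance controls the trade-off between fidelity and size: a larger value yields fewer control points for the track map, at the cost of cutting corners on tight survey patterns.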

8. The operator and Chief Scientist are both granted an authenticated login at the rvdata.us Web portal where they may review, amend, and approve the complete cruise record. Depending on cruise objectives and data analysis plans, the Chief Scientist may assert proprietary holds on data sets at this point if necessary. Once the cruise record is approved, it is published on the rvdata.us Web portal via an OAI-PMH service and submitted to inter/national catalogs including Geospatial One-Stop (www.geodata.gov) and POGO (www.pogo-oceancruises.org). The R2R Data Manager also applies an XSLT stylesheet to the cruise record to create a standard “Operations Report”.

9. Automated QA procedures and generation of QACs will be implemented as they are developed, one data type at a time beginning with geophysical data sets. (The development of full QA protocols will not prevent R2R from delivering original field data to the NDCs.)

10. The R2R Data Manager may also calculate and store higher-order file-level metadata (spatial/temporal indexes and class-specific details) for particular data types during the assessment phase, such as MB-System metadata for multibeam data files. We envision ultimately storing a spatial index as a database geometry for every routine underway data file, in order to apply geospatial searching over EEZ gazetteers and easily satisfy foreign clearance requests.
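To make the spatial index idea in step 10 concrete, the sketch below reduces each file to a bounding box and time range, then filters files against a region of interest. Real EEZ boundaries are polygons, so a bounding-box test like this would only be a coarse pre-filter ahead of an exact geometry query in the database. The input format (ISO 8601 UTC time strings with longitude and latitude) and all names are assumptions, and boxes crossing the antimeridian are not handled.

```python
def file_extent(fixes):
    """Summarize navigation fixes as a time range and bounding box.

    `fixes` is an iterable of (iso_time, lon, lat). Times are assumed to be
    ISO 8601 UTC strings in one consistent format, so they sort as text.
    """
    fixes = list(fixes)
    if not fixes:
        raise ValueError("at least one navigation fix is required")
    times = [t for t, _, _ in fixes]
    lons = [lon for _, lon, _ in fixes]
    lats = [lat for _, _, lat in fixes]
    return {
        "start": min(times),
        "end": max(times),
        # (min_lon, min_lat, max_lon, max_lat)
        "bbox": (min(lons), min(lats), max(lons), max(lats)),
    }


def bboxes_intersect(a, b):
    """True when two (min_lon, min_lat, max_lon, max_lat) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def files_touching_region(index, region_bbox):
    """Return names of indexed files whose bounding box overlaps the region."""
    return sorted(
        name for name, extent in index.items()
        if bboxes_intersect(extent["bbox"], region_bbox)
    )
```

With an index of this kind, a foreign clearance request reduces to listing the files whose extents touch the relevant waters and then confirming each candidate against the true boundary.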

11. As each data set is cleared of any proprietary holds, the R2R Data Manager assigns a persistent unique identifier to the data set and submits it to the appropriate NDC for permanent archival and dissemination. We have already demonstrated a pipeline to routinely submit multibeam data sets to the NGDC via an OAI-PMH service. Once the data set is accessioned at the NDC, the R2R database record is updated with the remote URL for direct download. No original field data will be served locally from the rvdata.us Web portal, since all routine underway data types will be archived at NODC or NGDC.

12. When all data sets are cleared of proprietary holds and the data are made available at NGDC or NODC, the full cruise distribution, as received by the Chief Scientist, is migrated from SDSC Triton to NGDC for final (permanent) deep storage. Original cruise data distributions may be recovered from NGDC deep storage via R2R on request from the operator or Chief Scientist.