PHIDIAS: Continuing to boost cloud services for marine data management, services and processing
Observing the ocean is challenging: missions at sea are costly, different scales of processes interact, and the conditions are constantly changing. This is why scientists say that "a measurement not made today is lost forever". For these reasons, it is fundamental to properly store both the data and metadata, so that access to them can be guaranteed for the widest community, in line with the FAIR principles: Findable, Accessible, Interoperable and Reusable.
The PHIDIAS added value
Long-term data archiving procedures have been specified in the frame of this Work Package and rely on the HPC (high-performance computing) and HPDA (high-performance data analytics) expertise of the PHIDIAS partners.
The PHIDIAS ocean data use case is led by IFREMER together with leading European research groups in ocean studies, such as the Université de Liège, MARIS, CNRS, CSC and the Finnish Environment Institute, under the coordination of CINES, the leading HPC centre in France, with support from other PHIDIAS partners actively engaged in this initiative. It aims to improve the use of cloud services for marine data management, data services to users in a FAIR perspective, and on-demand data processing.
These goals are being pursued through three main tasks:
1. Improvement of long-term stewardship of marine in-situ data. The SeaNoe service allows users to upload, archive and publish their data; each dataset is assigned a permanent identifier (DOI) so that it can be cited and referenced. Efforts will focus on scalability, on the exchanges between data centres in charge of related data types, and on the protection of long-term archives. Long-tail data (measurements acquired more sporadically, e.g. during a scientific cruise or by manual work) are of particular interest.
2. Improvement of data storage for services to users. The goal is to provide users with (1) fast, interoperable access to data from multiple sources, for visualisation and data submission purposes; (2) parallel processing capabilities within dedicated high-performance computing environments, using, for example, Jupyter notebooks or the Pangeo software ecosystem.
3. Marine data processing workflows for on-demand processing. The objective is to let users access data, software tools and computing resources seamlessly in order to create added-value products, such as quality-controlled merged datasets or gridded fields.
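In practice, the parallel processing of task 2 would run in Jupyter notebooks backed by the Pangeo stack (e.g. Dask and xarray). As a rough, standard-library-only illustration of the underlying map-reduce pattern, the hypothetical sketch below splits a set of observations into chunks and averages them concurrently; all function names are illustrative, not part of any PHIDIAS tool.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def chunk_mean(values):
    """Compute the mean of one chunk of observations."""
    return mean(values)

def parallel_mean(observations, n_chunks=4):
    """Split observations into chunks, process the chunks concurrently,
    and combine the partial results (a simple map-reduce pattern)."""
    size = max(1, len(observations) // n_chunks)
    chunks = [observations[i:i + size] for i in range(0, len(observations), size)]
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(chunk_mean, chunks))
    # Weight each per-chunk mean by the chunk's size before combining.
    total = sum(m * len(c) for m, c in zip(partial, chunks))
    return total / len(observations)
```

On a real Pangeo deployment the chunking and scheduling are handled transparently by Dask over distributed workers rather than written by hand.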
The Use Case Working Plans
The final version of the Ocean Use Case will be ready by month 34. To fulfil the scientific goals of the use case, the work plans focus mainly on technical developments and the implementation of tools. In particular, the tools related to the long-term archiving of both data and metadata (Deliverable D6.1.2) and to the storage and archiving of large salinity datasets (Deliverable 6.2.2), from in-situ (SeaDataCloud) and satellite (SMOS mission) sources, have to be developed or improved.
The team has identified several topics in its working plan for the coming months, along with the corresponding proposed solutions.
- Service scalability: Due to technical limitations, the maximum allowed size for a data upload is currently 0.2 TB.
Solution: use alternative, possibly asynchronous, transfer protocols, for example virtual file systems, and share the allocation of the necessary storage resources across different infrastructures.
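One common way around a single-request size limit is to transfer large datasets in independent parts. The sketch below is a hypothetical, simplified illustration of that idea (it is not the actual PHIDIAS or SeaNoe upload mechanism): the payload is cut into fixed-size chunks that could each be sent asynchronously, with a checksum returned for end-to-end integrity verification.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per part; real services tune this value

def iter_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield successive fixed-size parts of a payload."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]

def chunked_upload(data: bytes, send_part):
    """Send a large payload as independent parts, so the transfer can be
    resumed or parallelised. `send_part(index, part)` stands in for an
    asynchronous transfer call. Returns a SHA-256 digest that the server
    could verify after reassembling the parts."""
    for index, part in enumerate(iter_chunks(data)):
        send_part(index, part)
    return hashlib.sha256(data).hexdigest()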
- Back-office exchanges: To make long-tail data available in data collections, many exchanges (currently performed manually) between the Data Centres involved are necessary.
Solution: implementing iRODS (Integrated Rule-Oriented Data System) data flows to automate these exchanges and make them more efficient.
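In an actual deployment, these flows would be expressed as iRODS rules triggered by the rule engine on ingest. Purely as a conceptual model of what such an automated exchange does, the hypothetical Python sketch below copies a newly ingested dataset from one centre's store to the centre in charge of that data type and verifies its integrity; the dictionaries stand in for the centres' storage systems.

```python
import hashlib

def checksum(payload: bytes) -> str:
    """Integrity fingerprint of a stored object."""
    return hashlib.sha256(payload).hexdigest()

def exchange_dataset(dataset_id, payload, source_store, target_store):
    """Model of an automated back-office exchange: replicate a newly
    ingested dataset to the partner centre, then confirm both copies
    match -- the kind of step an iRODS rule would perform on ingest."""
    source_store[dataset_id] = payload
    target_store[dataset_id] = payload  # replication step
    return checksum(source_store[dataset_id]) == checksum(target_store[dataset_id])
```

The benefit over the current manual process is that every transfer is triggered, logged and verified automatically, with no per-dataset human intervention.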
- Securing the long-term archive: Data Centre infrastructures are not always suitable for long-term archiving, and dedicated staff are not always available.
Solution: rely instead on professional long-term repositories and spread dataset storage across several geographically distributed sites. This could be achieved using, for instance, iRODS data flows.
- Fast access to datasets: In-situ datasets are spread across a wide range of systems, making the assembly of multidisciplinary datasets difficult for users.
Solution: use a working copy of the data, called a technical cache or Data Lake, with a structure designed to speed up and facilitate data processing. The Data Lake will be periodically synchronised with the Data Centres and data publication services.
- On-demand processing: Using specialised tools requires installing software and having computing resources available; the former can be time-consuming for users.
Solution: deploy the DIVAnd interpolation software tool (Deliverable 6.3.1) in a virtual machine, providing a significant improvement over the resources researchers or data experts typically have access to from their offices.
Supporting the EU's policy of Open Science
The European Commission has sought to advance the open science policy from its inception in a holistic and integrated way, covering all aspects of the research cycle from scientific discovery and review to sharing knowledge, publishing, and outreach.
PHIDIAS is supporting the EU's policy of open science and its goals will be pursued in line with the development of the European Open Science Cloud (EOSC) and the Copernicus Data and Information Access Services (DIAS).
The results of this use case should improve the day-to-day work of researchers and specialists in several respects.
Data publication: with the improved capabilities of SeaNoe, researchers will be able to seamlessly upload large datasets, ensure their long-term archiving, and publish them following the standards, best practices and recommendations of data management groups. This will also enhance the ingestion of long-tail data, which in turn will become available to a larger community.
Data access: thanks to fast access to the most recent data collections obtained from different sources and providers (Euro-Argo, SeaDataNet, EMODnet, CMEMS, imaging flow cytometry), users will be able to perform operations such as sub-setting (by region or parameter), quality control, visualisation or spatial interpolation.
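Region- and parameter-based sub-setting amounts to filtering observations against a bounding box and a variable name. The hypothetical sketch below shows that operation on a simple list-of-records representation (real collections would use indexed stores or xarray/Dask selections for speed):

```python
def subset(observations, lat_range, lon_range, parameter):
    """Select the observations inside a bounding box that carry a given
    parameter -- the region/parameter sub-setting described above.
    Each observation is a dict with 'lat', 'lon' and 'parameter' keys."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        obs for obs in observations
        if lat_min <= obs["lat"] <= lat_max
        and lon_min <= obs["lon"] <= lon_max
        and obs["parameter"] == parameter
    ]
```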
Data processing: the deployment of cutting-edge tools such as DIVAnd (spatio-temporal interpolation) in an HPC environment will allow scientists and experts to interpolate large datasets in space and time. In particular, this use case will cover the North Atlantic Ocean and the Baltic Sea, representing 10 million observations and a total of approximately 250 GB.
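To make the idea of gridding scattered observations concrete, here is a deliberately simple stand-in: inverse-distance weighting onto a regular grid, using only the standard library. This is not DIVAnd's actual algorithm (DIVAnd minimises a variational cost functional, which handles coastlines and observation errors far better), and all names are illustrative.

```python
def idw_grid(points, grid_lats, grid_lons, power=2.0):
    """Interpolate scattered (lat, lon, value) observations onto a regular
    grid with inverse-distance weighting: each grid cell gets a weighted
    average of all observations, with weights decaying as distance**-power."""
    field = []
    for glat in grid_lats:
        row = []
        for glon in grid_lons:
            num = den = 0.0
            for lat, lon, value in points:
                d2 = (glat - lat) ** 2 + (glon - lon) ** 2
                if d2 == 0.0:
                    # Grid node coincides with an observation: use it directly.
                    num, den = value, 1.0
                    break
                w = 1.0 / d2 ** (power / 2.0)
                num += w * value
                den += w
            row.append(num / den)
        field.append(row)
    return field
```

At the scale quoted above (millions of observations), this naive O(points x grid) loop is exactly why an HPC deployment of a proper tool like DIVAnd is needed.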
The final product will consist of an inter-comparison of satellite and in-situ sea surface salinity data (Deliverable 6.4), including INSPIRE-compliant online services for data visualisation and access.
Stay tuned for our regular updates and insights.