Thanks to the success of the Copernicus program and the increasing awareness of satellite Earth Observation (EO) data, a growing number of cloud-based EO services are now offered on the European and global market for working with the available EO data. From the user perspective, this is currently creating confusion because of the large number of available services, the lack of comparability between the offers, and an inherent risk of vendor lock-in when selecting offers based on proprietary and/or closed-source solutions.
These issues, alongside others such as the growing volume of data to handle and the associated computational requirements, have led to the development of the openEO API (see https://openeo.org) since 2017, which already greatly reduces the risk of vendor lock-in when a sufficient number of back-ends is available. This project was successfully concluded at the end of 2020 and provided the first version of the openEO API, which has been implemented in a growing number of cloud back-ends and in three client libraries supporting R, Python, and JavaScript users.
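To illustrate what a back-end-independent request looks like, the following minimal sketch builds an openEO-style process graph as a plain dictionary: load a collection, average it over time, and save the result. The function name, the collection identifier, the extents and the dimension name "t" are illustrative assumptions and are not taken from a specific back-end; the node layout (process_id, arguments, from_node, result) follows the openEO API as we understand it.

```python
import json


def build_process_graph(collection_id, bbox, temporal_extent, bands):
    """Return an openEO-style process graph (hypothetical helper, plain dict).

    bbox is (west, south, east, north); temporal_extent is [start, end].
    """
    west, south, east, north = bbox
    return {
        "process_graph": {
            # Node 1: load the requested data cube.
            "load": {
                "process_id": "load_collection",
                "arguments": {
                    "id": collection_id,
                    "spatial_extent": {
                        "west": west, "south": south,
                        "east": east, "north": north,
                    },
                    "temporal_extent": list(temporal_extent),
                    "bands": list(bands),
                },
            },
            # Node 2: reduce the temporal dimension with a mean.
            # The dimension name "t" is an assumption; back-ends may differ.
            "mean": {
                "process_id": "reduce_dimension",
                "arguments": {
                    "data": {"from_node": "load"},
                    "dimension": "t",
                    "reducer": {
                        "process_graph": {
                            "mean1": {
                                "process_id": "mean",
                                "arguments": {"data": {"from_parameter": "data"}},
                                "result": True,
                            }
                        }
                    },
                },
            },
            # Node 3: export the result; "GTiff" is an assumed output format.
            "save": {
                "process_id": "save_result",
                "arguments": {"data": {"from_node": "mean"}, "format": "GTiff"},
                "result": True,
            },
        }
    }


graph = build_process_graph(
    "SENTINEL2_L2A",                 # assumed collection name
    (11.0, 46.0, 11.5, 46.5),        # assumed bounding box
    ["2020-06-01", "2020-09-01"],    # assumed time range
    ["B04", "B08"],                  # assumed band names
)
payload = json.dumps(graph, indent=2)
```

Because the graph is plain JSON, the same payload could in principle be submitted to any back-end that offers the referenced collection and processes, which is the property that reduces vendor lock-in.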
This work is being continued under the umbrella of ESA in the form of openEO Platform, and the concepts are being evolved further by introducing new aspects of federating different cloud back-ends that go well beyond offering the same interfaces from different back-ends. openEO Platform aims to provide openEO as a service to EO data users, who can easily access all kinds of data and processing, share results, and potentially offer their own value-added services on top of the platform (see https://openeo.cloud).
Newly added features include a single sign-on solution, data and process harmonization, integration of commercially offered datasets, shared accounting and billing procedures between the integrated back-ends, and marketplace offerings of user-generated applications and workflows. Driven by a number of challenging use cases, new processing capabilities are also being introduced and defined in openEO, including the generation of analysis-ready data (on-demand and on-the-fly) following CARD4L recommendations, machine learning, regression modelling, sampling, and improved time series modelling.
The federation on which openEO Platform is built includes existing and new features, allowing for a much more seamless user experience than was previously possible. A comprehensive library of standardized, well-documented processes has been defined, and a set of core processes to be supported by each federation member is currently in development. Alignment in the implementation and availability of those pre-defined processes is key for true interoperability in the federation. The same goes for the numerous data collections that are offered in openEO Platform by all the currently participating back-ends, such as Terrascope, the Earth Observation Data Centre (EODC) and Sentinel Hub via the Euro Data Cube. All data providers have adopted the metadata defined by the SpatioTemporal Asset Catalog (STAC) specification; moreover, the naming conventions of the elements in this metadata, such as collection names and band names, are harmonized in the federation. Shared user identity management, implemented through EGI Check-in, gives users a single point of entry and also allows for further integration with the European Open Science Cloud (EOSC).
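The harmonization of collection and band names can be pictured with a small sketch. It takes a STAC-like collection record and rewrites back-end-specific names to a shared convention. The alias tables, the local collection name and the function name are hypothetical and serve only as an illustration; only the "summaries" / "eo:bands" layout is borrowed from the STAC EO extension, and the real federation may implement harmonization quite differently.

```python
from copy import deepcopy

# Hypothetical alias tables: back-end-specific name -> harmonized name.
COLLECTION_ALIASES = {"S2_L2A_LOCAL": "SENTINEL2_L2A"}
BAND_ALIASES = {"red": "B04", "nir": "B08"}


def harmonize_collection(collection):
    """Return a copy of a STAC-like collection with harmonized names.

    Names without an alias are kept unchanged; the input is not modified.
    """
    result = deepcopy(collection)
    result["id"] = COLLECTION_ALIASES.get(result["id"], result["id"])
    for band in result.get("summaries", {}).get("eo:bands", []):
        band["name"] = BAND_ALIASES.get(band["name"], band["name"])
    return result


# A made-up record as one back-end might publish it before harmonization.
local_record = {
    "id": "S2_L2A_LOCAL",
    "summaries": {"eo:bands": [{"name": "red"}, {"name": "B08"}]},
}
harmonized = harmonize_collection(local_record)
```

With identical names on every back-end, a workflow that refers to a collection and its bands does not need to be rewritten when it is moved within the federation.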
All newly implemented core components of the federation now enable the development of distributed processing of workflows. Federated back-ends that provide the required data and processes will then be able to work collectively on larger jobs, or to complement each other where data or processing capabilities are missing.
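How back-ends might complement each other can be sketched as a simple capability lookup: given what each federation member offers, find the members that can run a workflow on their own and, for the others, report what is missing. The registry below, the back-end names and both helper functions are hypothetical; this is not the scheduling logic of openEO Platform, only a minimal illustration of the idea.

```python
# Hypothetical capability registry of two federation members.
BACKENDS = {
    "backend_a": {
        "collections": {"SENTINEL2_L2A"},
        "processes": {"load_collection", "ndvi", "save_result"},
    },
    "backend_b": {
        "collections": {"SENTINEL1_GRD"},
        "processes": {"load_collection", "sar_backscatter", "save_result"},
    },
}


def capable_backends(backends, collection, processes):
    """Names of back-ends offering the collection and every required process."""
    needed = set(processes)
    return sorted(
        name
        for name, offer in backends.items()
        if collection in offer["collections"]
        and needed <= set(offer["processes"])
    )


def missing_processes(backends, name, processes):
    """Required processes that the named back-end does not offer."""
    return sorted(set(processes) - set(backends[name]["processes"]))


workflow = ["load_collection", "ndvi", "save_result"]
candidates = capable_backends(BACKENDS, "SENTINEL2_L2A", workflow)
gaps = missing_processes(BACKENDS, "backend_b", workflow)
```

In this made-up registry only backend_a can run the workflow alone, while backend_b would need another member to supply the ndvi step.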
The EU Copernicus programme has established itself globally as the predominant spatial data provider, through the provision of massive streams of high resolution Earth Observation (EO) data. These data are used in environmental monitoring and climate change applications supporting European policy initiatives, such as the Green Deal and others. To date, there is no single European processing back-end that serves all datasets of interest, and Europe is falling behind international developments in big data analytics and computing. This situation limits the integration of these data in science and monitoring applications, particularly when expanding the applications to regional, continental, and global scales.
The C-SCALE (Copernicus - eoSC AnaLytics Engine, https://c-scale.eu) project federates European EO infrastructure services, such as ESA’s Sentinel Collaborative Ground Segment, the Copernicus DIASes (Data and Information Access Services under the EC), and independent nationally funded Earth Observation service providers, with European Open Science Cloud (EOSC) e-infrastructure providers.
The C-SCALE federation capitalises on the EOSC's capacity and capabilities to support Copernicus research and operations with large and easily accessible European computing environments. This allows the rapid scaling and sharing of EO data among a large community of users by increasing the service offering of the EOSC Portal.
By making such scalable, federated Big Copernicus Data Analytics services available through EOSC and its Portal, and by linking the problems and results with experience from other research disciplines, C-SCALE helps to support the EO sector in its development and enables the integration of EO data into other existing and future domains within EOSC and beyond, e.g. the ESA openEO Platform activity (https://openeo.cloud). By abstracting the set-up of computing and storage resources away from end-users, C-SCALE enables the deployment of custom workflows to generate meaningful results quickly and easily. Furthermore, the project will deliver a blueprint that sets up an interaction model between service providers to facilitate interoperability between commercial (e.g. DIASes) and public cloud infrastructures.
Pangeo reinvents the concept of ‘platforms’: a platform is no longer a fixed infrastructure that offers a certain service, but rather a flexible and adaptable ecosystem of components that offer the same user interface to access, manipulate and process scientific data at different levels, adapting to the underlying resources available. This pan-scale and cross-infrastructure capability can serve as the basis for genuinely open interoperability among different solutions. It leaves users free to decide where it is most convenient for them to explore, prototype and exploit data and, later on, according to their needs, to finalise production on the most suitable resource. Because scalability is based on modularity, it offers several advantages over fixed platform setups: it a) lowers the entry barrier, effectively allowing a single computer to become the platform, b) enables scalability, as the capacity of a platform depends only on the power of the incorporated components, and c) paves the way for truly open federated platforms, as any compatible component or module can join independently of the others and bring its assets.
Breaking the barriers of confinement that other solutions impose will create a situation in which firms have to attract users not with the most advanced interface, but with the most cost-effective solution. Moreover, an open-source, community-driven platform does not depend on a specific competitor's capability to drive the market, but lets anyone participate in defining it and adapting it to future needs.
As portability of the platform is one of Pangeo's pillars, it is open to a large variety of scenarios, from running it in a securely confined environment to promoting the reproducibility of analyses following the Open Science paradigm. Both of these worlds share the same approach, which in turn benefits from the growing number of stakeholders and, consequently, of development capacity. The same strategy can be found in the openness to different geospatial domains, where the lack of an a priori focus on a specific domain opens the possibility of cross-domain cooperation.
The entire project is driven by an ensemble of components, each of which already has a community that maintains it, documents it and plans its evolution. The role of the Pangeo community is therefore to act as a coordinating body between these communities and to cover broader aspects spanning scientists and engineers, software and computing infrastructure. Having an already well-organised open community is the key to not reinventing the wheel and to decentralising decisions for the future.
To prove the value of the concept, the platform is already in production on different infrastructures (among others AWS, Google Cloud, 2i2c, the JASMIN platform, CNES HPC, IFREMER and many others across almost all continents), which form a federation of Pangeo deployments where users are able to move from one to another without any constraint. Moreover, Pangeo shares many underpinning features with the multitude of platforms that are appearing on the market, which makes it possible to define a minimum level of requirements for being considered part of the Pangeo project while leaving room to expand capabilities according to specific needs.
A decentralized approach, flexibility and openness are key to meeting the challenges that we cannot foresee today and to creating a platform that can be maintained even in changed scenarios that cannot yet be envisioned but that can be influenced by what we are building.
CNES EO Data & Services platform and its integration in French Earth System Research Infrastructure ‘Data Terra’
CNES has started the development of a unified portal for all its Earth observation data with the objective of better serving its users, in particular for transdisciplinary applications. This platform will be built on a common technical base, the CNES 'Platform', which is already under development. The first version is expected at the end of 2022. The platform will provide a single point of access to CNES EO data. It will include a knowledge portal allowing users to discover datasets outside their usual theme and to access all resources (documents, software, training, publications) that facilitate their reuse. It will also offer an access portal allowing advanced distribution of EO data (downloading, interactive processing, Earth Analytics Labs (e.g. PANGEO notebooks, no-code interfaces, different flavors of datacubes, ...)). The implementation of this platform will be accompanied by an improvement of data management practices (generalization of DMPs, uniform application of a CNES data policy, …). Given the very large volume of EO data (several tens of PB), this platform will promote moving the processing to the data platform. In particular, through Earth Data Labs, users will be able to develop their algorithms and processing chains in a context that facilitates scaling up and switching to operational mode. Processing will therefore no longer be carried out on the user's machine but on energy-efficient infrastructures. The first of these is the CNES computing center, but, depending on the case, processing can also take place at external HPC / HPDA centers (for example national HPC centers or the future EuroHPC exascale systems). There are also plans to move closer to public clouds to better serve the private sector. This EO Data & Services Platform is integrated in a bigger one: Data Terra.
Data Terra is a research infrastructure dedicated to Earth System observation data. Created in 2016, it falls within the national roadmap of the French Ministry for Higher Education, Research and Innovation (MESRI). It mobilizes more than 170 full-time equivalents (FTE) per year, distributed over more than 400 people from the 19 partner organizations (CNRS, CNES, IFREMER, IRD, IGN, BRGM, …). This research infrastructure is based on four data hubs covering each of the major compartments of the Earth System: land surfaces (THEIA), atmosphere (AERIS), oceans (ODATIS) and solid Earth (FORM@TER). Each data hub aims to facilitate access to satellite, airborne and in-situ data acquired and managed by research laboratories or federative structures (Universe Science Observatories (OSU), Research Federations, …), by national infrastructures such as National Observing Services (SNO) and Environmental Research Observation and Experimentation Systems (SOERE), and by the oceanographic fleet, aircraft, balloons and space missions. Data Terra is a distributed platform (more than 30 Data and Services Centers). Its backbone is made up of 8 main sites (Brest, Grenoble, Lille, Montpellier, Orléans, Paris, Strasbourg, Toulouse) linked by a high-performance network (GEANT / RENATER) and grid technology (e.g. iRODS or Rucio). The CNES platform is one of them. As with the CNES platform, Data Terra's ambition is to promote transdisciplinary work, beyond the compartments of the Earth system. The aim is to address complex societal demands such as climate change and adaptation to it, natural risks, coastal area monitoring and modelling, ... It will offer the same type of advanced services for data access (visualization, Earth analytics labs, systematic processing of data). Data Terra also aims to be integrated into the data ecosystem in Europe (ENVRI, EOSC, Destination Earth, GAIA-X) and at the international level (GEO, CEOS, RDA, ...).
NASA EOSDIS represents the largest Open Data holdings of Earth Observation (EO) data, currently at nearly 60 PB and expected to grow to over 150 PB over the next several years. As part of NASA’s Open Science initiatives, and more specifically its Open Source Science initiative, NASA EOSDIS has sought out means of improving the discoverability, access, and use of its EO data.
Starting in 2019, NASA began extending download-oriented, on-premises data hosting to directly accessible cloud hosting options. By leveraging cloud-native data formats such as Cloud Optimized GeoTIFFs (COGs), cloud-optimized metadata and access extensions like OPeNDAP’s DMR++ and Zarr, and community-standards-based metadata including the SpatioTemporal Asset Catalog (STAC), NASA’s EOSDIS data is now more readily available and accessible than ever before.
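As a rough illustration of how STAC metadata points clients at cloud-optimized files, the sketch below defines a minimal STAC-like Item and picks out the assets advertised as Cloud Optimized GeoTIFFs by their media type. The item identifier, geometry, date and URLs are placeholders invented for this example and do not refer to real NASA holdings; the helper function is likewise hypothetical.

```python
# Media type commonly used in STAC to advertise a Cloud Optimized GeoTIFF.
COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"

# Hypothetical STAC Item; id, geometry, datetime and hrefs are placeholders.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-granule",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2021-01-01T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {
            "href": "https://example.com/example-granule.tif",
            "type": COG_MEDIA_TYPE,
            "roles": ["data"],
        },
        "thumbnail": {
            "href": "https://example.com/example-granule.png",
            "type": "image/png",
            "roles": ["thumbnail"],
        },
    },
}


def cog_hrefs(stac_item):
    """Return the hrefs of all assets whose media type marks them as COGs."""
    return [
        asset["href"]
        for asset in stac_item["assets"].values()
        if asset.get("type") == COG_MEDIA_TYPE
    ]


urls = cog_hrefs(item)
```

A client that receives such URLs can then read only the byte ranges it needs directly from cloud storage instead of downloading whole files.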
Use of NASA data, for both research and direct application, frequently exploits multiple products, potentially from multiple organizations, in combination with locally collected data. While challenges of data analysis such as alignment (projections, grids, etc.), QA application, and format conversions are not new to this space, the direct accessibility of cloud-hosted data and compute near the data provides previously unachievable patterns for data and platform interoperability. Open-source tooling, such as Pangeo, Dask, and Xarray, coupled with an appropriately designed geospatial data lake and supporting infrastructure, now makes multiproduct, highly scalable research and analysis possible.
This talk explores the technologies, interaction patterns, opportunities, and challenges associated with embracing cutting-edge community standards and how NASA offers open, high-performance access-in-place capabilities for its Earthdata Cloud petabyte-scale geospatial data lake.