The European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) provides a prototype Data Cube for Drought and Vegetation Monitoring. The prototype consists of long-term data records on a regular latitude/longitude grid in CF-compliant netCDF, using a data model consistent with the Copernicus Climate Change Service Climate Data Store, and is provided via THREDDS. Tools to manipulate the data in the cube are provided alongside it.
The prototype explores how well EUMETSAT and its partners can bring together data from multiple sources and multiple grids to lower the barriers to using the data for thematic applications.
The cube was created using the EUMETSAT Data Tailor (https://www.eumetsat.int/data-tailor) and includes parameters for drought and vegetation monitoring: various vegetation parameters (NDVI, Fractional Vegetation Cover, Leaf Area Index, Fraction of Absorbed Photosynthetically Active Radiation), global radiation, direct normalized solar radiation, sunshine duration, land surface temperature, reference evapotranspiration, soil wetness index in the root zone, precipitation, and 2-m air temperature. These data come from the portfolio of EUMETSAT's Satellite Application Facilities (SAFs) on Climate Monitoring (CM SAF), Land Surface Applications (LSA SAF) and Support to Operational Hydrology and Water Management (H SAF), as well as data provided by the Global Precipitation Climatology Centre (GPCC) and ERA5 data from Copernicus / the European Centre for Medium-Range Weather Forecasts (ECMWF). The time period covered by the data records differs, as they have different starting dates; for each record, the earliest available starting date has been used. As this is a static cube, it has a defined end date and is not updated with near-real-time data.
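Because the records are served as CF-compliant netCDF via THREDDS, they can be opened lazily with standard tools. The following is a minimal sketch of that access pattern; the OPeNDAP URL and the variable name are illustrative placeholders, not the actual service endpoint or parameter identifiers.

```python
import xarray as xr

# Hypothetical OPeNDAP URL of one data record in the prototype cube
url = "https://example.eumetsat.int/thredds/dodsC/drought-cube/ndvi.nc"

ds = xr.open_dataset(url)            # lazy open; data are read on demand
ndvi = ds["NDVI"]                    # assumed variable name

# Spatial/temporal subset (assumes latitude stored in descending order)
subset = ndvi.sel(lat=slice(60, 35), lon=slice(-10, 30),
                  time=slice("2000-01-01", "2020-12-31"))
monthly_mean = subset.resample(time="1M").mean()
```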
It takes effort to build such cubes: is it worth it, and what is the future? This presentation reports on the lessons learnt regarding the creation, provision and use of the data cube. We demonstrate the tools provided with the cube, summarize initial user feedback and share our ideas for the future.
The available information on our environment derived from Earth Observation and other sources, such as mathematical models and in situ measurements, has been growing steadily with no apparent slowdown in sight. While this data richness opens up unprecedented research possibilities and allows for a holistic understanding of our planet, it also necessitates new technological approaches for the joint exploitation of numerous data streams. Despite considerable efforts towards standardisation, data formats and models as well as interfaces for data access remain diverse, requiring costly harmonisation solutions to be developed and maintained. The open-source Python package xcube addresses these requirements and offers comprehensive tools for transforming arbitrary data sets into analysis-ready data cubes, as well as a growing suite of tools for their exploitation.
xcube provides a plugin-based data store framework to integrate data sources served via web APIs or from different storage types. By this means, lazy data cube views resembling the Common Data Model, well known from netCDF, are created, greatly facilitating convenient on-the-fly access to large data repositories. Data cubes can be persisted using the xcube Generator, which allows tailored configurations of the transformation process, including the application of arbitrary source code to cube variables. xcube is built on Python's popular data science stack, particularly xarray, dask and zarr, and extends these packages with methods for typical operations on geographical data cubes, for example masking and clipping, also with arbitrary vector shapes.
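As a rough illustration of the store framework and the geometry operations mentioned above, the sketch below opens a Zarr cube from object storage and clips it with a vector shape. It is not a prescribed workflow: the bucket, cube name and polygon are placeholders, and the exact store parameters may differ between xcube versions.

```python
from xcube.core.store import new_data_store
from xcube.core.geom import clip_dataset_by_geometry

# Lazy view on a cube stored as Zarr in an S3 bucket (placeholder location)
store = new_data_store("s3", root="my-bucket/cubes",
                       storage_options={"anon": True})
cube = store.open_data("example-cube.zarr")

# Clip the cube to an arbitrary vector shape (here a simple GeoJSON polygon)
aoi = {"type": "Polygon",
       "coordinates": [[[7.0, 45.0], [11.0, 45.0], [11.0, 47.0],
                        [7.0, 47.0], [7.0, 45.0]]]}
clipped = clip_dataset_by_geometry(cube, aoi)
```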
The xcube ecosystem offers far more than data access and transformation functionality. xcube Server can provide tiles or time series from data cubes, either as images (including via OGC WMTS) or as raw data through an S3-compatible REST API, facilitating integration into existing applications and workflows. The most widely used application is currently xcube Viewer, a web viewer with image and time-series visualisations and the ability to visualise the results of simple processing workflows on the fly. cate, the web-based toolbox of ESA's Climate Change Initiative, also uses xcube in its back end, leveraging the software's outstanding capabilities for handling large, gridded data sets.
We will present several successful activities and real-life examples from different application areas, all of which rely on the xcube ecosystem to reach their objectives. Examples include the Euro Data Cube, which offers generic, automated services to everyone, Brockmann Consult's operational water quality services for various European institutions and companies, and several research activities using xcube for their processing and machine learning tasks. The presentation will conclude with a glimpse of xcube's packed roadmap for the coming months.
Earth System Science (ESS), as the name suggests, adopts a holistic approach to describing, understanding and even predicting the dynamics of the planet's complex system. ESS is truly interdisciplinary, involving numerous natural and social sciences, and, as a quantitative discipline, very data-hungry. The unprecedented growth in available data on our environment from Earth Observation and other sources has made many exciting research approaches possible, but it has also led to an ever-increasing effort required to establish data access and make heterogeneous data streams ready for joint analysis. As a consequence, researchers find themselves confronted with challenging engineering tasks in addition to the scientific challenges.
The “Earth System Data Lab” (ESDL), a recently completed ESA activity, has addressed this issue by offering a comprehensive Earth System Data Cube with tens of relevant variables for ESS. In addition, a virtual laboratory offered a ready-to-use environment for analysis and processing of the data cube. Four sophisticated use cases were successfully implemented, demonstrating the wide range of applications enabled by the ESDL approach. Likewise, a group of Early Adopters from different disciplines implemented self-contained projects, some involving more than 80 variables in the cube, while others used only a small subset of the data on offer.
Besides major technical developments, the main achievement of the project has clearly been its scientific output. Numerous presentations and 14 manuscripts from various researchers had been prepared, submitted or accepted by the end of the contract, and more followed after the end of the activity. This success underlines the scientific potential that can be unleashed by removing major technical obstacles from the research process. At the end of the activity, the lessons learned were clear, also thanks to valuable feedback from the heterogeneous user community. While the strict data cube approach adopted in the ESDL, i.e. one cube with a static grid, has clear advantages for specific empirical approaches, it is too rigid and involves considerable modifications to the original data. Several users therefore asked for customisable cube generation, in terms of the data to be included, the target grid, the pre-processing algorithms to be applied, and other aspects. The long-term perspective was also a frequent question from users facing a decision on an infrastructure for their research. In terms of scientific evolution, the application of state-of-the-art deep learning approaches to ESS questions will clearly be needed in the future, and the service will then need to be optimised to better support such applications. These lessons learned have been favourably received by the Agency and included as requirements in a recently closed tender named Deep ESDL, which will continue the success story of data cubes in Earth System Science.
With a steadily increasing volume of freely available satellite imagery, novel solutions such as Earth Observation Data Cubes (EODC) and Analysis Ready Data (ARD) are growing in importance. These topics are accompanied not only by an evolving ecosystem of open tools and standards, but also by heightened interest from users in exploring the data through dense time series analyses over various spatial scales. While commercial cloud-based platforms can offer analysis on national to global scales, leveraging already available computing resources (e.g. a university's High Performance Computing system) to perform analysis over regional areas of interest will remain of considerable importance for many users in the near future.
Even if the necessary computing resources are available, processing Earth Observation data to a quality level suitable for long time series analysis requires expert knowledge, which is particularly true for Synthetic Aperture Radar (SAR) data. Image artifacts remaining after single-image processing will contaminate time series and are difficult, if not impossible, to remove from multi-dimensional ARD cubes. Additionally, the choice of spatial reference system, pixel grid, resampling, file format and image tiling plays a large role in storing data optimally for efficient access and computation. If these challenges are overcome, however, increased statistical robustness becomes possible for numerous applications, such as the derivation of forest cover [1], the mapping of wetland characteristics [2] and the characterisation of land cover seasonality [3]. Moreover, more complex filtering approaches can be developed that preserve a higher level of spatial detail than previous methods [4].
In particular, the continuously growing SAR user community can greatly benefit from high-quality ARD such as the newly proposed Sentinel-1 Normalised Radar Backscatter (S1-NRB) product, which is intended to be a globally and consistently processed ARD product aligned with the NRB specification [5] proposed by the CEOS Analysis Ready Data for Land (CARD4L) initiative. It offers high-quality, radiometrically enhanced SAR backscatter data as well as ancillary data layers, conversion layers for different backscatter conventions and extensive metadata. Furthermore, the S1-NRB product builds on recent technological developments such as Cloud Optimized GeoTIFF (COG) and the SpatioTemporal Asset Catalog (STAC).
The availability of an ARD product can greatly accelerate data preparation, but users still face various challenges when analysing multi-temporal data cubes. During the analysis of SAR time series, for example, the question of time series composition eventually arises regardless of the application. Acquisition characteristics, such as orbit and incidence angle, often limit possible mapping applications. Depending on the acquisition orbit, the time series of an individual pixel can include measurements situated in the near, mid or far range of each SAR scene in the stack. These effects need to be accounted for during analysis, and a trade-off between temporal density and variability is therefore often inevitable.
As part of assessing the quality and handling of the S1-NRB product, the variability of backscatter time series was systematically quantified over different land cover classes for a regional-scale area, with the aim of guiding future SAR data cube users in choosing data suited to their individual use cases. Different combinations of, amongst others, acquisition orbit, track and frame were investigated by computing multi-temporal statistics and quantifying their differences. We intend to present the most important results of this study and, furthermore, to show how a collection of S1-NRB scenes can easily be accessed as an on-the-fly data cube.
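One possible way to build such an on-the-fly cube from STAC-listed, COG-based S1-NRB scenes is sketched below. The catalogue URL, collection id, asset key and the availability of the "sat:relative_orbit" property are assumptions for illustration only, not confirmed details of how the product is actually distributed.

```python
import stackstac
from pystac_client import Client

catalog = Client.open("https://example.org/stac")          # placeholder endpoint
search = catalog.search(
    collections=["sentinel-1-nrb"],                        # hypothetical id
    bbox=[10.0, 50.0, 12.0, 51.5],
    datetime="2021-01-01/2021-12-31",
    query={"sat:relative_orbit": {"eq": 44}},              # restrict to one track
)
items = list(search.items())

# Lazy, dask-backed cube of the COG backscatter assets (asset key assumed)
cube = stackstac.stack(items, assets=["vv"], epsg=32632, resolution=20)
temporal_mean = cube.mean(dim="time")                      # multi-temporal statistic
```

Restricting the search to a single relative orbit, as in the sketch, is one way to trade temporal density for reduced incidence-angle variability in the stack.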
[1] Heckel, K., Urban, M., Schratz, P., Mahecha, M.D., & Schmullius, C. (2020). Predicting Forest Cover in Distinct Ecosystems: The Potential of Multi-Source Sentinel-1 and -2 Data Fusion. Remote Sensing, 12. https://doi.org/10.3390/rs12020302.
[2] Slagter, B., Tsendbazar, N.-E., Vollrath, A., & Reiche, J. (2020). Mapping wetland characteristics using temporally dense Sentinel-1 and Sentinel-2 data: A case study in the St. Lucia wetlands, South Africa. International Journal of Applied Earth Observation and Geoinformation, 86. https://doi.org/10.1016/j.jag.2019.102009.
[3] Dubois, C., Mueller, M.M., Pathe, C., Jagdhuber, T., Cremer, F., Thiel, C., & Schmullius, C. (2020). Characterization of Land Cover Seasonality in Sentinel-1 Time Series Data. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, V-3-2020, 97-104. https://doi.org/10.5194/isprs-annals-V-3-2020-97-2020.
[4] Cremer, F., Urbazaev, M., Berger, C., Mahecha, M.D., Schmullius, C., & Thiel, C. (2018). An Image Transform Based on Temporal Decomposition. IEEE Geoscience and Remote Sensing Letters, 15, 537-541. https://doi.org/10.1109/LGRS.2018.2791658.
[5] CEOS (2021). Analysis Ready Data For Land: Normalised Radar Backscatter. Version 5.5. https://ceos.org/ard/files/PFS/NRB/v5.5/CARD4L-PFS_NRB_v5.5.pdf.
The global deterioration of the quality of the air we breathe is a pressing concern for citizens, scientists and policymakers, as it has been identified as a major threat to health and climate. The scientific community agrees that poor air quality is responsible for around 9% of deaths worldwide each year. In this context, understanding the physicochemical dynamics of trace gases and air pollutants, analysing emissions from both human and natural processes, and continuously monitoring ambient pollutant concentrations are key tasks for achieving better ambient air quality, as called for by the United Nations 2030 Agenda through the Sustainable Development Goals (SDGs), e.g. Goals 3, 11 and 13, which directly address well-being, sustainable communities and climate action.
Nowadays, measurements from both ground and satellite sensors are employed in air quality monitoring and analysis. Combining these data sources, rather than relying solely on traditional ground-sensor observations, has allowed scientists to study air pollution in areas where ground sensors are not present. However, combining ground and satellite observations in air quality studies can be challenging due to the time and effort required to integrate and manipulate generally large volumes of heterogeneous data. Innovative data exploitation tools that support these intensive handling and computational operations have now reached a level of maturity that allows for complex environmental analysis tasks, such as the concurrent use of ground and satellite sensor observations in air quality monitoring.
An example of the above is the data cube. Data cubes are infrastructures designed to store multi-layered data and provide them to the user in a homogeneous format. When data are organized in and accessed through data cubes, the time and effort spent on integration and preprocessing is drastically reduced, allowing users to concentrate on post-processing and analysis. One of the most popular data cube implementations is the Open Data Cube (ODC, https://www.opendatacube.org). The ODC provides facilities that act as an intermediary layer between satellite Earth Observation (EO) data and end users. It is open-source software released under the Apache 2.0 license, consisting of a set of Python tools that help users explore and interact with satellite data. The software provides command-line applications, built-in statistical analysis tools, a Web User Interface, a graphical data explorer and support for Jupyter Notebooks, which serve as an interface for developing custom applications. Furthermore, the ODC supports the Open Geospatial Consortium (OGC) standards for data publishing, allowing integration into most geospatial software frameworks. Currently, more than 100 countries are developing national data cube platforms based on the ODC; successful implementations include Digital Earth Australia (https://www.ga.gov.au/dea), Digital Earth Africa (https://www.digitalearthafrica.org), the Swiss Data Cube (https://www.swissdatacube.org) and the Vietnam Open Data Cube (http://datacube.vn).
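In a Jupyter Notebook, data indexed in an ODC instance are typically loaded through the Python API. The following is a minimal sketch assuming an already configured ODC instance; the product name, spatial extent and resolution are hypothetical.

```python
import datacube

dc = datacube.Datacube(app="air-quality-example")
ds = dc.load(
    product="s5p_no2_l2",                 # assumed product name
    x=(8.5, 11.5), y=(44.7, 46.7),        # approximate Lombardy bounding box
    time=("2020-01-01", "2021-04-14"),
    output_crs="EPSG:4326",
    resolution=(-0.05, 0.05),
)
```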
The ODC is a tool originally designed to aid access to and analysis of satellite data. One of its main drawbacks is that current deployments only ingest Analysis Ready Data (ARD) products from a few satellite platforms, such as Sentinel-1, Sentinel-2 and Landsat-8. To empower ODC applications with additional satellites and, eventually, ancillary data sources, development work is required to adapt ODC routines. With this in mind, the present work proposes the design and implementation of an ingestion pipeline that systematically indexes non-ARD data from satellites currently not supported by the ODC, as well as ancillary data such as ground-sensor observations. The practical use of the developed ODC implementation is then tested by computing correlations between ingested satellite and ground-sensor observations.
As a case study, this work focuses on air quality data from Sentinel-5P (one of the most recent Earth Observation platforms of the European Copernicus Programme, providing estimates of air pollutants with daily global coverage) and traditional geolocated time series provided by air quality ground stations. The selected study area is the Lombardy region (Northern Italy), one of Europe's most densely inhabited areas, which suffers from severe air pollution. The region is also a pollution hotspot due to its particular micro-climatic characteristics, which include wind channelling along the Po River valley and frequent thermal inversions in mountainous areas that prevent pollutants from dispersing properly in the lower atmosphere. Air quality ground observations for the Lombardy region are provided by the Lombardy Regional Environmental Protection Agency (ARPA), which manages the local authoritative environmental sensor network. Nitrogen dioxide (NO2) was selected as the target pollutant, as tropospheric estimates are provided by Sentinel-5P and concentration records are available from the ARPA sensors. Furthermore, NO2 emissions in the lower atmosphere are mainly connected to combustion processes from domestic heating, transportation and industrial activities, which are widespread in the study area.
An example correlation analysis between the ARPA Lombardia and Sentinel-5P NO2 data was performed exclusively on the generated ODC products, as follows. The datasets were extracted from the ODC using the Python xarray library and combined into a single pandas DataFrame. The analysis covered the period from January 1st 2020 to April 14th 2021. Correlation coefficients were computed for co-located time series of Sentinel-5P and ground-sensor observations. The results demonstrated a strong positive correlation between the measurements: the Pearson correlation coefficient (rp) had a mean above 0.7, with a similar Spearman correlation coefficient (rs). To complement the satellite and ground-sensor correlation, wind speed measurements were also integrated into the ODC, with the objective of better understanding the dynamics of NO2 under different meteorological conditions. The wind dataset was obtained from measurements made by the weather stations of the ARPA Lombardia network. Correlation between Sentinel-5P and ARPA Lombardia wind speed measurements was weakly positive (rp = 0.2). Consequently, alternative time periods were tested (e.g. calculating average wind speed at the study points over 12-hour periods). Additionally, seasonality was removed from the time series, slightly improving the overall correlation.
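The co-located correlation step itself is straightforward once the satellite and ground series share a common time axis. The sketch below is purely illustrative: the toy values are made up and the column names stand in for the series actually extracted from the ODC.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical daily, co-located values at one ground-station location
df = pd.DataFrame({
    "s5p_no2":    [2.1e-5, 3.4e-5, 2.8e-5, 4.0e-5],   # satellite column values
    "ground_no2": [18.0, 30.5, 25.1, 38.2],            # station concentrations
}).dropna()

rp, _ = pearsonr(df["s5p_no2"], df["ground_no2"])      # Pearson coefficient
rs, _ = spearmanr(df["s5p_no2"], df["ground_no2"])     # Spearman coefficient
print(f"Pearson r = {rp:.2f}, Spearman r = {rs:.2f}")
```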
On the one hand, the results of this work demonstrate the feasibility of integrating non-ARD data into the ODC using the developed Python-ODC pipeline. On the other hand, the numerical experiments conducted reinforce the importance of developing geospatial software architectures capable of managing heterogeneous data formats, with a particular focus on satellite and ground observations, which are critical for analysing complex phenomena such as air quality.
Future work will aim to integrate additional air quality metrics and meteorological datasets not considered in this study; these data will be ingested into the ODC to complement the existing cube layers. In parallel, the operational use of the offered tools and data will be examined in collaboration with local stakeholders, including ARPA Lombardia. Questions about the computing infrastructure needed to create and publish the ODC instance will also be addressed, ensuring that end users have remote access to an extensive amount of data and analytical tools.