Weather-related hazards are ubiquitous around the world, including 1) hurricane storm surges, 2) rapid snowmelt and heavy rainfall, 3) severe weather leading to flash floods, and 4) seasonal freeze and thaw of rivers that may lead to ice jams. Each of these hazards affects human settlements and has the potential to impact agricultural productivity. In each setting, end users in disaster management need access to data processing tools that help map past and current disasters. Analysis of past events supports risk mitigation by building an understanding of what has already occurred and how to alleviate those impacts in the future. Having the capability to generate the same products in a response setting means that lessons learned from risk analysis carry forward to event response. Synthetic aperture radar (SAR) data from ESA’s Sentinel-1 mission are particularly useful for these activities due to their all-weather, 24/7 monitoring capabilities, global and regular sampling strategy, and free-and-open availability.
Here we present HydroSAR, a scalable cloud-based SAR data processing service for the rapid generation of hazard information for hydrological disasters such as flooding, storm surge, hail storms, and other severe weather events. The HydroSAR project is led by the University of Alaska Fairbanks in collaboration with the NASA Alaska Satellite Facility (ASF) DAAC and the NASA Marshall and Goddard Space Flight Centers. As part of this project we developed a series of SAR-based value-added products that extract information about surface hydrology (geocoded image time series, change detection maps, flood extent information, flood depth maps) and flood impacts on agriculture (crop extent maps, inundated agriculture information) from Sentinel-1 SAR data.
To enable automatic, low-latency generation and distribution of these products, HydroSAR utilizes a number of innovative software development, cloud computing, and data access technologies. HydroSAR is fully implemented in the Amazon Web Services (AWS) cloud, resulting in a high-performance and scalable production environment capable of continental-scale data production. The service is co-located with ASF’s AWS-based Sentinel-1 data holdings to accelerate data access, reduce data migration, and avoid egress costs. The HydroSAR implementation heavily leverages ASF’s cloud-support services such as OpenSARLab (cloud-based algorithm development) and the ASF Hybrid Pluggable Processing Pipeline (HyP3; cloud acceleration and elastic scaling). HydroSAR is fully open source: all HydroSAR technology is Python based and accessible via GitHub, and HydroSAR products are stored in publicly accessible AWS S3 cloud storage. To ease integration into GIS environments and other web services, HydroSAR products are served out via REST endpoints and ArcGIS Image Services.
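Since the products live in publicly accessible S3 storage, they can be read directly over HTTP with standard geospatial tooling. The sketch below assumes a hypothetical product URL and that the product is a GeoTIFF readable by GDAL/rasterio; actual bucket names and paths are published by the HydroSAR project.

```python
# Minimal sketch: read a HydroSAR flood-extent raster directly from cloud
# storage with rasterio (GDAL range requests under the hood).
# The URL below is a hypothetical placeholder.
import rasterio

product_url = "https://hydrosar-products.s3.amazonaws.com/example/flood_extent.tif"  # hypothetical

with rasterio.open(product_url) as src:
    print(src.crs, src.bounds, src.count)
    subset = src.read(1, window=((0, 512), (0, 512)))  # read only a small window
    print("subset shape:", subset.shape)
```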
In our presentation we will introduce the SAR-based products that were developed for this effort. We will describe our approach for product calibration/validation and provide metrics for product performance. We will outline the cloud-based production pipeline that was built to automatically generate our data products across large spatial scales. We will demonstrate the benefit of ArcGIS image services for data sharing and data integration. Finally, we will showcase the integration of HydroSAR products into disaster response activities for a number of recent disasters including the 2021 U.S. hurricane season, the 2021 Alaska spring breakup flooding, and the 2021 flood season in Eastern India, Bangladesh, and Nepal.
The latest developments in big cloud computing environments, supporting ever-increasing volumes of Earth Observation (EO) data, present exciting new possibilities in the field of EO. Nevertheless, the paradigm shift of deploying algorithms close to the data sources also comes with new challenges in terms of data handling and analysis methods. The Euro Data Cube (EDC) was developed to address these challenges by facilitating large-scale computing of EO data, providing a platform for flexible processing from the local to the continental scale. Not only does the EDC offer a workspace with customisable levels of computational resources, it also provides on-demand, cloud-optimised, fast access to tens of petabytes of satellite archives from open and commercial sources, with proven scalability. The platform also offers a growing ecosystem of inter-compatible modules supporting users in all steps of data analysis, processing and visualisation.
To showcase the potential of the EDC, we will present how the platform supported the development of algorithms for a spatio-temporal drought impact monitoring system in Europe. Using the new Copernicus Land Monitoring Service (CLMS) High Resolution Vegetation Phenology and Productivity (HR-VPP) data, we produced indices on vegetation dynamics as a response to soil moisture deficit. The exercise was run at 10 m resolution for the period 2017-2020 over Europe. Owing to the large volume of the vegetation and soil moisture anomaly time series (0.6 PB), the datasets could not be processed using local computing resources. Simply adding more computing resources, or even using a High Performance Computing (HPC) facility, would not solve the problem either, given the large volumes of regularly updated data that would need to be transferred between clouds. The use of the EDC deployed on CreoDIAS, which also hosts the HR-VPP data as part of the WEkEO infrastructure, made it possible to produce efficient, transparent, repeatable and ad-hoc updates of drought impact indicators, which are an essential input to the European Environment Agency in support of European policy making.
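As a hedged illustration of the kind of vegetation anomaly indicator described above (not the actual EDC implementation), the xarray sketch below computes a monthly standardized anomaly from a productivity time series; file and variable names are assumptions for illustration.

```python
# Conceptual sketch: standardized vegetation anomaly as a drought impact proxy.
# File and variable names are hypothetical placeholders.
import xarray as xr

vpp = xr.open_dataset("hr_vpp_productivity_2017_2020.nc", chunks={"time": 12})  # dask-backed
prod = vpp["productivity"]

clim_mean = prod.groupby("time.month").mean("time")
clim_std = prod.groupby("time.month").std("time")

# z-score per month of year; negative values indicate productivity below the
# multi-annual norm (potential drought impact)
anomaly = (prod.groupby("time.month") - clim_mean).groupby("time.month") / clim_std
anomaly.to_netcdf("vegetation_anomaly.nc")
```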
The use case presented here is just one example of how the EDC can be used to monitor continental phenomena at high spatial and temporal resolutions. Starting from this example, we will present the EDC technology stack and the services at users’ disposal to develop and deploy large-scale algorithms that can, in turn, be triggered as on-demand algorithm executions in a Software-as-a-Service style. We will also discuss the data scientist’s workflow and the adjustments needed to turn a typical remote sensing exercise into a scalable process.
Many global and continental scale mapping projects today have to solve the same problems: finding an infrastructure with the required datasets, setting up a scalable processing system to produce results, configuring preprocessing workflows to generate Analysis Ready Data (ARD), and then scaling from test runs to the final continental or global map. After that, appropriate metadata needs to be generated, followed by a dissemination phase in which products are made ready for viewing. openEO Platform solves most of these problems, but how does that work out in a real-world scenario?
To demonstrate the efficiency and usability of openEO Platform (https://openeo.cloud), we have generated a European crop type map at 10 m resolution. The map is based on both Sentinel-1 and Sentinel-2 data, which makes it a suitable blueprint for many other mapping projects. The feature engineering capabilities of the platform transform these data into a set of training features, which are then sampled across Europe to build a training and validation dataset. The built-in classification processes are used to train a model, which is subsequently applied to generate the final map, delivered as a set of Cloud Optimized GeoTIFFs with SpatioTemporal Asset Catalog (STAC) metadata. The final product will be validated against harmonized LPIS (Land Parcel Identification System) data, depending on availability at country level.
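As a hedged illustration of the kind of feature engineering the platform supports (not the exact crop-map workflow), the sketch below loads Sentinel-2 data through the openEO Python client and derives monthly NDVI composites; the collection ID, extents and band names are illustrative assumptions.

```python
# Hedged sketch of openEO-based feature engineering; collection ID, extents,
# bands and output format are assumptions for illustration.
import openeo

connection = openeo.connect("openeo.cloud").authenticate_oidc()

s2 = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 4.0, "south": 50.0, "east": 5.0, "north": 51.0},
    temporal_extent=["2021-03-01", "2021-10-31"],
    bands=["B04", "B08", "SCL"],
)

# Example feature: monthly mean NDVI composites as classifier inputs
ndvi = s2.ndvi(red="B04", nir="B08")
monthly_ndvi = ndvi.aggregate_temporal_period(period="month", reducer="mean")

monthly_ndvi.execute_batch("ndvi_features.nc", out_format="netCDF")
```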
In this presentation we will give an overview of the full product R&D and production lifecycle. We report on efficiency gains and potential remaining bottlenecks for users. This gives a good overview of platform capabilities, but also of the overall maturity with respect to being production-ready for continental-scale mapping efforts. For the large-scale production of the map, processing capacity at VITO, CreoDIAS and EODC will be used; these sites hold large local data archives, extended with data on Sentinel Hub accessed via Euro Data Cube. This federation of data and processing capacity is made transparent by openEO Platform. A built-in large-scale data production component is responsible for distributing the workload across the available infrastructure. This aims to show that openEO Platform has reached a maturity level that allows users to engineer an entire ML pipeline, from data to final classification, at large scale.
The WorldCereal project aims to develop an efficient, agile and robust Earth Observation-based system for timely global crop monitoring at field scale. The resulting platform is a self-contained, cloud-agnostic, open-source system.
Global agricultural monitoring requires a solution to an increasingly complex computational problem; high-performance computing therefore becomes a critical resource, which is why the need for a processing cluster arose. The project delivers an open-source EWoC system which automatically ingests and processes Sentinel-1, Sentinel-2 and Landsat 8 time series in a seamless way to develop an agro-ecological zoning system and to disseminate high-value-added products to users through the visualization portal in the Cloud Optimized GeoTIFF (COG) format.
The system can be installed on a cluster driven by Kubernetes; the cluster itself may be hosted by a provider of choice (CreoDIAS, AWS, etc.).
A set of open-source tools has been used to create the processing framework; Python is mostly used to connect the different processors and run them. To produce results, the cluster is first scaled up to the desired number of nodes, i.e. interconnected Linux machines working as a single system. Whenever processing is needed (at the end of a major growing season or on a user’s request), a workflow started by the Argo tool schedules pods over the previously spawned nodes. These pods are self-contained applications (shipped as Docker containers) that produce high-value-added EO products such as crop masks, crop type maps and irrigation maps. The pre-processing chains for Sentinel-1, Sentinel-2 and Landsat 8 run first, in parallel, to provide enough data for the classification processor, which in turn produces the WorldCereal products.
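For readers unfamiliar with this fan-out/fan-in pattern, the following is a purely conceptual, single-machine Python sketch; in the EWoC system the orchestration is handled by Argo on Kubernetes, and all function names here are hypothetical placeholders.

```python
# Conceptual sketch: parallel pre-processing feeding a single classification
# step. In production this is orchestrated by Argo on Kubernetes; the
# functions below are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def preprocess(mission: str, tile: str) -> str:
    # placeholder for the per-mission pre-processing chain (S1, S2, Landsat 8)
    return f"{mission}_{tile}_ARD"

def classify(ard_paths: list) -> str:
    # placeholder for the WorldCereal classification processor
    return f"crop_type_map_from_{len(ard_paths)}_inputs"

tile = "31UFS"
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(preprocess, mission, tile) for mission in ("S1", "S2", "L8")]
    ard = [f.result() for f in futures]   # fan-in: wait for all pre-processing

product = classify(ard)
print(product)
```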
At the center of the system is a PostgreSQL database used to organize pending tasks and upcoming productions. This precise tracking provides resilience: in case of a process failure, re-processing is automatically scheduled by Argo.
Information about the processing of the results, as well as the S3 bucket paths to those results, is stored in the database, which grows over time. The progress of a running workflow can be monitored on a map showing the tiles that cover the area being processed.
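As a hedged illustration of this kind of task bookkeeping (not the actual EWoC schema), a minimal sketch with psycopg2 and a hypothetical tasks table might look as follows.

```python
# Hedged sketch of task bookkeeping in PostgreSQL; the connection settings,
# table schema and values are hypothetical, not the actual EWoC design.
import psycopg2

conn = psycopg2.connect("dbname=ewoc user=ewoc host=localhost")  # hypothetical DSN
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS tasks (
        id SERIAL PRIMARY KEY,
        tile_id TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        result_s3_path TEXT
    )
""")
conn.commit()

# Mark a finished tile and record where its products were written
cur.execute(
    "UPDATE tasks SET status = %s, result_s3_path = %s WHERE tile_id = %s",
    ("done", "s3://ewoc-products/31UFS/cropmap.tif", "31UFS"),
)
conn.commit()
```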
The goal of the cluster itself is to simplify parallel pre-processing and classification using multiple data sources as input. In case of a node failure (e.g. spot-instance termination on AWS, or a hardware failure on CreoDIAS), the system recovers as soon as a new node becomes available.
The current system demonstration instance, hosted on a DIAS, benefits from the extensive amount of Earth Observation data already available there. This synergy between the cloud provider and the associated data offers the perfect operational context for our cloud-based system. For additional data resources, the EWoC system seamlessly interfaces with external providers through its own dedicated data retrieval abstraction layer. This last component, which overcomes some limitations of the DIAS data offer, could as open-source software also benefit other projects.
The EWoC system is a demonstration of an operational, open-source, cloud-based processing platform from which other future projects can benefit through its many reusable components.
Big data always presents new challenges, which can lead to innovation and potential improvements. At EOX we recently reached a breaking point and have reworked our EOxCloudless (https://cloudless.eox.at) data processing infrastructure almost from scratch. For those unfamiliar with the label: this is the data used to generate our satellite map layer known as “Sentinel-2 cloudless” (https://s2maps.eu).
# Splitting the responsibilities - Kubernetes
Autoscaling is a tough problem, and we know that managing large compute clusters is a responsibility we should delegate to our engineering team. Using Kubernetes (https://kubernetes.io/de/docs/concepts/overview/what-is-kubernetes/) as the core abstraction to interface between our teams was an undisputed decision, based on our experience in other operational deployments of EOxHub such as “Euro Data Cube” (https://eurodatacube.com/), “Earth Observing Dashboard” (https://eodashboard.org/), and “Rapid Action on coronavirus and EO” (https://race.esa.int/). Our EOxHub technology, which mediates on top of Kubernetes between platform resources on different clouds, enables us both to handle the workloads and to automatically scale the infrastructure based on processing needs.
# Splitting the workload - dask
With only a handful of options for handling concurrent/parallel workflows in Python, we have opted for the meanwhile well-established “dask” (https://dask.org/). For local processing we have adapted “dask Futures”, which extend native Python’s “concurrent.futures” (https://docs.python.org/3/library/concurrent.futures.html), to our use case. With the help of “dask.distributed” (https://distributed.dask.org/en/latest/) this runs a huge number of tasks concurrently, both locally and on a cluster.
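A minimal sketch of this futures pattern with dask.distributed (the processing function is a placeholder):

```python
# Minimal dask.distributed futures sketch; the per-tile function is a placeholder.
from dask.distributed import Client

def process_tile(tile_id):
    # placeholder for the actual per-tile mosaicking task
    return f"processed {tile_id}"

client = Client()  # local cluster; pass a scheduler address to use a remote one
futures = client.map(process_tile, ["t1", "t2", "t3"])
results = client.gather(futures)  # futures are collected as tasks complete
client.close()
```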
# Geographical splitting/tiling of the workload - tilematrix
What is a “task” in this scenario? With “dask futures” it can be any arbitrary Python function that runs and returns a result; the corresponding future is collected upon completion of each task.
For splitting geospatial data we have written our own tiling package for regular matrices called “tilematrix” (https://github.com/ungarj/tilematrix). Apart from the default EPSG:4326 and EPSG:3857 grids, it allows defining custom grids in any common coordinate reference system. These grids provide a tile structure following a classical tiling scheme {zoom}/{row}/{column} ({z}/{y}/{x}.{extension}). This is similar to the “OGC Two Dimensional Tile Matrix Set” (https://docs.opengeospatial.org/is/17-083r2/17-083r2.html).
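To illustrate the convention (this is deliberately not the tilematrix API, only the arithmetic behind the default EPSG:4326 grid, which has two 180°x180° tiles at zoom 0):

```python
# Illustration of a {zoom}/{row}/{col} scheme on the default EPSG:4326 grid
# (2x1 tiles at zoom 0). Not the tilematrix API, just the underlying convention.
def tile_bounds(zoom: int, row: int, col: int):
    tile_size = 180.0 / (2 ** zoom)            # tile width/height in degrees
    left = -180.0 + col * tile_size
    top = 90.0 - row * tile_size
    return left, top - tile_size, left + tile_size, top  # (west, south, east, north)

print(tile_bounds(0, 0, 0))  # (-180.0, -90.0, 0.0, 90.0)
print(tile_bounds(1, 0, 0))  # (-180.0, 0.0, -90.0, 90.0)
```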
# Mapchete meets dask
How do you run a mosaicking script handling thousands of Sentinel-2 scenes distributed over the globe?
Having done exactly this for several years with different tools even before this update, for us it only took adding the dask know-how into the mix. Yet another Python package originating from EOX, called “mapchete” (https://github.com/ungarj/mapchete), was extended with the “dask.distributed” package, opening it to the dask environment and giving it a direct connection to the cluster options offered by the dask community.
# Sentinel-2 data access optimization
Apart from extended concurrency options, getting hold of a huge amount of Sentinel-2 data can be somewhat costly, difficult, and time-consuming. To shed some light on this, let’s look at some of the options that people interested in similar processing might consider.
Sentinel-2 data access:
1. https://scihub.copernicus.eu/
2. https://www.copernicus.eu/en/access-data/dias
Sentinel-2 data on Amazon S3 (AWS S3):
1. https://registry.opendata.aws/sentinel-2/
2. https://registry.opendata.aws/sentinel-2-l2a-cogs/
Let’s start with some basics. Apart from the regular access routes to Sentinel-2 data, AWS offers under its “open data” policy the datasets we need to produce our annual global mosaic, so we opt for AWS as the cloud and data provider. From the first AWS dataset (1.) we can get all the metadata we need to merge the granule products into datastrips (acquisition strips), as well as the angles and geometries for the angular (BRDF) correction that compensates for the west-east imbalance in Sentinel-2 L2A sen2cor values. The disadvantage of this first dataset (1.) is that most data requests currently fall under “requester-pays” settings, which for a couple of billion requests incurs a non-negligible cost.
Thankfully the second AWS dataset (2.) currently has no such restriction, and it has one more advantage: the Sentinel-2 data there are stored as Cloud Optimized GeoTIFFs (COGs), which significantly improves GDAL reading speed, and requests to the AWS S3 storage are free. Both AWS S3 endpoints provide “SpatioTemporal Asset Catalog” (STAC) accessibility, which makes search and browse extremely convenient once the STAC structure and tools have been understood (https://stacspec.org/).
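As a hedged example, a STAC search against the public Sentinel-2 L2A COG catalogue on AWS might look like the following with pystac-client; the endpoint and collection name reflect the public Earth Search catalogue at the time of writing, and the bounding box, dates and asset key are illustrative.

```python
# Hedged sketch of a STAC search over the Sentinel-2 L2A COGs on AWS.
from pystac_client import Client

catalog = Client.open("https://earth-search.aws.element84.com/v0")
search = catalog.search(
    collections=["sentinel-s2-l2a-cogs"],
    bbox=[13.0, 47.0, 14.0, 48.0],
    datetime="2021-06-01/2021-08-31",
    query={"eo:cloud_cover": {"lt": 20}},
)
for item in search.items():
    print(item.id, item.assets["B04"].href)  # red-band COG for each matching scene
```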
The data are read and BRDF-corrected by custom mapchete extensions, which take care of all the metadata transformation, reading, reprojection, and corrections. After that they are submitted to the mosaicking process, which applies the masks and finally evaluates the time series based on the provided settings to deliver a homogeneous, cloudless mosaic output.
# Scale up with Dask-Kubernetes
Finally it comes down to scaling up the processing from the local environment to something much bigger. Each task needs to be processed not on one server or Virtual Machine (VM), but on an arbitrary dask worker within the Kubernetes cluster.
The advantage of this is that the dask cluster supports adaptive dask-worker scaling, which also directly controls Kubernetes resource scaling: cloud resources are requested and launched only when needed and shut down when idle.
“dask-kubernetes” (https://kubernetes.dask.org/en/latest/) is paired with the so-called “dask-gateway” (https://gateway.dask.org/) to handle hardware specifications and, optionally, to define different tiers of users so that not everyone can consume all the resources.
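A hedged sketch of how a client requests an adaptively scaled cluster through dask-gateway; the gateway address and worker limits are illustrative.

```python
# Sketch of requesting an adaptively scaled cluster via dask-gateway.
from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.com")  # hypothetical address
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=100)  # workers (and nodes) scale with demand

client = cluster.get_client()
# ... submit mapchete/dask tasks via `client` ...
client.close()
cluster.close()
```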
For overall management and configuration of Kubernetes, a Flux (https://fluxcd.io/) setup has been put in place to handle top-level control of the Kubernetes environment.
# Use the output data
Lastly, let's see how we can use the generated data. The output data, organized as GeoTIFF tile directories, can be read by the GDAL STACTA (SpatioTemporal Asset Catalog Tiled Assets) driver (https://gdal.org/drivers/raster/stacta.html), which allows users to get, convert, or subset the data with GDAL directly from object storage.
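A minimal, hedged sketch of such a read with GDAL’s Python bindings is shown below; the URL is a hypothetical placeholder, and the exact connection string (e.g. whether the /vsicurl/ or STACTA: prefix is needed) depends on the GDAL version and configuration.

```python
# Hedged sketch: open a STACTA-described tile directory with GDAL's Python
# bindings. The URL is a hypothetical placeholder; remote access may require
# the /vsicurl/ prefix depending on GDAL setup.
from osgeo import gdal

ds = gdal.Open("/vsicurl/https://example.com/eoxcloudless/item.json")  # hypothetical
if ds is not None:
    print(ds.RasterXSize, ds.RasterYSize, ds.RasterCount)
    # convert/subset directly from object storage
    gdal.Translate("subset.tif", ds, projWin=[10.0, 48.0, 11.0, 47.0])
```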
Furthermore, with some of the recent updates to OpenLayers (https://openlayers.org/), GeoTIFFs organized as STACTA can be loaded directly in the browser, as shown in the example at https://openlayers.org/en/latest/examples/cog-pyramid.html. This opens the option of all sorts of interactive rendering and analysis of the original 16-bit data values directly in the browser.
# Conclusion
By the time of the Living Planet Symposium (LPS 2022) in May 2022, the next update of the EOxCloudless satellite map for the year 2021 will be available at https://s2maps.eu/. With that at hand, we will be able to present the results of an actual global benchmark of this updated big data processing workflow.
The presentation of this work will include the workflow as described here and a live demonstration, partly as a tutorial.
We present here the recent progress of the DACCS (Data Analytics for Canadian Climate Services) project [1], funded by the Canada Foundation for Innovation (CFI) and various provincial partners in Canada, in response to a national cyberinfrastructure challenge. The project’s development phase is set to finish in 2023, while funding for maintenance is planned over ten years. DACCS leverages previous development efforts on PAVICS [7], a virtual laboratory facilitating the analysis of climate data without the need to download it. A key aspect of DACCS is to develop platforms and applications that combine both climate science and Earth observation, in order to stimulate the creation of architectures and cyberinfrastructures shared by both.
In climate sciences, it is now common to manipulate very large, high-dimensional volumes of data derived from climate models, such as the CMIP5 and soon the CMIP6 datasets. In Earth Observation (EO), the concepts of time series and Analysis Ready Data (ARD) also lead us to manipulate multidimensional data cubes. Consequently, the two domains share a common set of requirements, especially in terms of visualization services and deployment of applications close to the data. The overall philosophy of the platform is to provide a robust backend with interoperable services and the ability to deploy processing applications close to the data (A2D).
The proposed platform architecture adopts some of the characteristics of the EO Exploitation Platform Common Architecture (EOEPCA) [12], including user access, data discovery, user workspaces, process execution, data ingestion and interactive development. This architecture describes A2D use cases where a user can select, deploy and run applications on remote infrastructures, usually where the data reside. An Execution Management Service (EMS) transmits such a request to an Application Deployment and Execution Service (ADES) [8].
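As a hedged illustration (the endpoint, process identifier and inputs are hypothetical), an OGC API - Processes execute request such as an EMS might forward to an ADES looks roughly like this:

```python
# Hedged illustration of an OGC API - Processes execute request to an ADES.
# Endpoint, process identifier and inputs are hypothetical placeholders.
import requests

ades = "https://ades.example.org/ogc-api"   # hypothetical ADES endpoint
process_id = "sar-calibration"              # hypothetical deployed application

response = requests.post(
    f"{ades}/processes/{process_id}/execution",
    json={"inputs": {"scene": "S1A_IW_GRDH_example"}},  # hypothetical input
    headers={"Prefer": "respond-async"},                # request asynchronous execution
)
print(response.status_code, response.headers.get("Location"))  # job status URL
```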
Python notebooks are offered as the main user interaction tool. This way, advanced users can keep the agility of Python objects, all while offering the option to call demanding A2D tasks via widgets. Beyond the technical contribution, the platform offers an opportunity to explore synergies between the two scientific domains, for example to create hybrid machine learning models [6].
The platform ingests Sentinel-1 & 2 and RADARSAT Constellation Mission (RCM) images in order to create multi-temporal and multi-sensor data cubes. The EO ingestion is performed by Sentinel Application Platform (SNAP) [13] workflows packaged as applications inside Docker containers [11]. The goal is to give researchers the ability to create data cubes on the fly via Python scripts, with an emphasis on replicability and traceability. The resulting data cubes will be stored in NetCDF format along with the appropriate metadata to document data provenance and guarantee replicability of the processing. Workflows and derived products will be indexed in a STAC catalog. The platform aims to facilitate the deployment of typical processing tasks, including standard ingest processing (e.g. calibration, mosaicking), as well as advanced machine learning tasks.
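As a hedged sketch of such STAC indexing with pystac (identifiers, geometry, provenance fields and asset paths are placeholders):

```python
# Hedged sketch: index a derived data cube as a STAC Item with pystac.
# Identifiers, geometry, provenance fields and asset paths are placeholders.
from datetime import datetime
import pystac

item = pystac.Item(
    id="rcm-s1-s2-cube-example",
    geometry={"type": "Polygon", "coordinates": [[[-73.7, 45.4], [-73.4, 45.4],
                                                  [-73.4, 45.6], [-73.7, 45.6],
                                                  [-73.7, 45.4]]]},
    bbox=[-73.7, 45.4, -73.4, 45.6],
    datetime=datetime(2021, 7, 1),
    properties={"processing:workflow": "snap-docker-v1"},  # hypothetical provenance field
)
item.add_asset("cube", pystac.Asset(href="s3://bucket/cube.nc",
                                    media_type="application/netcdf"))

catalog = pystac.Catalog(id="daccs-datacubes", description="Derived EO data cubes")
catalog.add_item(item)
catalog.normalize_and_save("catalog", catalog_type=pystac.CatalogType.SELF_CONTAINED)
```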
Aside from EO use cases, the DACCS project should advance the climate analytics capabilities of PAVICS. The platform readily provides access to several data collections, ranging from observations to climate projections and reanalyses. PAVICS relies on the Birdhouse framework [9], a key component of the Copernicus Climate Data Store [10]. PAVICS is the backbone of the analytical capabilities of ClimateData.ca, a portal created to support and enable Canada’s climate change adaptation planning and improve access to relevant climate data.
ClimateData.ca provides access to a number of climate variables and indices, either computed from CMIP5 and CMIP6 projections downscaled over Canada [3] or from historical gridded datasets [4]. Variables can be visualized on maps or graphs, either per grid cell or as spatial aggregates such as administrative regions, watersheds or health regions. ClimateData.ca also provides up-to-date meteorological station data, the standardized precipitation evapotranspiration index (SPEI) [5], as well as sea-level rise projections.
The portal’s Analyze page [14] allows users to select grid cells or regions and to set the thresholds, climate models, RCPs and percentiles they would like to use for the analysis. User queries are then sent to the PAVICS node through OGC API - Processes. Computations are performed by xclim [2], a library of functions to compute climate indices from observations or model simulations. The library is built on xarray and benefits from the parallelization handling provided by dask.
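As a minimal illustration of the kind of xclim computation behind such queries (the input file and variable name are placeholders):

```python
# Minimal xclim sketch; input file and variable name are placeholders.
import xarray as xr
import xclim

ds = xr.open_dataset("tasmax_CMIP6_subset.nc", chunks={"time": 365})  # dask-backed
# Annual count of days with daily maximum temperature above 30 degC
hot_days = xclim.atmos.tx_days_above(tasmax=ds.tasmax, thresh="30.0 degC", freq="YS")
hot_days.to_netcdf("tx_days_above_30C.nc")
```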
References
[1] Landry, T. (2018). "Bridging Climate and Earth Observation in AI-Enabled Scientific Workflows on Next Generation Federated Cyberinfrastructures", AI4EO session, the ESA Earth Observation Φ-week, November 2018, ESA-ESRIN, Frascati, Italy.
[2] Logan, T. et al. (2021): xclim - a library to compute climate indices from observations or model simulations. Online. https://xclim.readthedocs.io/en/stable/
[3] Cannon, A.J., S.R. Sobie, and T.Q. Murdock (2015): “Bias Correction of GCM Precipitation by Quantile Mapping: How Well Do Methods Preserve Changes in Quantiles and Extremes?”, Journal of Climate, 28(17), 6938-6959, doi:10.1175/JCLI-D-14-00754.1.
[4] McKenney, Daniel W., et al. (2011): "Customized spatial climate models for North America." Bulletin of the American Meteorological Society, 92.12: 1611-1622.
[5] Tam BY, Szeto K, Bonsal B, Flato G, Cannon AJ, Rong R (2018): “CMIP5 drought projections in Canada based on the Standardised Precipitation Evapotranspiration Index”, Canadian Water Resources Journal 44: 90-107.
[6] Requena-Mesa, Christian, et al. (2020): "EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts." arXiv preprint arXiv:2012.06246, https://www.climatechange.ai/papers/neurips2020/48
[7] The PAVICS platform. Online. https://pavics.ouranos.ca/
[8] Landry et al. (2020): “OGC Earth Observation Applications Pilot: CRIM Engineering Report”, OGC 20-045, http://www.opengis.net/doc/PER/EOAppsPilot-CRIM
[9] C. Ehbrecht, T. Landry, N. Hempelmann, D. Huard, and S. Kindermann (2018): “Projects Based in the Web Processing Service Framework Birdhouse”, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci., XLII-4/W8, 43–47
[10] Kershaw, Philip, et al. (2019): "Delivering resilient access to global climate projections data for the Copernicus Climate Data Store using a distributed data infrastructure and hybrid cloud model." Geophysical Research Abstracts. Vol. 21.
[11] Gonçalves et al. (2021): “OGC Best Practice for Earth Observation Application Package”, OGC 20-089, Candidate Technical Committee Vote Draft, Unpublished
[12] Beavis P., Conway, R. (2020), A Common Architecture Supporting Interoperability in EO Exploitation, ESA EO Phi-Week 2020, Oct. 1st 2020. https://eoepca.org/articles/esa-phi-week-2020/
[13] Zuhlke et al. (2015), “SNAP (Sentinel Application Platform) and the ESA Sentinel 3 Toolbox”, Sentinel-3 for Science Workshop, Proceedings of a workshop held 2-5 June, 2015 in Venice, Italy.
[14] ClimateData.ca Analyze page: https://climatedata.ca/analyze/