The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualisations and narrative text. Uses within the Earth Observation community include data cleaning and transformation, numerical simulation, statistical modelling, data visualisation, machine learning, and much more. A Jupyter Notebook allows you to combine rich documentation, live and adaptable code, and data visualisations. It can also be used as a tool to share your data analysis with others, collaborate, teach, and promote reproducible science.
We are at a particularly exciting time for this technology, with many archives deploying Jupyter Notebook services. These services provide unprecedented access to petabytes of data, allowing users from any part of the globe to engage with EO data in a very powerful way. Jupyter Notebooks produced during a research project can often be the best starting point for new users to engage with data deposited with an archive; however, this raises unique challenges. While Jupyter Notebooks can be a valuable resource, there are issues surrounding input data, processing, technical dependencies, and quality. Poor-quality notebooks with hidden dependencies may cause new users a lot of problems.
To address these issues, CEOS (Committee on Earth Observation Satellites) conducted a number of surveys and ran webinars on Jupyter Notebooks to gain a better understanding of the EO community's needs. We engaged with over 500 people from over 50 countries, and two core needs for the wider community became evident. The first was the need for a Jupyter Notebooks best practice document to support the creation and preservation of high-quality, reusable notebooks. The second was the need for basic training to get the next generation of researchers ready to engage with emerging services.
We will discuss in greater detail the following key areas to be addressed by a CEOS Jupyter Notebooks Best Practice (a minimal illustration of some of these conventions appears after the list):
• Notebook description and function
• Structure, workflow, and documentation
• Technical dependencies and Virtual Environments
• Citation of input data and data access
• Association with archived data
• Incorporation with data cubes
• Version control, preservation and archival
• Open-source software licensing
• Publishing software and getting a DOI
• Interoperability and reuse on alternate platforms
• Creating a binder deployment
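To make a few of these areas concrete, the sketch below shows how the opening cell of a notebook might capture its description, technical dependencies, and input data citation. It is an illustrative convention only, not a CEOS-mandated template; the title, pinned versions, and dataset reference are all hypothetical.

```python
# --- Notebook description and function (illustrative convention, not a CEOS template) ---
# Title:    NO2 change over selected cities, 2019 vs 2020
# Purpose:  Demonstrates time-series extraction and plotting from a gridded NO2 product.
# Author:   Jane Doe, 2022-03-01
#
# --- Technical dependencies (pinned so the environment can be rebuilt, e.g. for Binder) ---
# python=3.9, xarray=0.21, matplotlib=3.5   (also captured in environment.yml)
#
# --- Citation of input data and data access ---
# Dataset: Sentinel-5P Level-3 NO2 (hypothetical example); archive landing page and DOI cited here.

import xarray as xr              # gridded data handling
import matplotlib.pyplot as plt  # visualisation
```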
From recent CEOS WGCapD (Working Group on Capacity Building and Data Democracy) and WGISS (Working Group on Information Systems and Services) meetings, we have seen how many different CEOS agencies are employing Jupyter Notebooks in several different ways. To introduce the broader community to these capabilities, we developed a set of demonstrators that take users through a technical arc of what is currently possible, beginning with simple baseline notebooks with integrated training materials and progressing to notebooks that drive heavy-duty processing on the Earth Analytics Interoperability Lab.
Jupyter Hub and Notebooks on Data Analysis Platforms: We looked at two examples from the UK’s JASMIN Jupyter Notebook service, which can access over 20 petabytes of data on the CEDA archive. We first explored the global Sentinel-5P archive and demonstrated how a very basic notebook can use the data to answer valuable questions, e.g. how did pollution levels change in large cities during the Covid-19 pandemic? We also looked at a smaller-scale specialist example, regional NCEO biomass maps. This helped to demonstrate how, in addition to helping users obtain domain-specific information from data with Jupyter Notebooks, we can also help them learn technical knowledge and skills related to libraries, modules, and shapefiles.
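As a flavour of what such a basic notebook does, the sketch below extracts and plots an NO2 time series for a single city with xarray. It assumes the Sentinel-5P data have already been aggregated into a Level-3 NetCDF file; the file name, variable name, and city coordinates are illustrative rather than taken from the JASMIN service.

```python
import xarray as xr
import matplotlib.pyplot as plt

# Hypothetical Level-3 file of monthly-mean tropospheric NO2 columns.
ds = xr.open_dataset("s5p_no2_monthly_l3.nc")

# Select the grid cell nearest to a city of interest (London used as an example).
no2_city = ds["tropospheric_no2_column"].sel(lat=51.5, lon=-0.1, method="nearest")

# Compare the pre-pandemic year with 2020 to see how pollution levels changed.
no2_city.sel(time=slice("2019-01-01", "2020-12-31")).plot()
plt.title("Monthly mean tropospheric NO2, London (illustrative)")
plt.show()
```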
Open Data Cube and Google Earth Engine – A Jupyter Notebook Sandbox Demonstration: The Open Data Cube (ODC) Google Sandbox is a free and open programming interface that connects users to Google Earth Engine datasets. This open-source tool allows users to run Python application algorithms using Google’s Colab Notebook environment. This demonstration showed two examples of Landsat applications focused on scene-based cloud statistics and historic water extent. Basic operation of the tool will support unlimited users for small-scale analyses and training but can also be scaled in size and scope with Google Cloud resources to support enhanced user needs.
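The scene-based cloud statistics example can be sketched with the public Earth Engine Python API roughly as below. This is a generic illustration rather than the Sandbox's own notebook, and it assumes the user has already authenticated in Colab; the area of interest and date range are arbitrary.

```python
import ee

ee.Initialize()  # assumes ee.Authenticate() has already been run in the Colab session

# Illustrative area of interest and date range.
aoi = ee.Geometry.Point(-122.3, 37.8)
scenes = (
    ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
    .filterBounds(aoi)
    .filterDate("2020-01-01", "2020-12-31")
)

# Scene-based cloud statistics from the CLOUD_COVER metadata property of each Landsat scene.
cloud_cover = scenes.aggregate_array("CLOUD_COVER").getInfo()
print(f"{len(cloud_cover)} scenes, mean cloud cover {sum(cloud_cover) / len(cloud_cover):.1f}%")
```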
ESA PDGS (European Space Agency -- Payload Data Ground Segment) Data Cube and Time Series Data: The ESA PDGS Data Cube is a pixel-based access service that enables human and machine-to-machine interfaces for Heritage Missions (HM), Third-Party Missions (TPM) and Earth Explorer (EE) datasets handled at the European Space Agency. The pixel-based access service provides the users with advanced retrieval capabilities, such as time series extraction, data subsetting, mosaicking, band combinations, and index generation (e.g. normalized difference vegetation index (NDVI), anomalies, and more) directly from the EO-SIP packages with no need for data duplication or data preparation.
The ESA PDGS Data Cube service provides both a web-based Explorer user interface (https://datacube.pdgs.eo.esa.int) and a Jupyter Notebook environment (https://jupyter.pdgs.eo.esa.int) that allow users to import, write, and execute code that runs close to the data. This demonstration showcased how to retrieve a soil moisture time series using the Jupyter environment in order to generate thematic maps (monthly anomaly maps) over an area of interest. The benefit of the pixel-based service with respect to traditional access services in terms of resource usage was also highlighted.
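The PDGS Data Cube exposes its own retrieval interface, which is not reproduced here; the sketch below only illustrates the monthly-anomaly step itself with xarray, assuming a soil moisture subset has already been retrieved over the area of interest. File and variable names are hypothetical.

```python
import xarray as xr

# Hypothetical soil moisture subset with dimensions (time, lat, lon),
# e.g. as returned by a time-series extraction over the area of interest.
sm = xr.open_dataset("soil_moisture_subset.nc")["soil_moisture"]

# Long-term monthly climatology ...
climatology = sm.groupby("time.month").mean("time")

# ... and monthly anomalies: each time step minus the long-term mean for its month.
anomalies = sm.groupby("time.month") - climatology

# Thematic map of the anomaly for one month of interest.
anomalies.sel(time="2020-06").mean("time").plot(cmap="RdBu")
```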
Earth Analytics Interoperability Lab – Big Data Processing: The CEOS Earth Analytics Interoperability Lab (EAIL) is a platform for CEOS projects to test interoperability in a live Earth Observation (EO) ecosystem. EAIL is hosted on Amazon Web Services and includes facilities for Jupyter Notebooks, scalable compute infrastructure for integrated analysis, and data pipelines that can connect to new and existing CEOS data discovery and access services. This demonstration showed how we use Jupyter Notebooks with the Python Dask library to perform large-scale analyses (tens of gigabytes) efficiently, with interactive plotting and scalable compute resources in EAIL.
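A minimal sketch of the Dask pattern described is shown below, assuming a Dask cluster (or at least a local one) is reachable from the notebook. The dataset path, variable names, and chunk sizes are illustrative and not taken from EAIL.

```python
import xarray as xr
from dask.distributed import Client

# Connect to a Dask cluster; with no arguments a local cluster is started for the notebook.
client = Client()

# Open a large dataset lazily, split into chunks that workers process in parallel.
ds = xr.open_dataset("large_eo_stack.nc", chunks={"time": 10, "y": 2048, "x": 2048})

# Build the computation graph lazily, then trigger distributed execution with .compute().
ndvi = (ds["nir"] - ds["red"]) / (ds["nir"] + ds["red"])
ndvi_mean = ndvi.mean("time").compute()
```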
Going forward, there is a great deal of interest in collaborating on and developing these activities further. We will discuss how we will create baseline notebooks aimed at developing key EO data science skills and serving as exemplars for the best practice. We anticipate holding a CEOS Jupyter Notebooks day later in 2022, the aim of which will be to stimulate other agencies and organisations to produce similar resources that will benefit students and early-career researchers, enabling them to engage with the Jupyter Notebook services emerging globally.
The increase in available open Earth Observation data, the shift from expert users to multi-disciplinary non-expert users, the emergence of cloud-based services, and the shift from data to analytics, including Artificial Intelligence, are key trends in the Earth Observation and Space sectors, and they require a new, systematic approach to building up the necessary capacities and skills. Europe's vision of a strong European data space and key policies such as the European Green Deal strongly depend on having enough trained professionals with adequate technical and data skills who are able to turn Big Earth data into knowledge that informs policies and decision-making. Jupyter notebooks have become the de facto standard for data scientists and are a great tool for facilitating data-intensive training. However, educators need to integrate didactical concepts, instructional design patterns, and best practices for coding when using notebooks for teaching. Defining and implementing best practices for using Jupyter notebooks in EO is pivotal in this regard, especially as the use of Jupyter notebooks in the EO sector grows rapidly.
Since 2019, we have developed the Learning Tool for Python (LTPy), a Jupyter-based training course on open satellite- and model-based data on atmospheric pollution and climate, with the aim of building up data, technical, and thematic competencies. LTPy features eleven different datasets from principal European satellite missions, including the Copernicus satellites Sentinel-3 and Sentinel-5P and the European polar-orbiting Metop-A/B/C satellites with the GOME-2 and IASI instruments operated by EUMETSAT, as well as data from the Copernicus Atmosphere Monitoring Service implemented by ECMWF.
LTPy makes use of different components of the Jupyter ecosystem, including a dedicated Jupyterhub training platform, and its structure is aligned with a typical data analysis workflow, with modules on data access, data discovery, case studies and exercises.
In this talk, we would like to share our experiences using Jupyter notebooks in more than 13 in-person and online courses and training events, through which we have reached over 650 Earth Observation practitioners so far. We would further like to share a set of best practices we developed, which make Jupyter notebooks more 'educational', more reproducible, and more useful overall.
This presentation focuses on how the Ellip Solutions from Terradue provide Jupyter Notebooks tailored for reproducible and portable Earth Observation (EO) application packages that access large EO data collections while respecting the FAIR principles. We will address the major scalability and operational deployment shortcomings of JupyterHub/JupyterLab and how they were tackled to provide a processor development environment and an operational production flow for the Earth Sciences community.
Nowadays, JupyterLab is easily deployable on distributed cloud-native resources such as Kubernetes, and several organizations and platforms have started to include this as part of their service offering. Technically, this deployment includes an instance of JupyterHub that can then spawn JupyterLab instances based on container images.
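This pattern can be sketched with the open-source KubeSpawner in a jupyterhub_config.py. The snippet below is a generic illustration, not the Ellip configuration, and the image name and resource limits are invented for the example.

```python
# jupyterhub_config.py -- generic illustration of JupyterHub spawning JupyterLab pods on Kubernetes
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Each user session is spawned as a Kubernetes pod from a JupyterLab container image.
c.KubeSpawner.image = "registry.example.org/eo-jupyterlab:latest"  # illustrative image name
c.KubeSpawner.default_url = "/lab"   # start JupyterLab rather than the classic notebook UI
c.KubeSpawner.mem_limit = "8G"       # illustrative per-user resource limits
c.KubeSpawner.cpu_limit = 2
```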
The default out-of-the-box installation provides limited tooling: a notebook environment and a plain text editor. Dedicated kernels can be configured to run thematic and/or scientific Python libraries (e.g. GDAL, GeoPandas, NumPy, SciPy). In the EO context, data scientists often rely on toolboxes such as SNAP or OTB to process the EO data. Typically, these toolboxes require a large amount of disk space for the required libraries and dependencies. There are several strategies for managing this, each with benefits and drawbacks. One is to pre-install them in the JupyterLab base container image, which often leads to very large container images and to version locking. The other is to provide mechanisms to install these toolboxes as part of the kernels; this leads to larger user workspaces (several gigabytes) and, as these are often persisted in cloud block storage, to high service costs when the user base grows. Finally, ensuring that a notebook is reproducible and shareable must be one of the main drivers, and this is not available out of the box.
Starting from this problem statement, we decided to provide a JupyterLab service with advanced tooling. Firstly, a more advanced Integrated Development Environment (IDE) that offers the developer comfort of modern local IDEs (code completion, linting, problem detection, compilers, etc.) but runs alongside JupyterLab. The solution for this was found in Theia, a free and open-source IDE framework for desktop and web applications.
Secondly, by providing tooling to access a larger storage space using object storage (e.g. S3), our solution allows persisting test or reference EO datasets as well as processing or experiment results.
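As a rough indication of what such tooling enables from within a notebook, the sketch below reads from an S3-compatible object store with s3fs; it is a generic example rather than the Ellip tooling itself, and the bucket and object names are invented.

```python
import s3fs
import xarray as xr

# Connect to the object store; credentials are assumed to be injected into the workspace environment.
fs = s3fs.S3FileSystem(anon=False)

# List experiment results persisted in a bucket (bucket and prefix are illustrative).
print(fs.ls("my-eo-workspace/results/"))

# Open a reference dataset directly from object storage without copying it into the user workspace.
with fs.open("my-eo-workspace/reference/scene.nc", "rb") as f:
    ds = xr.open_dataset(f)
```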
Thirdly, by providing a container engine that can pull and run existing containers that include the EO toolboxes, we ease their utilisation, together with access to modern open science techniques for developing portable and reproducible workflows using the Common Workflow Language (CWL). CWL is the workflow standard chosen by the OGC to package EO applications, making them runnable in different execution scenarios ranging from a local PC to massively distributed computing resources such as Kubernetes clusters or HPC.
Lastly, in our solution, notebooks may be transformed by dedicated tooling into self-contained executables packaged in a container. Once packaged in a container, these notebooks can be deployed and invoked from external applications (e.g. OGC API Processes).
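The dedicated packaging tooling itself is not shown here; as one common open-source way to achieve the same pattern, papermill can execute a parameterized notebook headlessly as the entry point of a container. The notebook names and parameters below are illustrative.

```python
import papermill as pm

# Execute a parameterized notebook headlessly, e.g. when the container is invoked
# by an external service such as an OGC API Processes endpoint.
pm.execute_notebook(
    "water_extent.ipynb",          # input notebook (illustrative)
    "water_extent_run_001.ipynb",  # executed copy with outputs, kept for traceability
    parameters={"bbox": [10.0, 45.0, 11.0, 46.0], "start_date": "2021-06-01", "end_date": "2021-06-30"},
)
```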
Terradue will present its solution for providing access to an advanced deployment of JupyterHub and JupyterLab that addresses the identified problems. The Ellip Studio Solution provides an advanced cloud-based environment for writing reproducible and portable EO application packages and allows these to be run against large EO collections. The presentation can be complemented with a dedicated training session with hands-on EO application integration exercises.
EODASH is an open-source software project (https://github.com/eurodatacube/eodash) of the European Space Agency, serving the RACE - Rapid Action on Covid-19 and EO (https://race.esa.int) and the EO Dashboard (https://eodashboard.org) web applications. The two platforms, developed in partnership with the European Commission (RACE) and with NASA and JAXA (EO Dashboard), aim to provide satellite-informed indicators of societal, environmental, and economic impacts. Initially developed to support research on the ongoing Covid-19 pandemic using Earth Observation data, the projects provide two public platforms whose geographic focus is European (RACE) and global (EO Dashboard) respectively.
Leveraging the power of the Euro Data Cube (EDC, https://eurodatacube.com) on top of which these two applications have been developed, the initiatives enabled the development of a large number of exploratory R&D activities and community engagement, in an Open and Reproducible Science approach powered by Jupyter Notebooks.
Here we dive deeper into how Jupyter was used in the frame of RACE and EO Dashboard for experimentation and visualisation in the cloud, looking at: i) the EOxHub Workspace with the managed EDC JupyterLab that enables scripting and execution of Jupyter Notebooks; ii) how Jupyter Notebooks supported reproducibility and enabled ad-hoc teams to easily craft computational narratives on top of the indicator data and EO data from RACE and EO Dashboard during the EODashboard Hackathon and RACE Challenges; iii) the process of building indicator production pipelines in the EDC, algorithm packaging, and headless execution of notebooks.
We moreover discuss best practices, challenges, and limits of Jupyter Notebooks in the context of reproducible science, as well as some of the ways forward for the EODASH project as an Open Science and educational resource.
# EOxHub Workspace
The EDC EOxHub Workspaces (https://eurodatacube.com/marketplace/infra/edc_eoxhub_workspace) offer a managed JupyterLab instance with curated base images ready to kick off EO workloads. The offering provides different flavours of computational resources, a network file system for persistent data storage, and a high-speed network connection to run installed Jupyter Notebooks and user-deployed applications.
# RACE Challenges and EO Dashboard Hackathon
The EO Dashboard Hackathon, organised in June 2021, celebrated the one-year anniversary of the EO Dashboard's launch and built on the success of the Space Apps COVID-19 Challenge (https://covid19.spaceappschallenge.org). During the week-long event, over 4000 participants from 132 countries formed 509 virtual teams and attempted to solve 10 challenges related to the Covid-19 pandemic using data from the EO Dashboard. Challenge topics included air quality, water quality, economic impact, agricultural impact, greenhouse gases, interconnected Earth system impacts, and social impact. During the hackathon, participants had the opportunity to form virtual teams, interact with experts from NASA, ESA, and JAXA in dedicated chat channels, and submit projects. The participants had access to preconfigured personal EDC EOxHub Workspaces, including a hosted JupyterLab, to run their Python notebooks.
The same technical setup was employed for the RACE Challenges (https://eo4society.esa.int/race-dashboard-challenges-2021/), a series of data science competitions launched by ESA with the purpose to get participants engaged with the RACE Dashboard, its data, and computational resources, so they can process and combine EO and non-EO data to develop new ways of monitoring the impacts of the pandemic.
At ESA, we believe that novel Earth observation (EO) missions, in combination with open data access and performant open-source software, have the power to bring the benefits of technological advancement to every aspect of our global society and the environment. ESA's seventh Earth Explorer mission, the BIOMASS mission, will for example provide crucial information about the state of our forests, how they are changing, and the role they play in the global carbon cycle. This mission is designed to provide, for the first time from space, P-band Synthetic Aperture Radar measurements to determine the amount of biomass and carbon stored in forests. BIOMASS is the first of ESA's Earth Explorer missions to be supported by an entirely open scientific process and best practices for open source science to accelerate scientific discovery. We propose that the BIOMASS mission platform, data, and algorithm activities may be adopted as a lived, transparent, inclusive, accessible, and reproducible blueprint for open source scientific best practices applied to all future Earth Explorer missions.
The mission is accompanied by an open collaborative platform for accessing and sharing data, scientific algorithms, and computing resources for open EO science, the Multi-Mission Algorithm and Analysis Platform (MAAP), and by an open source software project allowing open, collaborative development of mission processing algorithms, the BIOMASS mission product Algorithm Laboratory (BioPAL). Openly developing and sharing new tools, in combination with computing resources, makes it possible to include the scientific user community early and has the potential to accelerate the development of new EO data products and foster scientific research conducted by EO data users. To this end, the open source science model presents a pathway to fostering a collaborative community supporting scientific discovery, by giving the user community more influence on product development and evolution. Integrating open source software development practices and standards into common satellite mission science practices and algorithm development also has the potential to address the challenge of the timely evolution of operational satellite algorithms, to improve EO product quality, and to foster a mission science software lifecycle resilient to arrivals and departures within large, distributed teams.
In this talk I will outline common open source science principles and collaborative development best practices, presenting a new pathway to fostering scientific discovery through open collaboration within ESA's Earth Explorer mission science. Using the example of ESA's BIOMASS mission, I will give practical guidance on the integration of open source science principles into current mission science and introduce BIOMASS mission activities serving as a blueprint for future open source mission science activities.
The Alaska Satellite Facility (ASF) maintains the archives of Synthetic Aperture Radar (SAR) datasets held by NASA. As part of ASF's mission to improve access to SAR datasets, we have developed the JupyterHub-based platform, OpenSARlab. Hosted alongside the ASF archives in AWS, OpenSARlab allows low-latency, programmatic access and manipulation of data directly in the cloud. It is an open-source, deployable service, which is easily customizable to suit the needs of a given user group. As the ASF Distributed Active Archive Center (DAAC), our focus is on Synthetic Aperture Radar (SAR) data, so the ASF DAAC version of OpenSARlab includes a wide array of tools used by the SAR community.
OpenSARlab provides users with Jupyter Notebook and JupyterLab computing environments that contain a collection of tools suited to their specific needs. Whether collaborating on a project or enrolled in a class, users can get to work quickly, with minimal setup and the assurance that all their colleagues and classmates are operating in identical environments. This ensures that valuable time is not wasted debugging software installations, and final processing results are reproducible.
OpenSARlab users are authenticated and have persistent storage volumes so they can leave and return to their work without losing their progress or having to download anything. Science code developers working in OpenSARlab have the flexibility to further customize their workflows by installing additional software and creating their own conda environments.
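A minimal sketch of that kind of customization, run from a notebook cell, is shown below; the environment name and package list are illustrative, and OpenSARlab may provide its own helper scripts for the same task.

```python
# Run in a notebook cell: create a personal conda environment in persistent storage
# and register it as a selectable Jupyter kernel (names and packages are illustrative).
!conda create -y -n my_sar_env python=3.10 gdal numpy ipykernel
!conda run -n my_sar_env python -m ipykernel install --user --name my_sar_env --display-name "Python (my_sar_env)"
```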
OpenSARlab is ideal for hosting large, virtual training sessions since it is in the cloud and can scale to accommodate any number of simultaneous users. We regularly provide OpenSARlab deployments to host such events, some of which do not have a SAR focus, and all of which have varying needs in terms of software, compute power, memory, and storage.
When a class ends and an OpenSARlab deployment is retired, we offer scaled-down Docker images to users, allowing them to create single-user versions of the same JupyterLab or Jupyter Notebook environments used in the class on their local computers.
Co-authors:
Alex Lewandowski, Kirk Hogenson, Rui Kawahara, Tom A Logan, Eric Lundell, Rebecca Miller, Tim Stern, Franz J Meyer