1. Abstract
Like many other research fields, remote sensing has been greatly impacted by machine and deep learning and benefits from technological and computational advances. In recent years, considerable effort has been spent on deriving not just accurate, but also reliable modeling techniques. In the particular framework of image classification, this reliability is validated, e.g., by checking whether the confidence in the model prediction adequately describes the true certainty of the model when confronted with unseen data. We investigate this reliability in the framework of classifying satellite images into different land cover classes. More precisely, we use the So2Sat LCZ42 data set [1], comprised of Sentinel-1 and Sentinel-2 image pairs. These were classified into 17 categories by a team of two labelers, following the Local Climate Zone (LCZ) classification scheme.
As a novelty, we make explicit use of the so-termed evaluation set that was additionally produced by the authors of the LCZ42 data set. In this supplementary study, a subset of the initial data was re-labeled by 10 remote sensing experts, who independently of one another re-cast their label votes for each satellite image. The resulting sets of label votes contain a notion of the human uncertainty associated with the underlying satellite images. In the following, we explicitly incorporate this uncertainty into the training process of a neural network classifier and investigate its impact on model performance. We also assess the reliability defined above and compare it to a more common modeling approach, which uses a single ground truth label derived from the majority vote of the individual expert label votes.
2. Methodology
The 17 LCZs describe the urbanization of cities and comprise 10 classes related to built-up areas (urban classes) and 7 classes related to the surrounding land cover (non-urban classes). The evaluation data set, which we use for modeling in the following, covers 10 European cities as well as additional areas from around the globe that were added for class balancing. A total of ca. 250,000 Sentinel-1 and Sentinel-2 image pairs are included, with 10 spectral bands and 8 statistics derived from the VV-VH dual-Pol SLC Sentinel-1 data. Each image is of size 32 by 32 pixels and covers an area of 320 m by 320 m. For simplicity, we focus our analysis on the Sentinel-2 data only. Accompanying each satellite image, 10 individual expert label votes are provided. These votes are aggregated for each image by forming the empirical distribution over the different classes. As a result, we obtain a distributional label that stores the information from the individual label votes. Additionally, we store the majority vote of the experts for each image, which serves as a pseudo ground truth label. In case of a tie, the label cast by the two initial labelers was also considered for determining the majority vote. Due to the overall high rate of agreement among the voters within the non-urban classes, only the images associated with the urban classes are considered for modeling the distributional labels in the following.
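A minimal sketch of this aggregation step (function and variable names are illustrative, not taken from the original implementation):

    import numpy as np

    def aggregate_votes(votes, n_classes=17, initial_label=None):
        # votes: the 10 integer class votes of the experts for one image
        counts = np.bincount(votes, minlength=n_classes)
        distribution = counts / counts.sum()              # empirical label distribution
        winners = np.flatnonzero(counts == counts.max())
        if len(winners) > 1 and initial_label is not None and initial_label in winners:
            majority = initial_label                      # tie broken by the initial label
        else:
            majority = int(winners[0])
        return distribution, majority

    # e.g., 10 expert votes for one image
    dist, mode = aggregate_votes(np.array([2, 2, 2, 5, 5, 2, 2, 5, 2, 2]))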
As a result, the human uncertainty is now encoded in the derived label distribution. To integrate this information into the classification task, two main changes are made to an existing deep neural network. First, the usual one-hot encoded labels are replaced by the computed distributional labels. Second, the typical cross-entropy loss is replaced by the Kullback-Leibler (KL) divergence, which better reflects, from an information-theoretic perspective, the task of approximating the ground truth distribution formed by the label votes. Training is performed as usual by backpropagating the loss through the network. For evaluating the predictive uncertainty of the model, we investigate the so-called expected calibration error (ECE). The ECE is derived by comparing the model confidence (i.e., the highest predicted class probability) with the corresponding accuracy on the hold-out test set. The discrepancies between the two quantities can further be visualized in a 2D bar plot called a reliability diagram.
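The two changes amount to only a few lines in practice. A minimal TensorFlow sketch, with a stand-in backbone rather than the actual architecture used in this work:

    import tensorflow as tf

    NUM_CLASSES = 10  # urban LCZ classes retained for the distributional labels

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(32, 32, 10)),         # 32x32 Sentinel-2 patches, 10 bands
        tf.keras.layers.Conv2D(32, 3, activation="relu"),  # placeholder backbone
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    # Distributional labels replace the one-hot targets; the loss becomes
    # KL(p_label || p_model) instead of the cross-entropy.
    model.compile(optimizer="adam", loss=tf.keras.losses.KLDivergence())
    # model.fit(x_train, y_train_dist, ...) with y_train_dist holding one
    # empirical label distribution per image (each row summing to 1)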
3. Experiments & Results
We use the benchmark model for the data set from a previous study [2], in which the authors found it to be superior to many common Convolutional Neural Network (CNN)-based architectures. This benchmark model, termed Sen2LCZ, builds on a combination of conventional convolutional blocks, the fusion of multiple intermediate deep features, and double pooling. Our implementation results in a network depth of 17 and uses dropout after the second and third block. The evaluation data set was split into geographically separated training and testing sets; the latter was further randomly split into validation and testing data.
Two implementations of the benchmark model were evaluated in order to identify the impact of explicitly modeling the human uncertainty in the labels. The classical approach employed one-hot encoded labels based on the majority vote of the label votes together with the typically used cross-entropy loss. The modified model, in contrast, utilized the distributional labels described above as well as the KL divergence as loss. Apart from that, identical architectures, hyperparameters and training setups were applied, and the usual performance metrics were derived on the same held-out test set for both implementations.
As a first key result, all metrics, including overall accuracy, average accuracy (both macro and weighted) as well as the kappa score, improved by at least 1 percentage point when using the distributional labels. Note that for deriving these metrics in the presence of distributional labels, the majority vote (i.e., the mode of the distributional label) was taken as ground truth, and a prediction was counted as correct if this ground truth matched the class with the highest predicted probability. Furthermore, the cross-entropy between the predicted probabilities and the ground truth one-hot labels was reduced by ca. 20% on the test set by training with distributional labels. The central result of this work can moreover be seen in the accompanying visualization, which shows the reliability diagrams of the two implementations: the expected calibration error was cut by more than half by incorporating the label distributions, and overconfidence was avoided. The average confidence matches the overall accuracy, and the two quantities remain closely aligned over almost the entire confidence spectrum.
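For reference, the ECE reported here can be computed with a standard binning scheme (bin count and implementation details are our choice):

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=10):
        # probs: (n, n_classes) predicted probabilities; labels: (n,) majority votes
        conf = probs.max(axis=1)                  # model confidence
        correct = (probs.argmax(axis=1) == labels).astype(float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (conf > lo) & (conf <= hi)
            if mask.any():                        # weight each bin by its share of samples
                ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
        return ece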
4. Conclusion
The last reported result shows the clear advantage of integrating label uncertainty into the training process of a neural network for the task of classifying satellite images into LCZs. Moreover, the integration is superior to classical calibration methods, as it also led to improved model performance metrics and a reduced loss on the test set. The derivation and implementation of the distributional labels are straightforward and easy to use. As a main outcome, we would like to emphasize the large improvement in the calibration of the predictive distribution. In particular, the predicted probabilities of the model using the distributional labels can be soundly interpreted and adequately reflect the uncertainty in the prediction.
References:
[1] Zhu, X. X., Hu, J., Qiu, C., Shi, Y., Kang, J., Mou, L., Bagheri, H., Hua, Y., Huang, R., Hughes, L.H., Li, H., Sun, Y., Zhang, G., Han, S., Schmitt, M., Wang, Y. (2020). So2Sat LCZ42: A benchmark data set for the classification of global local climate zones. IEEE Geoscience and Remote Sensing Magazine (GRSM), 8(3), 76-89.
[2] Qiu, C., Tong, X., Schmitt, M., Bechtel, B., & Zhu, X. X. (2020). Multilevel feature fusion-based CNN for local climate zone classification from sentinel-2 images: Benchmark results on the So2Sat LCZ42 dataset. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 2793-2806.
Understanding regional carbon dioxide (CO₂) surface fluxes is an important problem in climate science. The usual approaches to estimating these surface fluxes are based on inverse modeling using atmospheric CO₂ observations. In addition to CO₂ measurements, other gases have been shown to be linked to CO₂ fluxes. Nitrogen dioxide (NO₂) has been used as a proxy for anthropogenic CO₂ emissions, both at regional scale (Hakkarainen et al., 2016; Reuter et al., 2014) and at the level of individual power plants or cities (Hakkarainen et al., 2021; Reuter et al., 2019). Solar-induced fluorescence (SIF) serves as an indicator of vegetation gross primary production, and carbon monoxide (CO) is connected with biomass-burning emissions (Lin et al., 2020).
In this work we take a machine learning (ML) approach to predict global monthly CO₂ fluxes based on satellite observations of CO₂ and SIF from NASA's Orbiting Carbon Observatory-2 (OCO-2), NO₂ observations from the OMI instrument on board NASA's Aura satellite, and CO observations from MOPITT/Terra. We do not use the geographic location of these observations in our model, as we are interested in a model that is independent of location. As training data for the CO₂ fluxes, we use monthly estimates from CarbonTracker CT2019b. We focus on the years 2015–2021, as OCO-2 was launched in 2014. Since the current CarbonTracker CT2019b global estimates extend only to December 2018, we use observations from 2015 to 2017 as training data and 2018 measurements as test data.
After comparing different ML regression models, we conclude that the best option is an XGBoost model, which achieves the lowest mean absolute error. We show that the monthly CO₂ fluxes predicted by our model agree with those derived by CarbonTracker CT2019b, with small differences in certain areas and months. Our results indicate that the NO₂ measurements play the most important role in deriving CO₂ fluxes, followed by the SIF observations. This supports the importance of NO₂ for detecting anthropogenic CO₂ emissions. We further make predictions for the years 2019 to 2021 and detect the reduction of CO₂ emissions due to the COVID-19 lockdowns in 2020.
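A hedged sketch of the regression setup (file and column names are illustrative; the actual preprocessing and hyperparameters may differ):

    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import mean_absolute_error

    FEATURES = ["xco2", "sif", "no2", "co"]  # OCO-2 XCO2, OCO-2 SIF, OMI NO2, MOPITT CO

    train = pd.read_csv("fluxes_2015_2017.csv")  # CarbonTracker CT2019b flux targets
    test = pd.read_csv("fluxes_2018.csv")

    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(train[FEATURES], train["co2_flux"])

    pred = model.predict(test[FEATURES])
    print("MAE:", mean_absolute_error(test["co2_flux"], pred))
    # feature importances; here NO2 ranked highest, followed by SIF
    print(dict(zip(FEATURES, model.feature_importances_)))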
References
Hakkarainen, J., Ialongo, I., Tamminen, J., 2016. Direct space-based observations of anthropogenic CO₂ emission areas from OCO-2. Geophysical Research Letters 43, 11,400–11,406. https://doi.org/10.1002/2016GL070885
Hakkarainen, J., Szeląg, M.E., Ialongo, I., Retscher, C., Oda, T., Crisp, D., 2021. Analyzing nitrogen oxides to carbon dioxide emission ratios from space: A case study of Matimba Power Station in South Africa. Atmospheric Environment: X 10, 100110. https://doi.org/10.1016/j.aeaoa.2021.100110
Reuter, M., Buchwitz, M., Hilboll, A., Richter, A., Schneising, O., Hilker, M., Heymann, J., Bovensmann, H., Burrows, J.P., 2014. Decreasing emissions of NOx relative to CO₂ in East Asia inferred from satellite observations. Nature Geoscience 7, 792–795. https://doi.org/10.1038/ngeo2257
Reuter, M., Buchwitz, M., Schneising, O., Krautwurst, S., O'Dell, C.W., Richter, A., Bovensmann, H., Burrows, J.P., 2019. Towards monitoring localized CO₂ emissions from space: co-located regional CO₂ and NO₂ enhancements observed by the OCO-2 and S5P satellites. Atmospheric Chemistry and Physics 19, 9371–9383. https://doi.org/10.5194/acp-19-9371-2019
Iceberg calving has a strong impact on the internal stress field of marine-terminating glaciers and is therefore an important indicator of dynamic glacier changes such as discharge, acceleration, thinning and retreat. An accurate parameterization of iceberg calving is essential for constraining glacial evolution and considerably improves simulation results when projecting future sea level contributions. Consequently, temporally and spatially comprehensive datasets of calving front locations are crucial for a better understanding and modelling of marine-terminating glaciers. The increasing availability and quality of remote sensing imagery enable a continuous and accurate mapping of calving front locations. However, the dramatic increase in data volume also accentuates the need for automated and scalable delineation strategies.
Due to advances in the field of machine learning, deep artificial neural networks (ANNs) are becoming the model of choice for solving complex image processing tasks. Recent studies have already explored the application of these tools to glacier front delineation with very promising results. Rather than simply adding to these studies, we assess the importance of potential input data layers. In particular, we focus on optical Landsat imagery exploiting the full range of multi-spectral capabilities, a statistical textural feature analysis, and external topography model data. We estimate their effects on prediction performance through a dropped-variable approach: utilizing high-performance computing systems, we re-train our ANN model with certain input features explicitly removed (see the sketch below). The associated reference dataset comprises more than 1000 satellite images over 23 of the most important Greenlandic outlet glaciers, acquired from 2013 to 2021. The resulting feature importances emphasize both the potential of integrating additional input information and the significance of its thoughtful selection. We advocate utilizing multi-spectral features, as their integration results in more accurate predictions compared to conventional single-band inputs. This is especially pronounced for challenging ice-mélange, illumination and calving conditions. In contrast, the application of textural and topographic inputs cannot be recommended without reservation: their use resulted in model overfitting, indicated by a lower accuracy on the validation dataset. The results presented in this contribution reinforce existing efforts towards ANN-based calving front mapping and lay the foundation for further applications and developments.
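A schematic sketch of the dropped-variable loop (the training and evaluation helper is a hypothetical placeholder for our actual pipeline):

    def dropped_variable_importance(train_and_evaluate, all_channels, feature_groups):
        # train_and_evaluate: re-trains the ANN on the given input channels and
        # returns a validation score where higher is better (e.g., accuracy)
        baseline = train_and_evaluate(all_channels)
        importance = {}
        for name, dropped in feature_groups.items():
            reduced = [c for c in all_channels if c not in dropped]
            importance[name] = baseline - train_and_evaluate(reduced)
        return importance  # larger score drop = more important feature group

    # e.g., feature_groups = {"multispectral": [...], "texture": [...], "topography": [...]}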
Mapping forest structure at global scale is an important component of understanding the Earth's carbon cycle. Several new space missions have been developed to support this goal by measuring forest structure predictive of biomass and carbon stock. Furthermore, forest structure characterizes habitats and is thus key for biodiversity conservation. NASA's Global Ecosystem Dynamics Investigation (GEDI) is one of these missions and the first space-based LIDAR designed to measure forest structure (Dubayah et al., 2020). Despite atmospheric noise, the on-orbit full waveforms measured by GEDI are predictive of canopy top height (Lang et al., 2022). Ultimately, these sparse waveforms and derived canopy height metrics will be used to produce global biomass products at 1-km resolution (Dubayah et al., 2020).
Nevertheless, there is a need for maps of high spatial and temporal resolution to make informed localized decisions and to improve estimates of carbon emissions caused by deforestation. Here we present a probabilistic deep learning approach to estimate wall-to-wall canopy height maps from ESA's optical Sentinel-2 images with a 10 m ground sampling distance. A deep ensemble of fully convolutional neural networks is trained to regress canopy top height using sparse GEDI reference data (Lang et al., 2022). Not only does this approach extend our previous work (Lang et al., 2019, 2021; Becker et al., 2021) from country-level modelling to the global scale, it also yields the predictive uncertainty of the final canopy height estimates. In other words, the model estimates the variance of its predictions, indicating in which cases the predictions are less trustworthy.
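Assuming, as in Lang et al. (2022), that every ensemble member predicts a mean and a variance per pixel, the ensemble output can be aggregated as a uniform Gaussian mixture (a minimal sketch):

    import numpy as np

    def ensemble_predict(means, variances):
        # means, variances: (n_models, ...) per-model canopy height estimates
        mu = means.mean(axis=0)
        # total variance = average per-model variance + disagreement between models
        var = variances.mean(axis=0) + means.var(axis=0)
        return mu, var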
To enable such a globally trained model to adjust for regional conditions, the geographical coordinates are used as additional inputs to the Sentinel-2 bands. Furthermore, canopy height follows a long-tail distribution, i.e. tall trees are very rare. Thus, a new balancing strategy is developed to reduce the underestimation of tall canopies while preserving the calibration of the predictive uncertainty estimates.
The model performance is evaluated globally on held-out GEDI reference data from randomly selected Sentinel-2 tiles, corresponding to 100 km × 100 km regions. In addition, the resulting maps are compared to dense canopy top height maps (RH98) derived from NASA's LVIS airborne LIDAR campaigns (AfriSAR, ABoVE/GEDI). On the held-out data the model achieves an RMSE of 5.0 m and an ME of 0.5 m, which indicates a slight overestimation w.r.t. the GEDI reference heights. The final, dense predictions are in good agreement with the LVIS-derived RH98, with an RMSE of 8.8 m and an ME of 0.2 m. Both the use of geo-coordinates and the balancing strategy reduce the saturation of high canopies. Furthermore, the predictive uncertainty estimates are empirically well calibrated, i.e. the predictive variances correspond to the expected squared errors.
To conclude, the developed methodology makes it possible to produce high-resolution canopy height maps from Sentinel-2 at global scale. How such a model, trained within the GEDI coverage between 51.6° North and South, generalizes to regions north of 51.6° latitude remains to be evaluated with additional reference data.
References:
Dubayah, R., Blair, J. B., Goetz, S., Fatoyinbo, L., Hansen, M., Healey, S., ... & Silva, C. (2020). The Global Ecosystem Dynamics Investigation: High-resolution laser ranging of the Earth’s forests and topography. Science of remote sensing, 1, 100002.
Lang, N., Kalischek, N., Armston, J., Schindler, K., Dubayah, R., & Wegner, J. D. (2022). Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles. Remote Sensing of Environment, 268, 112760.
Lang, N., Schindler, K., & Wegner, J. D. (2019). Country-wide high-resolution vegetation height mapping with Sentinel-2. Remote Sensing of Environment, 233, 111347.
Lang, N., Schindler, K., & Wegner, J. D. (2021). High carbon stock mapping at large scale with optical satellite imagery and spaceborne LIDAR. arXiv preprint arXiv:2107.07431.
Becker, A., Russo, S., Puliti, S., Lang, N., Schindler, K., & Wegner, J. D. (2021). Country-wide retrieval of forest structure from optical and SAR satellite imagery with Bayesian deep learning. Under review.
In recent years, numerous deep learning techniques have been proposed to tackle the semantic segmentation of aerial and satellite images, topping the leaderboards of the main scientific contests and representing today's state of the art.
The encoder-decoder architecture has been widely adopted for this task: the most popular frameworks for semantic segmentation, e.g. U-Net or SegNet, rely on such an architecture and have achieved very high accuracies on optical images.
Nevertheless, despite their promising results, these state-of-the-art techniques are still unable to provide results with the level of accuracy sought in real applications, i.e. in operational settings. They most often perform tasks by learning from examples, without prior knowledge about the tasks. Millions of parameters have to be learned through an optimization process, usually stochastic gradient descent. Convolutional neural networks have already surpassed human accuracy in many vision tasks, but due to their capacity to fit a wide diversity of non-linear data, they require a large amount of training data. Furthermore, neural networks are in general prone to overfitting on small datasets: the model fits the training data well but is not accurate on new data. This often makes neural networks incapable of correctly assessing the uncertainty in the training data and hence leads to overly confident decisions. To avoid overfitting, several regularization techniques have been proposed, such as early stopping, weight decay, or L1 and L2 regularization. Currently, the most popular and empirically effective technique to reduce overfitting is dropout.
Thus, it appears mandatory to qualify these segmentation results and to be able to estimate the uncertainty brought by a deep network. In this work, we address uncertainty estimation in semantic segmentation. Bayesian learning for CNNs has been proposed recently and is based on Bayes by Backprop. It produces results similar to traditional deep learning methods, along with uncertainty metrics. In traditional deep learning, models are conditioned on thousands (sometimes millions) of weights w that are learned during training; once learned, the weights are fixed for inference. In Bayesian deep learning, models are also conditioned on weights, but each weight is assumed to follow an unknown distribution. This unknown distribution can be approximated by a user-defined variational distribution q(w|theta). Generally, q is a normal distribution and theta denotes its two parameters, i.e. the mean mu and the standard deviation sigma, although any variational distribution can be chosen. Hence, unlike in traditional networks, the weights of a Bayesian network are not fixed but drawn from the variational distribution, whose parameters are fixed after the learning phase. The weights can thus take a wider range of values, allowing the model to learn the data distribution more accurately.
Monte Carlo Dropout is equivalent to Bayesian deep learning, with its advantages and drawbacks. Its main advantage is that it can be performed using traditional deep learning optimisation methods (e.g. there is no need to add the Kullback-Leibler divergence to the cost function). The only requirement is that each learning layer (i.e. convolution or dense layer) be followed by a dropout layer that remains active in both the training and prediction phases. The main drawback concerns the variational distribution: the user is not able to choose it. Each weight can thus only take two values, 0 or a specific value learned during training. Although this seems limited, it is sufficient to learn the data distribution more accurately than a traditional network does. To obtain relevant results, several predictions need to be performed in order to explore a sufficient number of weight values.
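A minimal Keras illustration of the Monte Carlo Dropout mechanism (the architecture is a toy stand-in, not the U-Net model used in this work):

    import numpy as np
    import tensorflow as tf

    inputs = tf.keras.Input(shape=(128, 128, 3))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.5)(x, training=True)   # dropout stays active at inference
    outputs = tf.keras.layers.Conv2D(2, 1, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)

    def mc_predict(model, image, n_samples=20):
        # each forward pass samples a different dropout mask, i.e. different weights
        return np.stack([model(image[None])[0].numpy() for _ in range(n_samples)])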
To validate the proposed approach, we consider four different datasets representing various urban scenes. While the first three are public aerial datasets intended to ease research reproducibility, the last one is a satellite dataset that allows us to demonstrate the behavior of our method on spaceborne imagery as well. The semantic segmentation tasks cover binary classification (building/background) and multiclass classification.
Once trained, a Bayesian model produces different predictions for the same input data, since its weights are sampled from a distribution. Therefore, several predictions need to be performed. At each iteration, the model returns a pixel-wise probability. The final semantic segmentation map is computed through a majority vote over all these predictions, from which one can then derive confusion matrices and the usual classification/segmentation quality metrics (precision, recall, accuracy, F-score, intersection over union (IoU) and kappa coefficient). The Bayesian model can also provide uncertainty metrics; two types of uncertainty measures are usually investigated. Epistemic uncertainty, also known as model uncertainty, represents what the model does not know due to insufficient training data. Aleatoric uncertainty is due to noisy measurements in the data and can be explained away with increased sensor precision. These two uncertainties combined form the predictive uncertainty of the network. In this work, we derive two metrics, namely the entropy of the predictive distribution (also known as predictive entropy) and the mutual information between the predictive distribution and the posterior over network weights. These metrics are very interesting since mutual information captures epistemic (or model) uncertainty, whereas predictive entropy captures predictive uncertainty, which combines both epistemic and aleatoric uncertainties.
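Both metrics follow directly from the Monte Carlo samples. A sketch of the computation (array shapes are our convention):

    import numpy as np

    def uncertainty_maps(mc_probs, eps=1e-12):
        # mc_probs: (T, H, W, C) class probabilities from T stochastic forward passes
        mean_p = mc_probs.mean(axis=0)                                  # (H, W, C)
        predictive_entropy = -(mean_p * np.log(mean_p + eps)).sum(-1)   # total uncertainty
        expected_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(-1).mean(0)
        mutual_information = predictive_entropy - expected_entropy      # epistemic part
        votes = mc_probs.argmax(-1)                                     # (T, H, W)
        segmentation = np.apply_along_axis(                             # pixel-wise majority vote
            lambda v: np.bincount(v, minlength=mc_probs.shape[-1]).argmax(), 0, votes)
        return segmentation, predictive_entropy, mutual_information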
Built on the widespread U-Net architecture, our model achieves semantic segmentation with high accuracy on several state-of-the-art datasets, with accuracies ranging between 91% and 93%. More importantly, uncertainty maps are also derived from our model. While they allow for a sounder qualitative evaluation of the segmentation results, they also provide valuable information for improving the reference databases. Furthermore, we show that our model is very robust to noise, especially label noise.
This work has been published in Remote Sensing (https://doi.org/10.3390/rs13193836)
In recent years, deep learning has improved the way remote sensing data are processed, and the classification of hyperspectral data is no exception. 2D and 3D convolutional neural networks have outperformed classical algorithms on hyperspectral image classification in many cases. However, geological hyperspectral image classification poses several challenges, often involving spatially more complex objects than are found in other disciplines of hyperspectral imaging with spatially more homogeneous objects (e.g., industrial applications, or aerial urban and farmland cover types). In geological hyperspectral image classification, classical algorithms that focus on the spectral domain still often show higher accuracy, more sensible results, or greater flexibility due to their independence from spatial information.
DeepGeoMap is inspired by classical machine learning algorithms that focus on the spectral domain, like the binary feature fitting (BFF) and EnGeoMap algorithms. It is a spectrally focused, spatial-information-independent, deep multi-layer convolutional neural network for hyperspectral geological data classification. More specifically, the architecture of DeepGeoMap uses a sequential series of 1D convolutional layers and fully connected dense layers, with rectified linear unit and softmax activations, 1D max and 1D global average pooling layers, additional dropout to prevent overfitting, and a categorical cross-entropy loss function optimized with Adam. DeepGeoMap was realized using Python 3.7 and the machine and deep learning interface TensorFlow with graphical processing unit (GPU) acceleration. This spectrally focused 1D architecture allows DeepGeoMap models to be trained with hyperspectral laboratory image data of geochemically validated samples (e.g., ground truth samples for aerial or mine face images); the laboratory-trained model can then classify other or larger scenes, similar to classical algorithms that use a spectral library of validated samples for image classification.
The classification capabilities of DeepGeoMap have been tested using geochemically validated geological hyperspectral image data sets. The presentation will include a showcase of how a copper ore laboratory data set was used to train a DeepGeoMap model for the classification and analysis of a larger mine face scene within the Republic of Cyprus, where the samples originated. DeepGeoMap can achieve higher accuracies and outperform classical algorithms and other neural networks in geological hyperspectral image classification test cases. The spectral focus of DeepGeoMap is likely its most considerable advantage over spectral-spatial classifiers like 2D or 3D neural networks, as it enables DeepGeoMap models to be trained independently of spatial entities, shapes, and/or resolutions.
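A hedged TensorFlow sketch of a spectrally focused 1D CNN in the spirit of DeepGeoMap (layer counts, filter sizes and class numbers are illustrative, not the published configuration):

    import tensorflow as tf

    N_BANDS, N_CLASSES = 224, 8   # illustrative: spectral bands and material classes

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(N_BANDS, 1)),             # one spectrum per pixel
        tf.keras.layers.Conv1D(32, 7, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(128, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),              # removes spatial dependence
        tf.keras.layers.Dropout(0.5),                          # regularization against overfitting
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])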