Crop type maps are critical datasets for agriculture and food security monitoring using Earth observations. However, accurate crop type maps are difficult to produce for regions or crops with limited labeled data for training machine learning classifiers. Recent work has explored techniques like transfer learning and meta-learning to learn from data-rich regions in order to improve performance in data-sparse regions. Existing meta-learning approaches typically leverage only direct inputs and outputs (i.e., the geospatial imagery and the associated target label). However, geospatial imagery and agricultural data are rich in metadata that can further inform these algorithms, such as the spatial coordinates of the data points. We introduce a new method called task-informed meta-learning (TIML), an augmentation to model-agnostic meta-learning (MAML) that takes advantage of this metadata.
Specifically, we leverage relevant metadata by encoding it into a task vector, which we then use to modulate a meta-learning model's weights for a specific task. Intuitively, this modulation can be interpreted as pushing the model weights towards an optimum for a given task before any fine-tuning happens. In addition, we introduce the concept of forgetfulness, which consists of removing training tasks the meta-model has memorized from the training loop, in order to improve performance on difficult tasks. This concept is especially meaningful when tasks are spatially defined, since many tasks can contain very similar (or overlapping) data.
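To make the modulation concrete, the following is a minimal sketch of one way a task vector could modulate a network, assuming a FiLM-style scale-and-shift on hidden features; the metadata choice, layer sizes, and names are illustrative rather than the exact TIML implementation.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Encodes task metadata (e.g. latitude/longitude) into a task vector."""
    def __init__(self, metadata_dim=2, task_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(metadata_dim, 64), nn.ReLU(),
                                 nn.Linear(64, task_dim))

    def forward(self, metadata):
        return self.net(metadata)

class TaskModulatedClassifier(nn.Module):
    """Hidden features are scaled and shifted by the task vector, pushing the
    model toward a task-specific optimum before any gradient-based fine-tuning."""
    def __init__(self, input_dim, hidden_dim=64, task_dim=32):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.gamma = nn.Linear(task_dim, hidden_dim)  # task-conditioned scale
        self.beta = nn.Linear(task_dim, hidden_dim)   # task-conditioned shift
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x, task_vector):
        h = self.features(x)
        h = self.gamma(task_vector) * h + self.beta(task_vector)
        return self.head(h)

# Usage: the task vector is computed once per task from its metadata.
task_encoder = TaskEncoder()
model = TaskModulatedClassifier(input_dim=18)   # e.g. 18 input bands (illustrative)
metadata = torch.tensor([[8.6, 0.8]])           # hypothetical lat/lon of the task
logits = model(torch.randn(16, 18), task_encoder(metadata))
```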
We demonstrated the utility of this method by training it on CropHarvest, a global dataset of agricultural class labels paired with remote sensing data. We then evaluated TIML in three different agro-ecologies: Togo, Kenya, and Brazil. These regions cover a wide range of training-dataset sizes (from ~500 positive examples in Togo to 26 positive examples in Brazil) and target crops (from crop vs. non-crop classification to maize identification). We compared our method to a range of baselines released alongside the CropHarvest dataset, including traditional MAML. TIML was the most consistently performant method across these three agro-ecologies and improved average performance compared to the benchmark models, measured using both F1 score and ROC AUC. Finally, we performed ablation experiments to identify the contributions of different parts of TIML to the final performance, allowing us to gauge the effects of the task information, the encoder, and forgetfulness on model performance. We find that the encoder architecture contributes significantly to performance, and that forgetfulness boosts performance for difficult tasks without penalizing performance elsewhere. These findings can inform future algorithm development and the application of TIML to other remote sensing datasets and tasks.
In pursuit of a high-resolution, up-to-date, trusted land-use classification product, the Earth Observation community will need to leverage existing labelled datasets, state-of-the-art data science techniques, and the massive quantities of unlabelled data coming from a variety of sensors. The CORINE Land Cover (CLC) 2018 map offers a complete picture of European land use in 2018, at a resolution of 25 ha. This can act as a springboard for the creation of continuous, higher resolution maps.
Planet is preparing to release a corpus of 500,000 time series across Europe, with their associated CLC 2018 labels, as part of the Horizon 2020 project RapidAI4EO [1], [2]. In this work, we have created a parallel dataset of contemporaneous, co-registered Sentinel-1 time series over a subset of the RapidAI4EO locations, carefully curated and pre-processed to have comparable acquisition characteristics. With the combined SAR and optical time series data, we propose a self-supervised learning technique to retrieve clusters of pixels that correspond to different CLC classes, at a higher resolution than they were originally mapped.
Based on the SimSiam architecture [3], a neural network encoder is trained to be invariant when embedding pixels of the same class into a latent feature space. As input, it takes a pixel's yearly time series in both 4-band optical and SAR. By providing the encoder with two pixels which are very likely to be of the same CLC class (e.g. two pixels drawn from the same CLC class, or the same pixel in different years) and training their embeddings to match, the outputs of the encoder naturally become strongly clustered by CLC class. Attention-based and convolutional encoders are explored and tested.
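As a rough illustration of the training objective, the following is a minimal SimSiam-style step; the stop-gradient and predictor follow [3], while the pair-sampling, channel counts, and the plain MLP encoder are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder MLP encoder over a flattened yearly time series of 6 channels
# (assumed: 4 optical bands + 2 SAR channels) at 73 dates; the real work
# explores attention-based and convolutional encoders instead.
encoder = nn.Sequential(nn.Linear(6 * 73, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))

def simsiam_loss(p, z):
    """Negative cosine similarity, with stop-gradient on the target branch."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

# x1, x2: a batch of pixel pairs that are very likely of the same CLC class
x1, x2 = torch.randn(32, 6 * 73), torch.randn(32, 6 * 73)
z1, z2 = encoder(x1), encoder(x2)
p1, p2 = predictor(z1), predictor(z2)
loss = 0.5 * simsiam_loss(p1, z2) + 0.5 * simsiam_loss(p2, z1)
loss.backward()
```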
The temporal nature of the input data, and its multi-modality, are key to the success of this technique. Compared to single images, time series contain information that is essential for recovering the land-use class. For example, it would not be possible to correctly classify wetland regions from single images acquired while they are temporarily dried up during summer months. The fusion of SAR and optical data also allows features that are not visible in one modality or the other to be recovered: for example, SAR can discriminate vegetation structure and areas of different moisture levels, while optical data captures vegetation colour and surface albedo.
The CLC classes are provided at a much coarser resolution than our data, which leads to noisy training labels when used in a self-supervised setting. This presents a significant challenge for the method and also makes validation difficult, because no trusted ground truth exists at the target resolution. Nevertheless, this work underlines the power of combining SAR with optical data, in comparison to methods which use only one modality, and of embracing the temporal signature of different land cover types, as has long been recognised in land use mapping work.
Future work could focus more on the flagging and analysis of outliers and anomalies in the outputs of the model. A measure of a pixel’s similarity to others in its class could act as a measure of confidence, given that pixels which are weakly clustered are less likely to really be of the same land cover type. These outliers may be due to a failure of the model, or may represent classes not included in the original CORINE classification scheme, which are perhaps only detectable when considering higher resolution data.
[1] Davis, T., Bischke, B., Helber, P., Marchisio, G., Senaras, C., Zanaga, D., Van De Kerchove, D., Wania, A., "RapidAI4EO: A Multi-Format Dataset for Automated Land Cover Classification and Change Detection," in Proceedings of the 2021 Conference on Big Data from Space (BiDS'21), pp. 65-68, doi: 10.2760/125905.
[2] Marchisio, G., Helber, P., Bischke, B., Davis, T., Senaras, C., Zanaga, D., Van De Kerchove, D., Wania, A., "RapidAI4EO: A Corpus for Higher Spatial and Temporal Reasoning," in 2021 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 1161-1164, doi: 10.1109/IGARSS47720.2021.9553080.
[3] Chen, X. and He, K., "Exploring Simple Siamese Representation Learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15750-15758.
Recent remote sensing instruments acquire high spectral and spatial resolution satellite image time series (HR-SITS) at frequent dates, useful for Earth monitoring applications (land cover mapping, biophysical studies, etc.). For instance, the Sentinel-2 (S2) constellation provides observations with constant viewing angles at least every 5 days, with ground resolution ranging from 60 to 10 m in 13 spectral bands. This large amount of data allows increased performance on current applications: the high S2 revisit rate increases possibilities for change detection and decreases the risk of missing observations due to cloud cover. However, this data is high-dimensional, and the information in the samples can be sparsely distributed and redundant. Moreover, the available measurements are not evenly sampled in the temporal dimension due to cloud cover.
Current approaches in satellite image processing involve machine learning models, such as neural networks, random forests, and support vector machines, to learn the structure of the dataset and make predictions (typically land cover classification and mapping [2]). But most methods rely on supervised training with labelled data, which is tedious and costly to build for operational Earth monitoring applications over large areas. Furthermore, computationally intensive training of these models must be performed for every new application. Therefore, learning representations of SITS that can be used across many applications may alleviate training needs, requiring less training time and reference data.
Deep generative models, such as Variational Auto-Encoders (VAE) [1], are a promising unsupervised approach to learning representations. They combine deep learning with probabilistic modelling and do not require labelled data at the training stage. Auto-Encoders (AE) encode input data into a latent space of lower dimension (thus a representation of the data) and decode it to reconstruct the input data; they are trained by minimizing the reconstruction error. VAEs, on the other hand, encode data as Gaussian distributions, which add continuity in the latent space and can be interpreted as the inference uncertainty of the latent variables.
Despite their ability to learn latent representations of the data, the representation produced by a VAE is generally unstructured and uninterpretable: the structure of the data cannot be accessed, predictions cannot be explained or justified, and different trainings may lead to different latent structures. Learning disentangled representations is an attempt to improve latent space structure: assuming the existence of generating factors behind the encoded data, a disentangled latent representation uncovers these factors, so that a variation along one of those latent dimensions corresponds to a variation of a single generating factor in the reconstruction [3]. Disentanglement learning is usually done by altering the VAE objective function, by designing encoder and decoder architectures, or by selecting a latent distribution prior. These methods attempt to uncover representations through learning, and the learnt representations may or may not be interpretable.
However, in remote sensing there are strong priors on the observed data, so we know that specific structures should exist in high-level embeddings of the data. There are physical models of the formation and observation of the sensed signal (radiative transfer models), and there are temporal evolution models of the observed phenomena (phenology). For instance, in the case of crops, the evolution of the Normalized Difference Vegetation Index (NDVI), which describes vegetation photosynthetic activity, can be modeled very well using a parametric model that accounts for growth and decay along seasons, and whose variables are phenological parameters [4].
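For concreteness, one widely used parametric form for such seasonal NDVI trajectories is the double logistic (a generic illustration, not necessarily the exact model of [4]):

$$\mathrm{NDVI}(t) = A + B\left(\frac{1}{1 + e^{-k_1 (t - t_1)}} - \frac{1}{1 + e^{-k_2 (t - t_2)}}\right),$$

where $A$ is the background NDVI, $B$ the seasonal amplitude, $t_1$ and $t_2$ the start and end of the growing season, and $k_1$, $k_2$ the rates of growth and senescence. All six parameters are directly interpretable phenological quantities.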
Parameters of bio- and geo-physical models are ideal candidates for a human-interpretable representation of data, when the data fits the model. Therefore, we propose here to incorporate prior knowledge on the data to guide the unsupervised learning in a VAE toward interpretable representations. Specifically, we present a method enabling the estimation of the distribution of phenological parameters from NDVI time series of crops in the latent space, which thus becomes an interpretable representation. To achieve this, we remove the neural network decoder in the VAE and replace it with a model of the temporal evolution of crop NDVI, parametrized with the latent space variables. Thus, the latent space becomes de facto interpretable: an interpretable and well-structured latent space is imposed rather than learnt. The encoder is trained to infer a distribution over the model's physical variables. However, as some physical variables may not be correctly modeled by Gaussian distributions, we also replaced the prior in the latent space. Likewise, the traditional VAE objective function cannot be used for training.
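As a minimal sketch of this physics-decoder autoencoder (using PyTorch and the double-logistic model above as the fixed decoder; for brevity it keeps a Gaussian posterior and a plain reconstruction loss, whereas the method described here replaces the latent prior and the objective):

```python
import torch
import torch.nn as nn

def double_logistic(t, params):
    """Double-logistic NDVI model: params = (A, B, t1, k1, t2, k2)."""
    A, B, t1, k1, t2, k2 = params.unbind(dim=-1)
    t = t.unsqueeze(0)  # (1, T), broadcasts against per-sample parameters
    rise = torch.sigmoid(k1.unsqueeze(-1) * (t - t1.unsqueeze(-1)))
    fall = torch.sigmoid(k2.unsqueeze(-1) * (t - t2.unsqueeze(-1)))
    return A.unsqueeze(-1) + B.unsqueeze(-1) * (rise - fall)

class PhysicsDecoderVAE(nn.Module):
    """Encoder infers a distribution over phenological parameters; the
    decoder is the fixed physical model, with no learnable weights."""
    def __init__(self, n_dates, n_params=6, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_dates, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_params)
        self.log_sigma = nn.Linear(hidden, n_params)

    def forward(self, ndvi, t):
        h = self.encoder(ndvi)
        mu, log_sigma = self.mu(h), self.log_sigma(h)
        eps = torch.randn_like(mu)            # reparameterization trick keeps
        params = mu + eps * log_sigma.exp()   # sampling differentiable
        recon = double_logistic(t, params)    # physical model as decoder
        return recon, mu, log_sigma

# Training minimizes reconstruction error (a prior term on the physical
# variables is omitted here for brevity); 110 dates matches the dataset.
model = PhysicsDecoderVAE(n_dates=110)
t = torch.linspace(0.0, 1.0, 110)             # normalized acquisition dates
ndvi = torch.rand(32, 110)                    # placeholder batch
recon, mu, log_sigma = model(ndvi, t)
loss = ((recon - ndvi) ** 2).mean()
loss.backward()
```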
This autoencoder can be trained in an unsupervised fashion from real Sentinel-2 time series. We use a dataset of about 10^6 pixels, acquired with up to 110 dates per year over areas of orbit overlap, with labels provided for 20 land cover classes. As physical models can also be used to simulate data, a labelled dataset can be generated for both supervised and unsupervised learning; we use this simulated dataset to provide quantitative measures of inference quality.
Our method follows a hybrid AI approach, combining deep learning with bio-physical models as priors. To enable training, we redefined the objective function, the latent space distribution, and the reparametrization technique, ensuring that the random sampling process remains compatible with gradient backpropagation.
Using the proposed method, we are able to extract full posterior distributions (and therefore uncertainties) of phenological variables from Sentinel-2 image time series through unsupervised training.
Extracting semantic, meaningful information is one of the most important applications of remote sensing, as it provides the information needed for a wide range of tasks. Remote sensing techniques, due to their unique capabilities, are among the most effective approaches for such information extraction. Recent advancements in remote sensing imagery systems, which provide very high-resolution Earth Observation (EO) images with varied characteristics, together with the immense development of data processing techniques and computational capacity, have created unprecedented opportunities for semantic information extraction.
However, classifying high-resolution EO images into semantically meaningful classes is not straightforward. Unsupervised classification methods suffer from the lack of semantic labels. Supervised classifiers, on the other hand, despite producing semantically meaningful classes, require accurate Ground Truth (GT) data beforehand. Obtaining adequate GT data is costly and laborious, and even impossible in some study cases. In addition, very high-resolution EO images contain abundant semantic details of the land cover which may be neglected in user-defined GT maps.
Various data mining studies have suggested unsupervised semantic discovery methods, such as the Latent Dirichlet Allocation (LDA) model, to solve this problem. LDA is a generative probabilistic model originally proposed for topic modeling in text mining. In the image domain, LDA models each image as a mixture of latent topics drawn from a Dirichlet distribution, and can be applied to EO images to generate topic maps based on low-level features extracted from the image. The generated topic maps can be utilized for latent semantic information discovery in various contexts, such as creating or correcting existing GT maps, target detection, and semantic annotation analysis.
Moreover, high-resolution EO images with abundant details require powerful and robust classification procedures. Advanced deep learning architectures have achieved state-of-the-art classification results and attract considerable research attention, but they require large amounts of annotated training data, which considerably restricts their applicability to small-scale areas with limited labeled data. Conventional classification methods, on the other hand, such as the Support Vector Machine (SVM), can be trained with fewer training samples. Most of these conventional methods apply the classifier to low-level features extracted from the image. Despite moderately reasonable results, the semantic gap between the low-level features and the high-level concepts of the semantic land cover labels decreases classification accuracy and trustworthiness. Mid-level representation models, such as the Bag of Visual Words (BOVW) model, have been utilized in many studies to cover this semantic gap, providing a mid-level bridge between low-level features and semantic labels.
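As a rough sketch of how these two models can be chained in practice (using scikit-learn; the patch layout, descriptor dimensionality, vocabulary size, and topic count below are illustrative assumptions, not this study's exact configuration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

# Low-level features (e.g. color/texture descriptors) extracted beforehand;
# shapes are placeholders standing in for a real EO image.
rng = np.random.default_rng(0)
descriptors = rng.random((10_000, 16))         # all local descriptors
patch_descriptors = rng.random((500, 50, 16))  # 500 patches x 50 descriptors

# 1) BOVW: quantize descriptors into a visual vocabulary, then represent
#    each patch as a histogram of visual word counts.
vocab = KMeans(n_clusters=64, n_init=10).fit(descriptors)
bovw = np.stack([
    np.bincount(vocab.predict(d), minlength=64) for d in patch_descriptors
])

# 2) LDA: model each patch as a mixture of latent topics over visual words.
lda = LatentDirichletAllocation(n_components=8, random_state=0)
topic_proportions = lda.fit_transform(bovw)   # per-patch topic mixture
topic_map = topic_proportions.argmax(axis=1)  # dominant topic per patch
```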
In the present study, a latent semantic information discovery method is applied to various EO images using two well-known data mining techniques: the LDA and BOVW representation models. The reliability and suitability of the method are comprehensively evaluated in different scenarios and applications, and the results demonstrate its effectiveness for discovering latent semantic information in EO images. The main purpose of this study is to demonstrate the necessity of semantic analysis for information discovery in EO applications, and to show how this procedure can boost the performance of different machine learning algorithms, especially classifiers, in EO image processing. Furthermore, the applicability of latent semantic information discovery to Ground Truth map enhancement, target detection, and semantic annotation analysis of EO datasets is demonstrated.
In terms of GT map enhancement, the semantic information discovery method is utilized to detect semantic classes neglected in a user-defined GT map and to create an enriched GT map with the correct semantic classes. It is shown that the enhanced GT map not only results in a more semantically comprehensive and meaningful classified map, but also improves the overall performance of the classifier, with fewer classification errors.
Regarding target detection, the latent semantic information discovery method is employed to detect a target phenomenon (a wildfire, in this case) in EO images in a completely unsupervised manner. The results illustrate the capability of the semantic discovery method to detect wildfire-affected areas in EO images both shortly after the incident and long afterwards (i.e., several months later), when no trace of the wildfire is left in the area. The method can thus be employed in various practical applications, including disaster management and safety measures.
Owing to the immense popularity of deep learning methods and the fact that these architectures require large amounts of annotated training data, several studies have introduced annotated EO datasets. Despite the dataset quality (e.g., semantic annotation) evaluations provided in these studies, several annotation errors and misclassifications can be found in the datasets, which deteriorate the performance of the classification procedure. This study illustrates the effectiveness of the latent semantic information discovery method for annotation analysis of EO datasets: it can be used to detect and remove misclassified samples, as well as samples with mixed or ambiguous semantic labels. A more semantically robust annotated dataset will boost the performance of deep architectures in EO practice.
In conclusion, conventional machine learning methods are not capable of extracting the latent semantic information contained in imagery from advanced remote sensing systems. Discovering latent semantic information from EO images is necessary to harness the capabilities of these technologies. The data mining-based semantic information discovery techniques employed in this study, the LDA and BOVW models, proved effective across different remote sensing datasets for extracting latent semantic information from EO images for various applications.
Recent deep learning methods for computer vision often require pre-trained backbone networks for the extraction of feature maps. Meanwhile, the availability of pre-trained neural network models in remote sensing is low, which can be attributed to a number of factors. The large number of different sensors and sensing modalities (optical, SAR, hyperspectral) would require pre-training individual networks for each sensor. Further, there is a lack of large annotated datasets comparable to ImageNet. With new self-supervised learning methods closing the performance gap to supervised models in computer vision, these techniques bear great potential for resolving these issues by enabling the training of backbones in remote sensing without labelled data.
We present a multi-modal approach to self-supervised representation learning for remote sensing data by combining imagery with geo-tagged audio recordings. These are sourced from a large crowd-sourced online library, called Radio Aporee ::: Maps, which puts great emphasis on quality and scenic descriptiveness of the submitted recordings. This collection provides a database of over 50,000 geo-tagged audio recordings with a combined length of more than 3,500 hours. Using the embedded geolocation, we match each recording with a corresponding image from Google Earth. The resulting SoundingEarth dataset far exceeds existing audiovisual datasets with regard to number of samples and duration of audio, making it a viable candidate for self-supervised learning methods.
Similar to recent self-supervised approaches like SimCLR and MoCo, our model learns features by matching corresponding embeddings in a high-dimensional embedding space while ensuring that non-matching embeddings remain far apart. Instead of generating multiple augmentations from a single image, our framework trains the models directly on the audiovisual correspondence. To this end, the image encoder (ResNet-18/50) is complemented with an audio encoder (ResNet-18), which extracts features from log-mel audio spectrograms. In extensive evaluations, a batch-wise extension of the triplet loss formulation surpassed the contrastive loss used in SimCLR.
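For illustration, a batch-wise triplet objective on paired image/audio embeddings could look like the following sketch; the margin value and the use of all in-batch pairings as negatives are assumptions, since the exact formulation is not spelled out here.

```python
import torch
import torch.nn.functional as F

def batch_triplet_loss(img_emb, aud_emb, margin=0.2):
    """For each image, the matching audio is the positive and every other
    audio in the batch serves as a negative (and vice versa)."""
    img_emb = F.normalize(img_emb, dim=1)
    aud_emb = F.normalize(aud_emb, dim=1)
    sim = img_emb @ aud_emb.t()                 # sim[i, j] = <img_i, aud_j>
    pos = sim.diag()                            # matched-pair similarities
    off_diag = ~torch.eye(len(sim), dtype=torch.bool)
    loss_i2a = F.relu(margin - pos.unsqueeze(1) + sim)[off_diag]
    loss_a2i = F.relu(margin - pos.unsqueeze(0) + sim)[off_diag]
    return 0.5 * (loss_i2a.mean() + loss_a2i.mean())

# Usage with placeholder embeddings from the two encoders:
loss = batch_triplet_loss(torch.randn(64, 128), torch.randn(64, 128))
```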
In order to validate the quality of the learned features, we evaluate our pre-trained models by fine-tuning the weights on a number of downstream tasks and comparing the performance to that of models initialized via other means. These experiments mainly evaluate the quality of the features extracted by the visual encoder, as imagery remains the primary data source in remote sensing. For comparison, we perform the same fine-tuning experiments with models initialized randomly, with ImageNet weights, and with weights obtained from recent SSL methods like SimCLR and MoCo. These downstream experiments show that models pre-trained using our audio-visual framework indeed outperform other models in tasks like aerial image classification and aerial image segmentation.
Further applications include combined audio-visual scene classification, where the type of the scene is predicted from aerial imagery and local sound, as well as cross-modal content retrieval, where the model can give an impression of the local sound ambience for given imagery.
Despite the existence of vast amounts of remote sensing data, most of it remains unlabeled and thus inaccessible to supervised learning algorithms. This issue can be partially alleviated by transfer learning. However, most of the models available for fine-tuning are pre-trained on ImageNet, a natural imagery dataset, and their generalization to remote sensing imagery is not guaranteed due to the domain gap; nevertheless, they often work very well in practice. The question, therefore, is: can we leverage all the unlabeled data to pre-train models with better generalization properties for satellite imagery than models pre-trained on ImageNet? The branch of artificial intelligence that tries to answer this question is known as unsupervised learning or, more recently, self-supervised learning (SSL). Although SSL has been growing in popularity in computer vision, most applications are restricted to the natural imagery domain; examples in specialized domains, such as medical or remote sensing, are only starting to appear. One such work is SeCo [1], where the authors propose a contrastive learning approach to obtain better initialization models. Models pre-trained with the SeCo framework achieve better performance than ImageNet models on downstream tasks such as image classification on EuroSAT and BigEarthNet.
Labelling a dataset is an expensive task: for reference, labelling ImageNet (a dataset with 14 million images and 22k different classes) took about 22 human-years. And even though this is a large dataset, there are far more concepts in the world (video, temporal information, etc.) that it does not cover. Overall, labelling does not scale well. Self-supervised learning tries to solve this problem by learning from the data itself, without labels, by observing some part of the data and trying to predict another, hidden part. This is called a "pretext task" and, in the context of computer vision, examples include predicting the relative position of image patches, solving jigsaw puzzles, and predicting rotations. These tasks, however, have little to do with the downstream task (i.e., classification, object detection, or segmentation). Ideally, pre-trained features should represent how images relate to one another and also be robust to factors such as lighting, object position, and color. To achieve this goal, SSL models take advantage of data augmentation, encoding different augmented versions of the same image and maximizing the similarity between their feature representations. This is the spirit behind most recent SSL methods, such as contrastive learning methods (MoCo, PIRL, SimCLR), clustering methods (DeepCluster, SeLa, SwAV), and distillation methods (BYOL, SimSiam).
A recent alternative based on redundancy reduction, called Barlow Twins [2], has been proposed in contrast to similarity maximization, inspired by concepts from information theory and neuroscience. Roughly speaking, we want the output produced by a neuron to be invariant to image transformations, while also reducing the redundancy between all the neurons. We can achieve this by computing the empirical cross-correlation of the features produced by the network for two augmented views and forcing it to be as close as possible to the identity matrix. Compared with other methods, such as contrastive learning or clustering, Barlow Twins is simpler to implement and requires fewer computational resources to achieve good results, mainly because it avoids the use of negative examples and is robust to batch size. On the other hand, the method is sensitive to the data augmentation used during training and to the projection dimensionality.
The end goal is to obtain a pre-trained network providing a good weight initialization for downstream tasks. For validating pre-trained models, the most commonly used technique is "linear probing", which consists of training a linear classifier on top of the frozen pre-trained backbone on a public dataset and reporting metrics on that task.
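A minimal sketch of the Barlow Twins objective described above, assuming two batches of already-encoded augmented views (the feature dimensionality and the off-diagonal weight lam are illustrative):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Push the cross-correlation of the two views' features toward the
    identity: diagonal terms enforce invariance, off-diagonal terms
    enforce redundancy reduction."""
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # per-feature standardization
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.t() @ z2) / n                # empirical cross-correlation (d, d)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

# Usage with placeholder embeddings of two augmented views of a batch:
loss = barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128))
```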
At EarthPulse, within the context of the ESA-funded AI-Pathfinder project, in which we explore the latest AI techniques in the ESA development agenda, we demonstrated the ability of unsupervised learning to improve the data efficiency of neural networks.
We pre-trained two reference neural networks (a ResNet-18 and a ResNet-50) and compared their performance with ImageNet pre-trained models and randomly initialized networks. For a fair comparison, we used the SeCo dataset to pre-train the models. The SSL approach resulted in models with better generalization and outstanding performance in the low-label regime, which we are particularly interested in. For validation, we performed linear probing on the EuroSAT dataset at different label ratios. With as few as 20 examples per class (a total of 200 images across all classes), the model was able to achieve almost 90% accuracy. This is a massive decrease in labelling effort; in contrast, ImageNet pre-trained models achieve 80% accuracy and a model trained from scratch 57%. Another important finding is the small relative difference between the fine-tuned and transfer-learned versions of the SSL models, clearly indicating that the features learned in the pre-training phase are very good. Keep in mind that the transfer-learned versions only train a linear classifier, resulting in low computational cost and faster training (20k vs. 24M trainable parameters in the case of a ResNet-50). Furthermore, fine-tuning the SSL backbone gives better metrics than fine-tuning an ImageNet pre-trained model up until 20% of the labels per class are available (around 400 samples). An important remark is that neither training from scratch nor transfer learning from ImageNet can improve further, since the architecture and data are fixed; results for SSL, however, can in principle improve with better SSL strategies and more (unlabeled) data.
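For reference, the linear probing protocol amounts to something like the following sketch (a torchvision ResNet-18 stands in for the pre-trained backbone; the batch, class count, and training loop are placeholders):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18()                 # load the SSL pre-trained weights here
backbone.fc = nn.Identity()           # expose the 512-d feature vector
for p in backbone.parameters():
    p.requires_grad = False           # freeze the backbone entirely

probe = nn.Linear(512, 10)            # e.g. the 10 EuroSAT classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# One illustrative training step on a placeholder labelled batch:
images, labels = torch.randn(32, 3, 64, 64), torch.randint(0, 10, (32,))
with torch.no_grad():
    feats = backbone(images)          # features come from the frozen model
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
optimizer.step()
```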
In this abstract we extend the study to a data fusion problem. Data fusion is one of the most important aspects of deep learning applied to Earth Observation: being able to leverage information from multiple satellites at the same time is crucial, especially when some images are unavailable or unusable and other sensors without those limitations can help. An interesting application of Sentinel-1/Sentinel-2 data fusion is cloud removal [3], where a generative network makes use of an S1 image to produce a cloudless S2 image from its cloudy counterpart. This suggests that the information occluded by clouds can be recovered from the S1 image, since radar can see through clouds. The artificially generated cloudless S2 image can then be used for the downstream task, although the performance and applicability of such images is still an open issue. In this work, two separate neural network backbones are pre-trained on Sentinel-1 and Sentinel-2 images, respectively. Afterwards, the models are fine-tuned in a data fusion configuration for a downstream classification task, achieving strong performance with a low number of labelled samples.
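A rough sketch of one possible fusion configuration, assuming late fusion by feature concatenation with a shared classification head; the exact fusion mechanism is not specified above, and the channel counts are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_backbone(in_channels):
    net = resnet18()                  # load the SSL pre-trained weights here
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Identity()            # output 512-d features
    return net

s1_backbone = make_backbone(in_channels=2)   # Sentinel-1: VV + VH
s2_backbone = make_backbone(in_channels=13)  # Sentinel-2: 13 bands
head = nn.Linear(512 + 512, 10)              # e.g. 10 land cover classes

# Paired S1/S2 patches of the same location (placeholder tensors):
s1, s2 = torch.randn(8, 2, 64, 64), torch.randn(8, 13, 64, 64)
fused = torch.cat([s1_backbone(s1), s2_backbone(s2)], dim=1)
logits = head(fused)
```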
[1] Oscar Mañas et al., "Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data," 2021.
[2] Jure Zbontar et al., "Barlow Twins: Self-Supervised Learning via Redundancy Reduction," 2021.
[3] Andrea Meraner et al., "Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion," ISPRS Journal of Photogrammetry and Remote Sensing, 2020.