"Increasingly relevant to Earth observation (EO) research, are issues regarding the process of data collection while handling the privacy and safety of data and models. Machine Learning and in particular Deep Learning is one of the most successful tools in this petabyte-generating science, to reveal the specific patterns of interest. While the amount of data is becoming more and more important, the question of how to deal with this data is also becoming more relevant.
Many EO questions, such as climate change, concern global phenomena that manifest in locally observed structures. To exploit a large number of different data domains, the corresponding data have to be pooled, which in some cases means that private data of certain operators are made available to all.
A highly relevant problem in EO therefore concerns the privacy of specific data sources, such as those used in remote sensing. Whereas conventional processing schemes operate on centralized data, the ML paradigm of federated learning operates directly on the data-generating edge devices.
We discuss the lessons learned from demonstrators in the automotive sector inside the Catena-X network and give an outlook on how to apply them to the problems of the EO sector.
INTRODUCTION
To produce accurate and valuable results, Machine Learning models need large data sets for training. Either the developer must collect these data, or the models must be brought to the data. Providing access to data and models poses risks of data and model leakage, violation of privacy, and unauthorized appropriation or corruption of algorithms. New technologies must mitigate these risks and allow use and exploitation without violating ownership and copyright of the users' data and models.
This paper presents the recently completed BLENDED project performed for the European Space Agency. BLENDED looked at technical issues perceived to be blocking the massive exploitation of space data. It investigated how the synergistic use of blockchain and deep learning can help to best utilise the immense wealth of knowledge being generated nowadays. Blockchain was used to increase platform security through a peer-to-peer (P2P) network providing transparency, accessibility and trust to all participants. Smart contract code describes the conditions of a contract and the actions to be taken when these conditions are fulfilled. Both the contract and its execution are observable by the whole network, protecting against fraudulent behaviour. Immutability guarantees that information cannot be removed or altered after it has been transmitted, and can be used for accountability later.
APPROACH
BLENDED provides a platform that enables access to data and models and is able to run application workflows in a secured and controlled manner. Execution requests issued by end-users must receive all the necessary authorisations before being actually submitted for execution. The transmission of input data and processing results is done via an IPFS network in which the data files may be encrypted.
Task execution requests, including specific resource requests, are encoded in smart contracts and submitted to the blockchain. The contract identifies the application to be executed, the target computing platform and the list of input datasets needed to run the application. Each of these selected resources has a uniquely identified owner whose authorisation is required to eventually execute the application. In addition to their acknowledgement, the data owners also provide the address of the corresponding data stored in the IPFS network. The application is executed in the target platform when the corresponding contract has been fully acknowledged. The recording of each acknowledgement in the smart contract requires the submission of a transaction in the blockchain.
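To make this life-cycle concrete, the following minimal Python sketch models the acknowledgement logic described above. All names are hypothetical; the actual BLENDED contracts are Ethereum smart contracts, not Python.

```python
# Illustrative sketch of the task-contract life-cycle described above.
# Hypothetical names; the real BLENDED smart contracts run on Ethereum.
from dataclasses import dataclass, field

@dataclass
class TaskContract:
    application_id: str
    target_platform: str                      # its owner must acknowledge
    input_datasets: list                      # each dataset owner must acknowledge
    acks: dict = field(default_factory=dict)  # resource id -> IPFS address (or True)

    def acknowledge(self, resource_id: str, ipfs_address: str = "") -> None:
        """Record an owner's authorisation; on-chain this is one transaction."""
        required = [self.target_platform, *self.input_datasets]
        if resource_id not in required:
            raise ValueError(f"{resource_id} is not part of this task")
        self.acks[resource_id] = ipfs_address or True

    def fully_acknowledged(self) -> bool:
        """Execution may be triggered only once every owner has acknowledged."""
        return all(r in self.acks for r in [self.target_platform, *self.input_datasets])

task = TaskContract("urban-change-training", "it4i-hpc", ["liege-s2", "liege-s1"])
task.acknowledge("it4i-hpc")
task.acknowledge("liege-s2", "QmXoY...")   # data owners also provide the IPFS address
task.acknowledge("liege-s1", "QmAbC...")
print(task.fully_acknowledged())           # True -> the platform may start the run
```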
In order to secure access to the input data, the platform owner may store in the contract a public key that the data owner can use to encrypt the data before uploading it to the IPFS network. This data will then only be readable within the target platform. Similarly, the owner of the task, that is, the user who created the task smart contract, may store in the contract a public key that is then used inside the computing platform to encrypt the processing outputs before uploading them to the IPFS network. Encrypted outputs are then readable by the task owner only.
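The paper does not specify the encryption scheme, so the following sketch only illustrates one plausible realisation of this flow: hybrid encryption (a fresh symmetric key wrapped with the RSA public key taken from the contract), built on the Python cryptography library. Key sizes and names are assumptions.

```python
# Hedged sketch of the encrypt-before-upload flow described above, using
# hybrid encryption: RSA-OAEP wraps a fresh symmetric (Fernet) key.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

# Stand-in for the key pair whose public half is stored in the task contract.
platform_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = platform_key.public_key()

OAEP = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def encrypt_for_platform(data: bytes) -> tuple[bytes, bytes]:
    """Encrypt data with a fresh symmetric key; wrap that key for the platform."""
    sym_key = Fernet.generate_key()
    ciphertext = Fernet(sym_key).encrypt(data)
    wrapped_key = public_key.encrypt(sym_key, OAEP)
    return wrapped_key, ciphertext   # both are what gets uploaded to IPFS

wrapped, blob = encrypt_for_platform(b"Sentinel-2 tile ...")
# Only the holder of the private key (the target platform) can unwrap and read:
sym = platform_key.decrypt(wrapped, OAEP)
print(Fernet(sym).decrypt(blob)[:15])
```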
BLENDED PLATFORM
The BLENDED platform includes the following main sub-systems:
- An IPFS storage network for storing, optionally encrypted, input and output products (EO and ground truth datasets, trained models, processing results, etc.) in an immutable and decentralised manner. A catalogue database stores searchable metadata about the resources available in the IPFS network, and a Resource Manager facilitates the encryption/decryption of data files and their transfer from/to the IPFS network (see the upload sketch after this list).
- An Ethereum-based blockchain with smart contracts for managing the application execution requests and the necessary authorisations. Distributed Ledger Technology (DLT) provides tamper-evident, replicated storage and thus serves to record the necessary information about tasks, all their (meta-)data, and the associated accounting. The smart contracts introduce a distributed mechanism of negotiation between the various parties owning valuable assets; this negotiation is immutable over time. The smart contract code also controls which actions are allowed on it, for example depending on its current state. In BLENDED, the task smart contracts ensure that only valid transactions are executed throughout their life-cycle. Tools have been implemented to facilitate the update of the smart contracts by the various BLENDED actors.
- An Execution Platform built using the Automated Service Builder (ASB) framework and used by the blockchain components for running the selected applications. The framework orchestrates the workflows implementing the applications in a distributed and potentially hybrid environment. In BLENDED, training algorithms are executed in the HPC environment of IT4Innovations. ASB exposes interfaces that allow the blockchain sub-system to trigger the execution of applications and collect the results upon completion.
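As an illustration of the upload step handled by the Resource Manager, the sketch below adds a (previously encrypted) file to IPFS through the standard Kubo HTTP API. The endpoint and function names are assumptions for illustration, not BLENDED code.

```python
# Minimal sketch of storing and retrieving a file via the IPFS (Kubo) HTTP
# API, assuming a local daemon on the default port. Illustrative only.
import requests

def ipfs_add(path: str, api: str = "http://127.0.0.1:5001") -> str:
    """Upload a file and return its content identifier (CID)."""
    with open(path, "rb") as fh:
        resp = requests.post(f"{api}/api/v0/add", files={"file": fh})
    resp.raise_for_status()
    return resp.json()["Hash"]   # the immutable address recorded in the contract

def ipfs_cat(cid: str, api: str = "http://127.0.0.1:5001") -> bytes:
    """Fetch a file back by its CID."""
    resp = requests.post(f"{api}/api/v0/cat", params={"arg": cid})
    resp.raise_for_status()
    return resp.content
```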
DEMONSTRATION STUDY CASE
BLENDED demonstrates secure, private and distributed processing capabilities with a study case: training deep neural network models to predict urban changes using multi-temporal, multi-spectral optical and synthetic aperture radar (SAR) data. To demonstrate the applicability to real-world problems, we used three different sites (Liège, Rotterdam and Limassol), with data from the Landsat 5 and ERS-1/2 (1995-2010) missions as well as the Sentinel-1 and Sentinel-2 (2017-now) missions. The sites cover a diversity of climate and urban characteristics and provide a real-world data volume of over 1 TB in total.
Testing of the model predictions is supported by non-space data (e.g. cadastral data), which is made available for the three sites for selected observation periods.
EXPERIENCE AND ACHIEVEMENTS
The outcome of the BLENDED project is an integrated platform that consists of a number of tools, applications and service deployments.
The IPFS technology has provided performance that is more than acceptable for the study cases of the BLENDED project. The added benefit of immutability contributes to the platform's overall trustworthiness for transferring valuable data. At the same time, public data sets can benefit from the de-duplication and persistence features provided by IPFS.
Finally, the smart contracts, as well as the applications that interact with the blockchain or IPFS, can easily be reused with any deployment of either technology. This provides the flexibility to eventually deploy the solution on public services that may offer more trust or reliability, and to bind the use cases to a real-life (crypto)currency economy.
CONCLUSION AND FUTURE STEPS
BLENDED has achieved its goals, but we recognise that there is room for improvement. The current implementation requires trust in the computing platform provider. Homomorphic encryption (HE) was considered and included in the concept but not implemented in BLENDED. This is because, rather than artificial/academic solutions, BLENDED had to demonstrate real-world workloads with large data set sizes (hundreds of GBs) and non-trivial model complexity. At the point in the project when a decision on implementing HE was needed, no solution was available that did not incur severe processing overheads. Once methods exist to make it viable, HE can be integrated in the BLENDED platform. Furthermore, BLENDED allows the use of TensorFlow Privacy (for differential privacy), which could be integrated into the algorithms. Data providers would then be able to add noise to their datasets to "protect" them. Studying this solution is proposed as part of follow-up work.
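As an indication of how TensorFlow Privacy could be integrated, the sketch below swaps a standard Keras optimizer for the library's DP-SGD optimizer, which perturbs gradients during training (one common way to use the library). The model, data shape and hyperparameters are placeholders, not BLENDED's urban-change network.

```python
# Hedged sketch of integrating TensorFlow Privacy's DP-SGD optimizer into a
# Keras training algorithm; all hyperparameters below are placeholders.
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 4)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),
])

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,   # Gaussian noise added to the clipped gradient sum
    num_microbatches=32,    # must evenly divide the batch size
    learning_rate=0.05,
)

# DP-SGD needs per-example losses, hence reduction=NONE.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
```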
We also considered the use of "secure enclaves", for which there is hardware support in modern processors. The availability of such systems (e.g. with processors supporting AMD Secure Memory Encryption (SME), AMD Secure Encrypted Virtualization (SEV) and AMD Secure Encrypted Virtualization-Encrypted State (SEV-ES)) is currently limited. Moreover, implementing this feature (properly and honestly) would still require trust in the platform provider.
The BLENDED platform is generic, i.e. not dependent on the data and the applications. Other datasets and formats, but also algorithms and workflows, can be considered. Different parties (actors) may be involved, with different interests (i.e. security/privacy requirements). In addition to the demonstrated case study, the BLENDED platform can easily be used for other purposes, for instance for transfer learning by providing a pre-trained model, or for inference/prediction on third-party data using a trained model.
The BLENDED platform, and more specifically the ASB framework, offers the possibility to train models where the private data is located. ASB may be configured to deploy and run workflow processes in environments that meet a minimal set of requirements. Private data would then be accessed locally instead of being fetched from the IPFS network. The trained model is encrypted and uploaded to the IPFS network. The requesting user becomes the owner of the model, as he/she is the only person able to decrypt it.
In conclusion, BLENDED has prototyped a platform that allows processing Earth Observation and geospatial data in a trusted and secured manner. Although additional work is needed to implement features such as homomorphic encryption, a key aspect has been demonstrated: Blockchain, IPFS and AI/ML can be used in combination on real-world cases while meeting realistic performance criteria.
REFERENCES
- "BLENDED - Blockchain and Machine Learning", https://www.spaceapplications.com/news/blockchain-and- machine-learning, 2020.
- "Blockchain and Earth Observation", ESA White Paper, results from the Phi-Week 2018 workshop on Blockchain. https://eo4society.esa.int/wp-content/uploads/2019/04/Blockchain-and-Earth-Observation White-Paper-April-2019.pdf, 2019
-
ASB: https://www.spaceapplications.com/products/automated-service-builder-asb/
This paper proposes a review of privacy-preserving techniques for statistical and Earth Observation data that can facilitate private visual mapping of useful socioeconomic indicators at a population-wide level. Surveying is a common task undertaken by numerous organizations to assess a wide range of general socioeconomic statistical indicators at a scale and granularity not feasible through other means of data collection. Furthermore, Earth Observation data is increasingly available, with high-resolution optical imagery with fast revisit rates readily obtainable, making it an effective means of visualising the results derived from statistical surveys. This high-resolution imagery can be labelled with the feature-rich statistical data to train powerful predictive models, developed through recent advances in the field of Machine Learning. This makes it possible to generate new, representative data at greater scale and frequency than is possible when relying on in-person survey-taking, producing useful visual maps of actionable intelligence for previously unsurveyed regions. These maps can then be used to help shape critical policy and decision-making, directly improving the quality of life of whole populations.
However, to achieve these goals, effective analysis of the information-rich survey data is required. This process is hampered by potential privacy concerns over survey respondents' data, which, since the introduction of data privacy regulations such as the General Data Protection Regulation in the EU, has become increasingly sensitive. How can data be shared in a way that facilitates maximum information discovery whilst preserving the privacy of survey respondents? We carry out a review of the privacy-preserving techniques used on statistical data, identifying the main groups of techniques and assessing them based on their effectiveness, efficiency and privacy. We focus on the application of these techniques to statistical data sources that can be directly linked with Earth Observation imagery, with the example use case of producing visual maps of socioeconomic and statistical trends across populations from census data. These maps can be populated by the output of machine learning models trained in such a way that the privacy and anonymity of the training data is guaranteed. Finally, we propose an ensemble of selected privacy-preserving techniques for this example use case, consisting of Differential Privacy, Generative Adversarial Networks and Continual Learning techniques, used in combination with Convolutional Neural Networks to privately generate useful maps of socioeconomic data in unsurveyed regions.
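As a concrete illustration of the Differential Privacy component proposed above, the sketch below shows the Laplace mechanism, the textbook building block for releasing aggregate statistics about survey respondents. The query and numbers are made-up examples.

```python
# Illustrative sketch of the Laplace mechanism: release a query answer with
# epsilon-differential privacy by adding calibrated Laplace noise.
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a noisy answer; smaller epsilon means stronger privacy."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# e.g. a counting query over census respondents (sensitivity 1: one person
# joining or leaving the dataset changes the count by at most 1)
households_below_threshold = 1_284
print(laplace_mechanism(households_below_threshold, sensitivity=1.0, epsilon=0.5))
```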
Fishing is an extremely important maritime activity and has a significant impact (for better and for worse) on the ocean's health and sustainability as well as on the livelihood, economic development and food security of many communities. The commercial fishing sector has expanded significantly in the past decades, and total production, trade and consumption reached an all-time record of 96.4 million tonnes in 2018 (Stankus, 2021). A vast amount of fishery data is both relevant and needed for ensuring that the management of fish resources is logistically effective, economically viable, ecologically responsible, and biologically safe (Barkai and Bergh, 2003; Barkai et al., 2012; Gilman et al., 2019). Commercial fishing is a highly dynamic and multifaceted activity that exists within a complex and ever-changing ecosystem. The inherent complexity of fishing operations demands that many kinds of data be considered when providing management advice to regulatory authorities.
Though the Information and Communications Technology (ICT) revolution took the world by storm during the second half of the 20th century (Aronson and Cowhey, 2010), it seemed to completely bypass the commercial fishing sector. This comes as a surprise when one considers the vast amount of data routinely collectable within the sector. The disconnect between technology and its application within the fishing industry cannot be explained by a lack of interest from fishers or disbelief in the importance of such data, since the consistent collection of data by fishers throughout history demonstrates the value they place on such information. Instead, the problem stems from sensitive concerns about data privacy within the sector, where fishers view data as a commercial asset, empowering them to locate good fishing grounds and, in turn, improve their productivity.
Over the past 10 years, however, fishers in many parts of the world have been mandated to use electronic logbooks (eLog) to record and report on their fishing activities in order to maintain their licenses. Consequently, many fisheries have gained the capacity to record and report large amounts of good-quality data with greater ease. Yet this increased introduction of data technology in the sector has raised fishers' apprehension, since they worry that reporting their electronically collected data would compromise their perceived control over this information. Fishers' mistrust of the system results in a reluctance to report any data other than that required for compliance. A lack of guaranteed privacy in the sharing of fishers' data therefore results in a widespread pattern of incomplete and inaccurate data being captured and fed into resource management decision-making. Science and management, however, demand much more data than that submitted for core compliance, requiring fisheries data to be at the highest possible resolution and available for merging of vessels' information, so that analysis can extract the maximum scientific and management value.
In response to the prevalent issues of mistrust and misreporting, the Privacy-Preserving Machine Learning / Artificial Intelligence (PPML) project includes a specific use case that attempts to prove and demonstrate to fishers and resource managers that it is possible to merge and analyse the data of many fishers, some in competition with each other, whilst maintaining confidentiality and data security. The main objective of the Privacy Preserving Bycatch Avoidance Application (PPBAA) is therefore to show fishers that they can derive significant value from shared data by combining it with Earth Observations, without the need to give away operational and commercial "secrets".
The application framework relies on geolocated catch per unit effort (CPUE) information collected by OLSPS' Olrac eLog system as a training dataset for a machine learning application aimed at generating predicted catch distribution maps for each species of fish based on environmental parameters (SST, chlorophyll, lunar index and others). While CPUE is notably affected by various biasing factors such as fisher experience and fishing gear technology, it is still suitable for this application, especially considering that the main focus of the use case is the protection of sensitive data rather than the Machine Learning (ML) model's predictive performance.
Training data consists of CPUE distribution maps for both the target species (hake) and all other species, which are thus defined as bycatch. ML capabilities are employed to relate environmental data obtained from Copernicus Marine Services to the labels provided by the training dataset. By employing state-of-the-art machine learning approaches that have proven to be of great use for fishing pattern prediction, PPBAA shall produce accurate potential CPUE maps for both target species and bycatch. In this way, the model outputs provide a means to predict specific areas where bycatch rates should be higher, due to environmental characteristics that can be monitored by satellites. This shall allow fishers to plan their fishing operations in advance based on likelihood maps of different species densities, favouring the avoidance of bycatch.
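The concrete PPBAA model is not described here, so the following sketch only illustrates the kind of mapping involved: a generic regressor (scikit-learn) relating environmental parameters to CPUE labels. Column names, the model choice and the synthetic data are all assumptions.

```python
# Hedged sketch of a CPUE-prediction step: a regressor mapping satellite
# environmental parameters to catch per unit effort. Illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
train = pd.DataFrame({
    "sst": rng.uniform(12, 22, 500),          # sea-surface temperature, deg C
    "chlorophyll": rng.uniform(0.1, 5, 500),  # mg/m^3, e.g. from Copernicus Marine
    "lunar_index": rng.uniform(0, 1, 500),
    "cpue_hake": rng.uniform(0, 100, 500),    # label from the eLog training data
})

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[["sst", "chlorophyll", "lunar_index"]], train["cpue_hake"])

# Predict over current environmental conditions to build a CPUE map; the same
# can be repeated per bycatch species to flag areas to avoid.
grid = pd.DataFrame({"sst": [15.2], "chlorophyll": [1.3], "lunar_index": [0.7]})
print(model.predict(grid))
```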
The information used in this project is a mixture of data OLSPS obtained with the permission of several of its clients using the Olrac eLog system. Specific datasets are provided by a fishery that principally operates within the hake industry off the South African and Namibian coastlines, whose major capture method is trawl-based fishing. The provided dataset contains 816 fishing trips spanning a period of 5 years, from 2009 to 2014.
The entire system resides within OLSPS' OlracDDM web server hosted on Azure and is protected by the data protection system provided by Scontain. This company develops the SCONE confidential computing platform (Arnautov et al., 2016), which facilitates always-encrypted execution: one can run services and applications such that neither the data nor the code is ever accessible in clear text, not even to root users. Only the application code itself can access the unencrypted data and code. SCONE simplifies the task of encrypting the input, executing the service/application in encrypted memory on an untrusted host, transparently encrypting the output, and shipping the output back to the client.
All of the above is achieved through security policies. OLSPS will define policies that ensure that only certain services (e.g., a specific machine learning algorithm) can access the data. Deimos will define policies that firewall the algorithm so that no one can inspect it, as it contains intellectual property that needs to be protected. Firewall rules can be verified by ensuring that the algorithm can only access data if the generated model is exclusively visible to a privacy checker under the control of the data provider.
Data analysis is therefore conducted in a "blind" manner, and the analytical outputs are provided in a way that ensures complete anonymity of the training data. This creates several challenges, as it becomes very difficult for the practitioner to understand the data that will be used in the model, as well as to properly evaluate the model.
The main aspect on which PPBAA's performance is to be evaluated is the protection of sensitive information, implemented by design and assured by security policies. A key privacy requirement is that all identifiers of the vessels that provided the original training datasets remain protected throughout the duration of the project. The use case will also evaluate the challenges of building good machine learning models without access to the data.
This project has received funding from the European Space Agency Contract No. 4000134424/21/I-NB.
Bibliography
Arnautov, S., et al., 2016. "SCONE: Secure Linux Containers with Intel SGX." 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).
Aronson, J. and Cowhey, P., 2010. The Information and Communication Revolution and International Relations. Oxford Research Encyclopedia of International Studies.
Barkai, A., Meredith, G., Felaar, F., Dantie, Z. and de Buys, D., 2012. The Advent of Electronic Logbook Technology - Reducing Cost and Risk to Both Marine Resources and the Fishing Industry. World Academy of Science, Engineering and Technology International Journal of Biological, Biomolecular, Agricultural, Food and Biotechnological Engineering, 6(7).
Barkai, A. and Bergh, M., 2003. Use and Abuse of data in fishery management. Deep Sea 2003: Conference on the Governance and Management of Deep-sea Fisheries, 27-29 November 2003, Dunedin. Theme 4: Technology requirements.
Gilman, E., Legorburu, G., Fedoruk, A., Heberer, C., Zimring, M. and Barkai, A., 2019. Increasing the functionalities and accuracy of fisheries electronic monitoring systems. Aquatic Conservation: Marine and Freshwater Ecosystems, 29(6), pp.901-926.
Stankus, A., 2021. State of world aquaculture 2020 and regional reviews: FAO webinar series. FAO Aquaculture Newsletter, 63: 17-18.
Maritime surveillance systems have progressively been employed to keep pace with the great increase in fishing activity that is stressing ocean biodiversity. Stricter regulations have enforced the use of tools such as the Automatic Identification System (AIS) and Vessel Monitoring System (VMS), which not only prevent vessel collisions in low visibility but can also report fishing vessel positions to authorities, and electronic logbooks (eLog) that record and report on fishing activities, among other measures. Nevertheless, these are still not enough to provide a proper understanding of the impact of fishing on ecosystems [1-3].
One possible solution for assessing the impact of this human activity is to combine AIS data with Earth Observation (EO) to monitor protected and/or particularly sensitive areas or other fisheries closure zones, by either generating density maps of fishing vessels or detecting Illegal, Unregulated and Unreported (IUU) fishing. This is especially important for detecting the activities of smaller vessels that have no trackers fitted, or of vessels that switch off their AIS devices or tamper with them to shield illicit activities. Satellite-based vessel detection is thus a very active study area, with the first work done in 1978 with Landsat-2 MSS and a noticeable increase in the use of Machine Learning (ML) and Deep Learning methods in recent years [4,5].
While the lack of training datasets has been one of the key factors delaying such developments, data collected by AIS have recognized potential to fulfill that role. However, even if these data are legally considered public information, in the sense that they are available for any entity to collect or purchase, being a commercial service they still incorporate innate business value. Sharing this information in an unprotected way thus represents an improper transfer of value, with evident financial losses to the entity that purchased the data in the first place. Additionally, since this data provides the location of fishing vessels, it may also indicate potential fishing areas where this activity is usually concentrated, underlining the sensitive nature of this information and the necessity for it to be treated as confidential. For the same reason, AIS data should be strictly protected from interference and falsification, conserving its value and reliability for further analysis. The combination of these factors results in an unwillingness of several different public and non-profit organizations that routinely collect this kind of data to share it among the community of users.
It is in this scenario that Privacy Preserving Machine Learning (PPML) techniques can be extremely useful, by providing a trusted framework to create high-end models with private data. By employing PPML techniques, it is possible to harness the AIS positioning information of different owners while protecting the commercial and sensitive nature of these datasets and greatly minimizing the risk of spoofing and falsification. To address this challenge, the Privacy-Preserving Machine Learning / Artificial Intelligence (PPML) project includes a specific use case to implement such a framework. The Privacy Preserving Fishing Vessel Detection (PPFVD) application thus aims to overcome the current constraints on the use of AIS as training data for an ML-based vessel detection service, increasing the potential of these surveillance systems to support the sustainable management of fisheries.
Selecting the most suitable EO data for PPFVD is critical, as many factors are involved: availability, revisit time, resolution, price (if the satellite data is not public), available sensors (e.g., only visible bands, SAR, NIR, etc.), among others. A judicious trade-off analysis of those factors is thus performed to identify potential EO data to be employed in a vessel detection service. The majority of previous related work used private satellite data to demonstrate new vessel detection techniques. Although private constellations may provide better resolutions, they imply a considerable budget to acquire the necessary volume of data, a constraint amplified by the high volume of available AIS data, which would require a correspondingly large and expensive volume of commercial imagery. Therefore, despite its lower resolution (10 m/20 m), PPFVD takes advantage of the Copernicus Sentinel EO data, freely distributed by the European Space Agency.
Considering the need to leverage the AIS data of different providers and the associated privacy concerns, PPFVD employs Federated Learning (FL) techniques to develop the vessel detection pipeline. FL can be defined as a setting where several machines (clients) hold data that cannot be shared, and a central entity (a server) coordinates the updates of a model that is trained individually on each client and aggregated on a central server [6-8]. The main motivation behind this emerging field is to avoid the sharing of data, allowing models to be trained on thousands of devices (e.g., smartphones) without compromising the privacy of users, while ensuring the necessary security between all entities involved.
The PPFVD framework therefore uses open Earth Observation data to create an ML vessel detection pipeline with FL. AIS data from the Portuguese EEZ collected between 2016 and 2018 is distributed across different client machines (Silos) corresponding to the main geographical areas: Azores, Madeira and mainland Portugal. In each Silo, AIS data is analyzed and automatically mapped to Sentinel imagery, labeling areas with the highest probability of containing a vessel. This constitutes the training data for the algorithm, used individually in each Silo. The server sends its current model to the Silos, and each Silo updates the model with its training data. The server then collects all model updates and aggregates them to build an improved model, which is sent back to the Silos for a new round of training, as sketched below. After some rounds of training, the model stops improving (converges), at which point it is ready to be tested. PPFVD aims to ensure that the algorithm trained in this framework can achieve performance comparable to that of a similar algorithm trained in the conventional manner, with all data at once.
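For illustration, the server-side aggregation of one such round can be sketched as the standard FedAvg weighted average (a generic formulation, not the actual PPFVD code); silo names and sizes below are made up.

```python
# Minimal sketch of a server-side FedAvg aggregation round: per-Silo model
# weights are averaged, weighted by local dataset size. Illustrative only.
import numpy as np

def fedavg(silo_weights: list[list[np.ndarray]], silo_sizes: list[int]) -> list[np.ndarray]:
    """Aggregate per-Silo model weights into the new global model."""
    total = sum(silo_sizes)
    n_layers = len(silo_weights[0])
    return [
        sum((size / total) * w[layer] for w, size in zip(silo_weights, silo_sizes))
        for layer in range(n_layers)
    ]

# e.g. three Silos (Azores, Madeira, mainland) with different amounts of
# AIS-labelled imagery; each inner list holds one layer's weight tensor
silos = [[np.ones((3, 3)) * k] for k in (1.0, 2.0, 3.0)]
new_global = fedavg(silos, silo_sizes=[100, 150, 750])
print(new_global[0][0, 0])   # weighted mean of the layer weights: 2.65
```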
To ensure full privacy of the data and security of the application, PPFVD runs inside a Trusted Execution Environment (TEE). We use the SCONE confidential computing platform provided by Scontain to run applications inside enclaves. Inside an enclave, code and data are always encrypted, with only the application code itself being capable of reading the data in clear text. SCONE is also employed in the communication process within the FL framework, ensuring that all communications between the Silos and the server are protected.
This project has received funding from the European Space Agency Contract No. 4000134424/21/I-NB.
References:
[1] de Souza, E., Boerder, K., Matwin, S., Worm, B. (2016). Improving Fishing Pattern Detection from Satellite AIS Using Data Mining and Machine Learning. PLoS ONE 11(7): e0158248. doi:10.1371/journal.pone.0158248
[2] Heiselberg, H., A Direct and Fast Methodology for Ship Recognition in Sentinel-2 Multispectral Imagery. Remote Sens. 2016, 8, 1033. https://doi.org/10.3390/rs8121033
[3] Ciocarlan A, Stoian A. Ship Detection in Sentinel 2 Multi-Spectral Images with Self-Supervised Learning. Remote Sensing. 2021; 13(21):4255. https://doi.org/10.3390/rs13214255
[4] Kanjir, U., Greidanus, H., Oštir, K., Vessel detection and classification from spaceborne optical images: A literature survey. Remote Sensing of Environment, Volume 207, 2018, Pages 1-26, ISSN 0034-4257, https://doi.org/10.1016/j.rse.2017.12.033.
[5] Li, B., Xie, X., Wei, X., Tang, W., Ship detection and classification from optical remote sensing images: A survey. Chinese Journal of Aeronautics, Volume 34, Issue 3, 2021, Pages 145-163, ISSN 1000-9361
[6] Li, T., Sahu, A., Talwalkar, A., Smith, V., Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50-60, May 2020, doi: 10.1109/MSP.2020.2975749.
[7] Liu, J., Huang, J., Zhou, Y., Li, X., Ji, S., Xiong, H., Dou, D., From Distributed Machine Learning to Federated Learning: A Survey. CoRR abs/2104.14362 (2021).
[8] Kairouz, P., et al., Advances and Open Problems in Federated Learning. now Publishers, 2021.
"Machine learning techniques are enabling significant innovation in data exploitation in many sectors. However, innovation in Earth Observation (EO) applications is currently hindered by the fragmentation of data ownership. This results in organisational silos through which data cannot easily be shared. The barriers to sharing are in part commercial sensitivities and in part privacy protection requirements. Commercial sensitivities exist around the value and control of the data, and what it may reveal about the sensors and processes that generated it. Privacy protection requirements are stringent for personally identifiable information and breaches attract severe penalties in many jurisdictions.
These barriers to data sharing are particularly significant for many EO use cases, which require the combination of an EO product and a dataset representing a ground truth of human-related behaviour or activity. Such data is typically highly sensitive from a privacy perspective. In most cases, the complexity or infeasibility of managing such commercial and privacy constraints invalidates the business benefit of offering the service. Advances in verifiably privacy-preserving techniques could, however, tip this balance by providing the assurances required for data owners to exploit their data confidently, with overheads that maintain a viable business model.
Nowadays, a wide range of privacy-preserving techniques is available, with different levels of maturity and with different capabilities and properties. Among them, Federated (or Collaborative) Learning (FL) [MMR+17] seems to be a particularly promising approach. It addresses the lack of data by enabling the training of a model on various private datasets without the need to centralize them. In more detail, local models are trained on local private datasets, and only some information about these local models is centralized and aggregated in order to construct the global model that captures the knowledge of the multiple local models. In the classical version of FL, all local models share the same architecture, and the global model is constructed using, for instance, a weighted average of the local model weights (see the formula below). Classical FL was shown to be very efficient in terms of global model accuracy and transmission costs. However, its adoption in some use cases is slowed down by its insufficient security and privacy guarantees. Indeed, even if the data are not explicitly exchanged, privacy attacks on the global model are still possible. In particular, the central aggregator not only knows the architecture of the local models, but can also deduce information about their internals, such as weights or gradients. Moreover, this aggregator is a single point of failure that can be subject to intentional attacks or accidental failures.
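For concreteness, the weighted average mentioned above is the standard FedAvg rule of [MMR+17] (written here for illustration): with K participants, where participant k holds n_k samples and trained local weights w_k,

\[
w_{\mathrm{global}} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k, \qquad n = \sum_{k=1}^{K} n_k .
\]

Each local model thus contributes in proportion to the amount of data it was trained on.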
To address the problem of securing collaborative learning on highly sensitive data, we introduce MPC4SaferLearn: a privacy-preserving framework for collaborative learning with secure and robust decentralized aggregation [FK21]. The proposed architecture (illustrated in Figure 1) relies on an alternative to Federated Learning, the Private Aggregation of Teacher Ensembles (PATE) approach, which aggregates non-sensitive outputs of the local models instead of model internals [PAE+17]. In more detail, participants of the learning train a global model in a semi-supervised way by labelling a limited number of public data samples using their private classifiers. During aggregation, labels coming from different sources are combined: for each sample of the dataset, the aggregator selects the label that was chosen by the majority of private classifiers. Then, it adds noise to the aggregation in order to preserve privacy in situations where a consensus is not attained between participants. This considerably reduces the risks of privacy leakage. Moreover, such an approach is model-agnostic, and therefore data owners can collaborate even if their local models have different architectures.
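For illustration, the noisy majority vote described above can be sketched in a few lines of Python; this is a generic PATE-style aggregation, not the MPC4SaferLearn implementation, and the vote data is made up.

```python
# Illustrative sketch of PATE-style noisy label aggregation: each private
# classifier ("teacher") votes, Laplace noise is added to the vote counts,
# and the noisy majority label is released.
import numpy as np

def noisy_aggregate(teacher_votes: np.ndarray, n_classes: int, epsilon: float) -> int:
    """teacher_votes: array of class labels, one per private classifier."""
    counts = np.bincount(teacher_votes, minlength=n_classes).astype(float)
    counts += np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=n_classes)
    return int(np.argmax(counts))

# e.g. 10 data owners labelling one public sample in a 3-class problem
votes = np.array([0, 0, 1, 0, 2, 0, 0, 1, 0, 0])
print(noisy_aggregate(votes, n_classes=3, epsilon=0.5))  # almost surely 0
```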
Our main contribution is that, in order to protect the learning against intentional attacks or accidental failures, we replace PATE's single aggregator with distributed aggregation using Multi-Party Computation (MPC), ensuring both confidentiality and correctness of the processing. MPC is a set of cryptographic techniques that enables a group of 'parties' to collaboratively perform a computation, even if they do not fully trust one another. More precisely, each party is assumed to hold some private data that it does not want the other parties to learn. With MPC, the parties can use their private data in a computation in such a manner that no party learns anything other than the computation result. An important benefit of MPC is that it does not require the existence of a trusted third party. In MPC4SaferLearn, the aggregation of the outputs of the local models is distributed between two or more dedicated aggregator servers that run an MPC protocol. These aggregator servers can be selected among the data owners or be independent entities (as presented in Figure 1). The computation is secure unless an adversary manages to corrupt all of the servers involved in the MPC protocol.
Figure 1. High-level overview of the proposed architecture with three data owners (participants) and two aggregators. An MPC protocol based on "secret sharing" is used to secure the aggregation: participants send their inputs in the form of shares (an input can be recovered only if all of its shares are gathered) and the result is computed in a distributed manner on these shares. Privacy of the inputs is preserved unless an adversary corrupts all of the aggregator servers.
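To make the secret-sharing idea of Figure 1 concrete, the sketch below shows additive secret sharing of participant inputs between two aggregators. It is a didactic example of the primitive, not the MPC protocol used in MPC4SaferLearn.

```python
# Minimal sketch of additive secret sharing: each participant splits its input
# into shares, one per aggregator; no single aggregator learns any input.
import random

P = 2**61 - 1   # arithmetic is done modulo a public prime

def share(secret: int, n_aggregators: int) -> list[int]:
    """Split a secret into n additive shares that sum to it (mod P)."""
    shares = [random.randrange(P) for _ in range(n_aggregators - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# Three participants each share a private vote count between two aggregators.
inputs = [4, 7, 2]
all_shares = [share(x, 2) for x in inputs]

# Each aggregator sums only the shares it received ...
partial = [sum(s[i] for s in all_shares) % P for i in range(2)]
# ... and only the recombined partials reveal the aggregate, never an input.
print(sum(partial) % P)   # 13
```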
To summarize, we propose MPC4SaferLearn: a framework for privacy-preserving collaborative learning that secures aggregation by distributing it over multiple participants or external providers. Participants of the training consortium do not have to rely on any trust assumptions regarding each other. They can easily balance privacy, security, and performance requirements by adjusting the number of devices over which the computations are distributed or by switching between the multiple available MPC protocols. Our first experimental results show that MPC4SaferLearn is well suited even for very sensitive applications (such as healthcare or military) and thus could be successfully applied in the context of Earth observation data processing.
Bibliography
[FK21] Flory, P.-E. and Kapusta, K. "MPC4SaferLearn: Privacy-Preserving Collaborative Learning with Secure and Robust Decentralized Aggregation". CAID (2021).
[MMR+17] McMahan, H. B., Moore, E., Ramage, D., Hampson, S. and Agüera y Arcas, B. "Communication-Efficient Learning of Deep Networks from Decentralized Data." AISTATS (2017).
[PAE+17] Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I. and Talwar, K. "Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data." ICLR (2017).