Solving a hard problem – retaining privacy in Big Data Analytics

TL;DR: Big Data Analytics while retaining privacy is solved using distributed and encrypted processing on the Enigma network [1]. The article describes the fundamental principles of the involved cryptography, Enigma’s architecture and use cases in public administration.

Authors: Johann Höchtl (@myprivate42) & Bettina Rinnerbauer (@BRinnerbauer)

Many call Big Data a disruptive technology with the potential to fundamentally change the way we live, work and think [2]. While the fundamentals of Big Data were already fleshed out in 2001 by Douglas Laney, without actually coining the term [3], it is the blending together of ubiquitous Internet access, commoditized distributed computing hardware and new breeds of algorithms that actually enables what is called Big Data today. Companies like eBay and Amazon invest massive efforts into further improving their predictive abilities, analyzing data in or near real time to better serve and cross-sell to their customers and crowd out competitors. According to Forbes Tech, immense benefits of Big Data lie in the combined analysis of internal data sources and data sources external to an institution’s borders [4]. The recent appearance of Open Data as an additional source in the analytic arena will further nourish the potential benefits of Big Data.

Where there is light, there is shadow

However, these developments come at a cost. Some predict that the future of selling will be individualized. xRM [5] analytics combined with recommender systems have contributed tremendously to personalizing the shopping experience; the next step will be privatized prices that take individual price elasticity into account. The same product at the same store, bought at the same time, will have a unique price per customer, depending on the Big-Data-mined readiness to buy the product at the given price. This inevitably raises questions of market fairness [6].

Privacy concerns holding back widespread Big Data application

Another issue is privacy. Many of the benefits of Big Data analytics lie in the analysis of personalized data correlated with behavioral patterns. For a prolonged period of time, taking decisions affecting individuals based on large-scale and real-time private data has been the core business of intelligence agencies. For example, correlating intelligence data with public information like pictures taken by random pedestrians was key to identifying the Boston bombers [7]. However, as storage capacity, data transmission and computing power get cheaper following Moore’s and Gilder’s laws [8], public agencies worldwide have started investigating potential applications of Big Data methodologies beyond intelligence services. The potentials of Big Data in public administration include, among others, individualized citizen services, fraud reduction and increased overall performance [9].

Personal Data Protection in Austria

Privacy concerns in public administration are not equally dispersed around the globe. The US has no comprehensive data protection law. On the federal level, the US maintains a sectoral approach towards data protection legislation, where certain industries are covered and others are not. At the state level, most states have enacted some form of privacy legislation [10]. This is in strong contrast to the EU, which has a personal data protection framework, formed especially by Directive 95/46/EC [11], which provides the Member States with a minimum standard to be transposed into national law. This leads to an EU-wide harmonization of data protection law to a certain extent. In contrast to Art 2 of the aforementioned Directive, which defines “personal data” as any information relating to an identified or identifiable natural person (“data subject”), the Austrian Data Protection Act [12] protects the personal data of natural persons as well as of legal persons (cf. § 1 section 1 and § 4 number 1 and 2 DSG).

The requirements for the use of “sensitive data” are stricter than those for the use of personal data as defined above. According to § 4 number 2 DSG, sensitive data is data of a natural person with specific content, such as health data, ethnic origin or political opinion.

From an Austrian legal point of view, the use of personal data is bound to certain requirements. According to § 1 section 1 DSG, everybody has, especially with regard to the right to respect for private and family life, the right to secrecy of the personal data concerning her/him, as far as she/he has an interest deemed worthy of protection.

One possibility to make use of personal data is having the consent of the data subject. This declaration of intent has to be given without coercion and with the data subject knowing that she/he is giving consent for the specific case and being informed about the factual situation (§ 4 number 14 DSG). In the context of Big Data analytics, obtaining valid consent will often be a huge challenge, because it requires knowing the purpose of the use of the data already at the time of collection. This follows from the principle of Art 6 number 1 b of Directive 95/46/EC, transposed into § 6 number 2 DSG, that personal data must be collected for specified, explicit and legitimate purposes and not further processed in a way incompatible with those purposes.

Therefore, when first processing personal data, it would be necessary to know, e.g., how the data will be used further for analysis in order to obtain a valid consent [13].

“Data” (“personal data”) is defined by § 4 number 1 DSG as information about data subjects who are identified or identifiable. Data is “indirectly personal” for the controller (the one deciding to use the data), the service provider (e.g. the one performing the actual analysis) or the recipient of a transmission (e.g. the recipient of the analysis results) if the relation to a person cannot be established by the respective controller, service provider or recipient with legally permitted means. Consequently, data is not considered personal at all when nobody can link it to a specific person.

“Indirectly personal” data may also be called “pseudonymized”, while data without any relation to a person is also called “anonymized” [14].

§ 1 DSG regulates the right to secrecy of personal data as far as there is an interest deemed worthy of protection, and further prescribes that such an interest is excluded when there is no right to secrecy because of the general availability of the data or the lack of traceability of the data to the data subject. Data without a link to a specific person, i.e. anonymized data, is thus out of the scope of the DSG. As a consequence, no requirements of the DSG have to be met when using such data.

Keeping in mind that the idea of Big Data Analytics is to serve as a means to make predictions, to analyze the decisions of the past or to build a basis for decision-making, not only by answering questions but much more by interpreting the outcome of combined data, working with non-personal data would be a suitable solution. To obtain anonymized data, it would have to be assured that no reference or link to a specific person can be made by anyone, especially not by the controller, who decides to use the data, or by the service provider, who is mandated by the controller and shall use the data solely to produce a new work.

Approaches towards data privacy

Analysis of data would not have to meet the requirements of the DSG if it happened after anonymization. Data is effectively anonymized if it is impossible to derive the person the data refers to, either directly or by cross-referencing it with other data. Pseudonymization, by contrast, would lead to the application of the DSG.

In “An evaluation of technologies for the pseudonymization of medical data”, Neubauer and Kolb discuss different approaches to the pseudonymization of medical data. The Peterson approach, for instance, delegates questions of privacy to the data owner: it is up to the data owner to decide, for every use case, to whom to grant access to his personal data. While this approach assures privacy in organizational terms, it does not scale to automated IT processes. In this regard, the Pommerening approach is more promising. At its core, a pseudonymization service hashes the personal identifier (PID) of the person whose data is to be analyzed, while the medical data is encrypted with the public key of the user or service that actually carries out the analysis, so the analyst can work with the data without being able to identify the person. The actual identity is kept by the PID service so that the data owner (patient) can be notified in case medical irregularities are noticed.
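The core of this pseudonymization step can be sketched in a few lines. The following is a minimal illustration assuming a keyed hash (HMAC) as the PID transformation; the key, record fields and identifiers are hypothetical, and the PID service’s re-identification table as well as the public-key encryption of the medical payload are only indicated in comments:

```python
# Minimal sketch of a Pommerening-style PID transformation, assuming a keyed
# hash (HMAC) as the pseudonymization step. Key, record fields and identifiers
# are hypothetical; the re-identification table kept by the PID service and
# the public-key encryption of the payload are only indicated.
import hashlib
import hmac

PID_KEY = b"secret key held only by the pseudonymization service"

def pseudonymize(pid: str) -> str:
    """Derive a stable pseudonym; without PID_KEY it cannot be reversed."""
    return hmac.new(PID_KEY, pid.encode(), hashlib.sha256).hexdigest()

record = {"pid": "AT-1234-250578", "diagnosis": "J45.0"}
released = {
    "pseudonym": pseudonymize(record["pid"]),   # the analyst sees only this
    "payload": record["diagnosis"],             # would additionally be encrypted
                                                # with the analyst's public key
}
print(released)
```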

Health care is an excellent example for data protection, as the data in question is typically sensitive and the data owners have elevated needs for keeping their identity secret. However, both the Peterson and the Pommerening approach require the personnel operating the pseudonymization service to play fair. A better methodological approach would be to algorithmically assure that analytical results can never be matched back to the underlying data. Based on the theoretical challenge of two millionaires wanting to know which one is richer without revealing their respective wealth to each other, a solution to this problem was first proposed in 1982 by Yao in “Protocols for secure computations”. His solution was theoretically proven to effectively guarantee the separation of data from results, but actual operations on the data may be up to a million times slower than direct processing [15], making his suggested methodology effectively prohibitive for any form of real-time processing.

Homomorphic encryption

Homomorphic encryption provides a way to encrypt data such that it can be shared with a third party and used in computations without ever being decrypted; the analysis results are sent back to the data owner, who is the only one able to make use of the computation (analysis) results of others.

 Principle of homomorphic encryption explained with a use case


  1. A data owner decides to give away data for analysis, yet the data contains personal information. He therefore encrypts the data using a homomorphic encryption scheme and provides a detailed description of the data entities to a data analyst (a toy sketch of such a scheme follows this list).

  2. The data analyst receives encrypted data and, based on the detailed data description, performs operations on that data. As the data he received is encrypted, he is not able to inspect the data and the results of calculations on the encrypted data do not make sense to him either. The data analyst returns the results to the data provider / owner.

  3. The data provider decrypts the results and

  4. correlates them to the original calculations expressed by the calculation algorithm.
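The additively homomorphic flavor of this idea can be made concrete with the Paillier cryptosystem. The following is a toy sketch, deliberately insecure (tiny fixed primes, no padding; real deployments use primes of 2048+ bits), showing the steps above: the owner encrypts, the analyst computes on ciphertexts he cannot read, and only the owner can decrypt the result. It assumes Python 3.8+ for modular inverses via pow():

```python
# Toy Paillier cryptosystem (additively homomorphic). Illustration only:
# tiny fixed primes, no padding, not secure.
import random
from math import gcd

def generate_keys():
    p, q = 293, 433                                  # toy primes
    n = p * q
    lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lcm(p-1, q-1)
    mu = pow(lam, -1, n)                             # valid because g = n + 1
    return n, (lam, mu, n)                           # public key n, private key

def encrypt(n, m):
    r = random.randrange(1, n)                       # fresh randomness per ciphertext
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return (pow(c, lam, n * n) - 1) // n * mu % n

# Step 1: the owner encrypts the private values 20 and 22.
n, priv = generate_keys()
c1, c2 = encrypt(n, 20), encrypt(n, 22)

# Step 2: the analyst multiplies ciphertexts he cannot read;
# under the hood this adds the plaintexts.
c_sum = (c1 * c2) % (n * n)

# Steps 3-4: only the owner can decrypt and interpret the result.
print(decrypt(priv, c_sum))                          # -> 42
```

Multiplying two Paillier ciphertexts adds the underlying plaintexts, which is what lets the analyst compute an aggregate in step 2 without ever seeing the inputs.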

In 2015, Guy Zyskind, Oz Nathan (MIT Media Lab) and Prof. Alex Pentland (MIT Connection Science and Human Dynamics Lab, former academic head of the MIT Media Lab) proposed a practical solution to a long-standing problem in computer science: distributed encrypted computation at an actually usable speed. They created a working realization of the principles of homomorphic encryption called Enigma [16], named after the German WWII encryption machine [17]. Unlike the solution provided by Yao in 1982 or the improved computation schemes by IBM in 2009, their solution performs “only” around 100 times slower than operations on unencrypted data [18], compared to the initial scheme by Yao, which, under adverse conditions, yields data analysis results up to a trillion times slower.

Applied homomorphic encryption – Enigma’s architecture

Enigma doesn’t require a centralized architecture: data is either stored in a blockchain [19] or shared among computation nodes using Kademlia’s distributed hash tables (DHTs) [20] and referenced by pointers in the blockchain.
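How shares find their storage nodes can be illustrated with Kademlia’s XOR metric. A minimal sketch, with hypothetical node names, an illustrative key and a replication factor of 2:

```python
# Minimal sketch of Kademlia-style share placement: a share is stored on the
# nodes whose IDs are closest to its key under the XOR metric. Node names,
# the key and the replication factor are illustrative.
import hashlib

def node_id(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

nodes = {name: node_id(name) for name in ["node-a", "node-b", "node-c", "node-d"]}
key = node_id("pointer-to-encrypted-share")

# XOR distance ranks the candidates; the closest k nodes store the share.
closest = sorted(nodes, key=lambda name: nodes[name] ^ key)[:2]
print(closest)                           # the two nodes that would hold the share
```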

 Data storage on the Enigma network


  1. A data owner O wants to off-load heavy-duty Big Data Analytics computations to the Enigma network. He sets up / obtains an Enigma script which analytically describes the computation and

  2. uploads input data to the DHT. This is done seamlessly by splitting the input data into shares that are distributed to the network.

  3. The Enigma interpreter distributes computational work to Enigma nodes and uses the public ledger (blockchain) to announce computations and pointers to encrypted data.

  4. Node A performs the computation and

  5. generates a result, which it

  6. stores on the public ledger.

  7. The data owner can read out the encrypted intermediary results and distribute them to other nodes or assemble the final result.

Analytic operations are performed on encrypted, DHT-distributed data sets off the blockchain and are enabled by a Turing-complete, domain-specific scripting language (DSL), modeled after popular scripting languages and offering the ability to annotate private data that should never be decrypted. Analysis of the actual data is off-loaded and distributed by an interpreter onto multiple nodes, which perform calculations in parallel. The more nodes are involved in the computation network, the better the computation preserves privacy and correctness, a feature called secure multiparty computation (sMPC). This means that no single party involved in the computation ever has access to the data in its entirety, but only to seemingly random, encrypted pieces of it.
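The essence of sMPC can be illustrated with additive secret sharing, one of the simplest constructions behind such schemes. In this minimal sketch (parameters illustrative; no networking or malicious-security machinery), three simulated nodes jointly compute a sum while each holds only random-looking fragments of the inputs:

```python
# Minimal sketch of secure multiparty computation via additive secret sharing.
import random

PRIME = 2**61 - 1                        # all arithmetic happens mod a prime

def share(value, n_parties=3):
    """Split a value into n additive shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

shares_a = share(100)                    # private input 1, scattered over nodes
shares_b = share(250)                    # private input 2

# Each node adds the fragments it holds -- locally, on meaningless numbers.
local_sums = [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]

# Only the recombination of all local results is meaningful.
print(sum(local_sums) % PRIME)           # -> 350, yet no node saw 100 or 250
```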

 Distributing work on the Enigma network

A data owner O is seeking in-depth analysis of her/his data sources.

  1. She/He provides a detailed description of her/his data entities to an analyst A. A designs analysis algorithms based on the generic description he received, like calculating the average age of persons or clustering them into groups by their service consumption habits, using Enigma’s domain-specific scripting language (a hypothetical script sketch follows this list). The analyst returns the set of generic analysis specifications (Enigma scripts) to the data owner O.

  2. The data owner is now able to insert his real-life data into the calculation templates, but the calculations are computationally demanding, so he seeks to off-load them to other Enigma nodes.

  3. The Enigma interpreter will encrypt data marked as private and

  4. distribute the data in a secure DHT storage scheme across Enigma computation nodes. Pointers to the data, together with proofs that the computation occurred and executed correctly, are stored in the blockchain. Nodes A, B and C participate in the computation, which is performed on encrypted data. The Enigma interpreter at the data owner O controls the computation: it checks whether nodes behave correctly and distributes further work to well-behaving nodes as intermediary results become available:

    1. Node A calculates part of the analysis and results are stored encrypted on the blockchain.

    2. Node B calculates part of the analysis.

    3. Node C calculates part of the analysis.
      Calculations on nodes A, B and C happen in parallel and, for security reasons, partly redundantly.

  5. The data owner O assembles the results, correlates them to the analysis descriptions and interprets them.
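To make step 1 more tangible: the DSL is modeled after popular scripting languages, so a hypothetical analysis script might read like the Python sketch below. The syntax is illustrative, not the actual Enigma grammar, and private_input and publish are stand-in stubs so the sketch runs:

```python
# Hypothetical Enigma-style analysis script (illustrative syntax, not the
# real grammar). private_input and publish are stubs; in Enigma they would be
# interpreter primitives operating on secret shares, never on plaintext lists.

def private_input(name):
    """Stub: would fetch secret-shared data from the DHT by reference."""
    return {"citizen_ages": [34, 51, 28, 62, 45]}[name]

def publish(value):
    """Stub: would reveal only the final aggregate to the data owner."""
    print("published result:", value)

ages = private_input("citizen_ages")     # annotated private: never decrypted
avg_age = sum(ages) / len(ages)          # lowered to sMPC operations
publish(avg_age)                         # -> 44.0, the only value revealed
```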

Correctness of computations is attained by publicly verifiable private contracts (called as such because the contract’s content, the computation/analysis, is not publicly inspectable, in contrast to smart contracts [21]), enforced by the SPDZ protocol [22] using a message authentication code (MAC) together with the commitment in the publicly verifiable blockchain. This guarantees correctness even when the majority of participants behaves dishonestly.
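The MAC mechanism can be sketched in miniature. In the simplified sketch below, every secret x is stored together with shares of alpha * x for a global MAC key alpha, so a node that tampers with its share is caught when the value is opened (in real SPDZ, alpha is itself secret-shared and checks are batched; parameters here are illustrative):

```python
# Simplified sketch of SPDZ-style MAC checking.
import random

PRIME = 2**61 - 1
alpha = random.randrange(1, PRIME)       # global MAC key

def split(v, n=3):
    s = [random.randrange(PRIME) for _ in range(n - 1)]
    return s + [(v - sum(s)) % PRIME]

def authenticated_shares(x):
    """Pairs of (value share, MAC share) handed to the three nodes."""
    return list(zip(split(x), split(alpha * x % PRIME)))

shares = authenticated_shares(42)

# A malicious node tampers with its value share before the opening phase.
x0, m0 = shares[0]
shares[0] = ((x0 + 7) % PRIME, m0)

opened = sum(s for s, _ in shares) % PRIME
mac = sum(m for _, m in shares) % PRIME
print("verified:", alpha * opened % PRIME == mac)   # -> False: cheating detected
```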

Privacy is ensured by somewhat homomorphic encryption (SHE) and a linear secret sharing scheme (LSSS), where the secret s is divided into n shares and t+1 parties are required to reconstruct it [23].
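A minimal sketch of such a (t+1)-of-n scheme, using Shamir’s construction over a prime field, a common LSSS instance (the field prime, threshold and share count below are illustrative, not Enigma’s actual parameters):

```python
# Minimal sketch of (t+1)-of-n secret sharing (Shamir's construction).
import random

PRIME = 2**61 - 1

def split(secret, n_shares, t):
    """Hide the secret in a random degree-t polynomial; return n points."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t)]
    f = lambda x: sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n_shares + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from t+1 shares."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split(123456789, n_shares=5, t=2)
print(reconstruct(shares[:3]))   # any t+1 = 3 shares -> 123456789
print(reconstruct(shares[:2]))   # t = 2 shares reveal nothing useful
```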

Participating in Enigma – Incentives for nodes

Despite advances over early implementations of homomorphic encryption, operations on the Enigma network are still compute-intensive and require coordination. Nodes are therefore encouraged to participate by receiving Bitcoins for performed operations. In order to participate as an Enigma node, Bitcoins have to be deposited; if other nodes detect malicious operations such as correctness breaches, the deposit is withdrawn and shared among the benign nodes.

Big Data Analytics, the Public Administration, Privacy and Enigma

The bits and pieces of somewhat homomorphic encryption, distributed computing and publicly verifiable yet private contracts solve the problem of Big Data Analytics in public administration in that:

  1. Analysis data is only meaningful to the data owner. This obviates the need for anonymization or pseudonymization of personal data, which is time-consuming; in addition, pseudonymization is potentially reversible [24];

  2. Computation is distributed to many nodes. BDA is (often) a time-intensive operation; distributing operations to many nodes can leverage the benefits of distributed/grid/cloud computing;

  3. It supports open innovation. Even the most personal and unabridged data sets can be released, encrypted as described, as open data to the public. Analysis can be performed by anyone who has been provided with the metadata descriptions of the affected data sets. Still, analysis results will only make sense to the data provider.

Discussion and Next steps

The Enigma system is well suited as the underlying principle for performing (big) data analysis in public administration. In times of public austerity, Enigma enables using large-scale IaaS commodity hardware provided by well-known cloud operators to perform heavy-duty BDA while retaining privacy. To facilitate this operational model, Enigma would profit from being released as a Docker [25] or Docker-like image, quickly available and easily deployable for public servants. Even though this technology is relatively new, data brokers can assist public administrations in deploying Enigma nodes to perform distributed computing in a totally safe way on commodity IaaS hardware. This would also facilitate shifting the traditional role of the public-sector IT department away from IT operator towards data analyst and help leverage the heralded benefits of BDA [26].

However, as Enigma-calculated analysis results make sense only to the data owner/provider, the public will be less interested in devoting leisure time to picking up freely provided data to identify as yet undiscovered patterns. In the spirit of an open data challenge [27], performing an open data analysis challenge would make sense. For this to be effective, high-quality descriptions of public administrations’ data entities are required, so that analysis descriptions using Enigma’s scripting language can be crafted without actual access to the underlying data. An intermediate third party, trusted by both the public and the administration, would be required to judge contest winners. An additional incentive in this model would be that ownership of an Enigma script could be declared (e.g. by signing the script with a key); if the key to verify ownership during computation is available to the Enigma interpreter, Bitcoins in excess of Enigma’s built-in incentive mechanism get transferred to this account. Bitcoins (or any cryptocurrency, as remuneration is actually independent of the underlying workings of Enigma) would be the currency to compensate the data scientist who crafted the analysis (did the mental work) and those who donate computational power.

To bring Enigma technology to the masses, next steps would be:

  • To code an Enigma client which can be installed on typical end user hardware to perform Enigma computations and to receive compensation for performed calculations;

  • After a critical mass of installed Enigma nodes has become available, to design a research project which highlights the benefits of distributed BDA for the public administration;

  • To provide Enigma powered Docker images so institutions can participate in the Enigma network.

Unlike other P2P networks, which have to rely on many participants until they reach economies of scale, Enigma does not suffer from the cold-start problem: the very first participant already profits from Bitcoins transferred for donating computational hardware.

Enigma offers great potential to become an effective means for distributed BDA while retaining privacy. A detailed analysis of Enigma against the legal data protection framework is necessary to classify the data used within Enigma in the sense of data protection law. From this classification, legal requirements can be derived which have to be met by natural or legal persons acting within the Enigma network. For now, it is postulated that the concept of Enigma suggests that under no circumstances can the data subject be identified by those acting as service providers. Still, for the public administration the data remains personal, which leads to the conclusion that further clarifications are needed. This article tried to make a minor contribution to how the public administration could profit from Enigma and exhaust its full potential.

Update 2015-07-29: Thanks to comments from @GuyZys, co-author of the paper describing the technical details, we clarify that Enigma uses somewhat homomorphic encryption (SHE), i.e. it implements the principles of HE yet uses secure multiparty computation (sMPC).

Footnotes

[2] V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, 1st edition. Boston: Eamon Dolan/Houghton Mifflin Harcourt, 2013.

[3] D. Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety”, META Group, Stamford, Connecticut, 949, Feb. 2001.

[6] “Big Data: Seizing Opportunities, Preserving Values”, The White House, Washington D.C., Feb. 2015, p. 8.

[7] Y.-R. Lin and D. Margolin, “The ripple of fear, sympathy and solidarity during the Boston bombings”, EPJ Data Science, vol. 3, no. 1, pp. 1–28, 2014.

[9] C. Yiu, The big data opportunity: Making government faster, smarter and more personal. London: Policy Exchange, 2012.

[11] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, OJ 23 November 1995, L 281/31.

[12] Federal Act on the Protection of Personal Data (Data Protection Act 2000 – DSG 2000), BGBl I 165/1999, version BGBl I 83/2013 (in the following abbreviated as DSG); a non-authentic English translation is available at the website of the data protection authority: http://www.dsb.gv.at/DocView.axd?CobId=41936

[13] Cf. Knyrim, R., Big Data: datenschutzrechtliche Lösungsansätze, Dako 2015/35, 60.

[14] Terms used e.g. by Knyrim, R., Big Data: datenschutzrechtliche Lösungsansätze, Dako 2015/35, 59.

[16] http://enigma.media.mit.edu/ (retrieved July 22, 2015)

[17] http://ciphermachines.com/enigma (retrieved July 14, 2015)

[22] I. Damgård, M. Keller, E. Larraia, V. Pastro, P. Scholl, and N. P. Smart, “Practical Covertly Secure MPC for Dishonest Majority – or: Breaking the SPDZ Limits”, Cryptology ePrint Archive, Report 2012/642, 2012.

[24] Y.-A. de Montjoye, L. Radaelli, V. K. Singh, and A. “Sandy” Pentland, “Unique in the shopping mall: On the reidentifiability of credit card metadata”, Science, vol. 347, no. 6221, pp. 536–539, Jan. 2015.

[26] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Hung-Byers, “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute, Washington D.C., Jun. 2011 [Online]. Available: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation. [Accessed: 10-Aug-2012]
