I am currently involved in a project to assess and improve the quality of (meta-)data on open data portals. My fine colleagues at the Vienna University of Economics and Business showcased the current state of affairs on PortalWatch and presented some yet-to-be-published statistics on the data quality of monitored portals. That day I learned that Socrata-powered open data portals store all their data in a proprietary data vault, as opposed to CKAN, where internal storage is optional. Storing data within one system has its advantages, like uniform access for analysis. But in the long run it is not such a great idea.
Neglecting Big Data (Principles)
A vast amount of the data available on data portals, despite recommendations, is still not API-based. This is unlikely to change significantly, given the recommendation of the W3C Data on the Web Best Practices working group to provide bulk downloads. While data storage costs are declining, largely following Moore’s law, data volumes keep increasing. In other words: data storage remains costly and remains an issue, as does availability, one element of data quality.
Storing data in a central place is awesome … unless it creates additional efforts. And this is my major gripe with the Socrata approach as it violates one important principle:
Public Sector Information and Open Data are released by the public administration while pursuing their regular tasks.
This doesn’t hold true for the many Socrata-based data portals that store data internally: measures have to be in place to first export data from backend systems, just to shovel that data into yet another data coffer. This doesn’t make much sense, as data gets duplicated, requires additional effort and is prone to errors. Additionally, it neglects the Big Data principle of integrating heterogeneous data sources without having to ETL them first. Big Data, among many things, is about distributed, heterogeneous data.
Proven solutions to data distribution
Currently, government data providers face multiple challenges, as data should be released
- to increase transparency and accountability;
- free of charge;
- in a cost effective manner;
- in a sustainable manner.
Those are conflicting goals, especially when the public administration has to provide the storage facilities. What if bulk data were instead stored on a public cloud?
First, let us think about the characteristics of the data provided: it is open data, so users are free to do with it whatever they want, no strings attached. As such, it shouldn’t be an issue to store that data outside the administrative premises, provided the data hosting service is reliable and can guarantee authenticity.
Federal hosting providers and computing centers provide reliability, as they guarantee security and bandwidth. Data integrity, an aspect of security, is required to ascertain that the data provider actually is who it claims to be. This can be assured by cryptographic hashes calculated over the provided data and released on an https-secured web site. Concerning bandwidth, only a fraction of the data is likely to be frantically downloaded, yet the provider hardly knows which in advance and thus has to serve all the data at high bandwidth. A dilemma – but one with a cure.
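Publishing such integrity hashes is straightforward. A minimal sketch in Python (function name and chunk size are my own choices, not taken from any portal software) that computes the SHA-256 digest a provider could publish on its https site:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 16) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks
    so even multi-gigabyte bulk data sets fit in constant memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A user who downloads the data from any mirror can recompute the digest and compare it against the value published on the authoritative, https-secured page.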
Use the DHT, Luke!
All of the mentioned requirements and issues are solved by load-balanced DHTs. A Distributed Hash Table is a data structure that provides a lookup service and can be used to build more complex services such as distributed file systems. Many peer-to-peer networks rely on DHTs to achieve:
- decentralized storage: DHTs store the data on many hosts rather than just one. As such, they are more robust to DoS attacks, as a malicious player would have to attack many of the nodes;
- load balancing and speed: every node participating in the network implicitly becomes a host. This has the nice property that the most requested data sets are also the most distributed and most shared ones. In other words, the most requested data sets implicitly receive the best bandwidth;
- security: communication between peers is encrypted using dynamically created private keys;
- stability: server maintenance doesn’t affect data sharing. With distributed and load-balanced data access, there is no single point of failure;
- monitoring: the intensity with which data gets shared can easily be monitored by providers and users. Download statistics give both providers and users a clue about usage patterns. The latter are otherwise often left alone when trying to obtain usage statistics.
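The core lookup idea can be sketched with a toy, Chord-style hash ring. This is a deliberate simplification with invented names (`ToyDHT`, `ring_id`); production DHTs such as Kademlia add routing tables, replication and node churn handling on top of it:

```python
import hashlib
from bisect import bisect_left

def ring_id(name: str) -> int:
    """Map a node name or data-set key onto the hash ring via SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

class ToyDHT:
    """Chord-style toy DHT: a key is owned by its successor node,
    i.e. the first node whose ring id is >= the key's ring id."""
    def __init__(self, nodes):
        self.ring = sorted((ring_id(n), n) for n in nodes)

    def lookup(self, key: str) -> str:
        """Return the node responsible for `key`, wrapping around the ring."""
        idx = bisect_left(self.ring, (ring_id(key),))
        return self.ring[idx % len(self.ring)][1]
```

Because keys spread uniformly over the ring, adding nodes spreads both storage and request load, which is exactly the load-balancing property listed above.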
Alternatives to consider
The most widespread DHT-based implementation is the BitTorrent network. While BitTorrent suffers from a bad reputation, as it is also used to illicitly share copyrighted material, the algorithm is robust, widely adopted and proven at distributing very large files. For example, Twitter and Facebook use the BitTorrent protocol to distribute software within their data centers.
Using more intelligent data distribution solutions than mere http(s)-powered downloads is gaining traction, even in government and government-near fields.
Potentials and next steps
Getting going with BitTorrent as a means to distribute open data is fairly simple. All the administration has to do is seed its data via one of the many open source BitTorrent implementations and announce the torrent link on its data portal. Instead of an http(s) link to the data source, the link would look like, for instance, `magnet:?xt=urn:btih:<info-hash>&dn=<file-name>` (an illustrative magnet URI, where the info hash identifies the data set), and users would be able to download the data by clicking on the link and adding the torrent to their own download client. Torrents can also contain collections instead of single files, which facilitates grouping semantically related files together into a collection without having to zip them up.
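How such a link comes about can be sketched in a few lines: a torrent's info hash is the SHA-1 of the bencoded info dictionary. The following is a minimal, single-file sketch (the helper names `bencode` and `magnet_link` and the piece length are my own; real tooling also writes a .torrent file with tracker URLs):

```python
import hashlib

def bencode(obj) -> bytes:
    """Minimal bencoder for the types appearing in a torrent info dict."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):  # keys must be byte strings, sorted
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in obj.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(obj)}")

def magnet_link(data: bytes, name: str, piece_length: int = 1 << 18) -> str:
    """Build a single-file info dict, hash it, and derive a magnet URI."""
    pieces = b"".join(hashlib.sha1(data[i:i + piece_length]).digest()
                      for i in range(0, len(data), piece_length))
    info = {"name": name, "length": len(data),
            "piece length": piece_length, "pieces": pieces}
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return f"magnet:?xt=urn:btih:{info_hash}&dn={name}"
```

The resulting link is stable: anyone re-deriving it from the same data gets the same info hash, which is what makes a magnet link a verifiable pointer to a data set.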
No change to the current infrastructure would be required; all CKAN installations would support this data distribution concept without any changes. Providing a torrent is possible alongside the tried-and-tested way of sharing data by linking to resources residing on institutional web servers, giving the administration the opportunity to gather experience with this distribution scheme without disruptive changes.
Ideally, the administration itself would deploy a torrent tracker, which could be used to distribute all sorts of data, internal and external, at cloud scale. I repeat: Twitter and Facebook have been doing this for years, and they can tell stories about security, speed and distribution.
A glimpse into the Future
As an alternative to BitTorrent, data could be shared using the upcoming, yet still heavily developed, dat tool. In comparison to BitTorrent, dat was specifically designed to share data among distributed users, with a focus on recording changes to data while retaining data ownership and change history. This enables many interesting usage patterns. A user clicking on the authoritative resource link provided on an administrative data portal would be able to identify other users who made changes to that data set, provided they are willing to share their changes. The user would be able to see which changes other users made, and why. Visualizations of changes over time and many collaborative features become possible.
Further Reading and Links
How to create and share Torrents
BioTorrents: A File Sharing Service for Scientific Data
Academic Torrents for sharing papers and scientific data