Data Marketplace - The proper way to access your data


There is an up-and-coming notion of decentralized marketplace systems; however, the decentralization does not really refer to the data itself.

Decentralizing privately owned data is possible but not advisable, as it would multiply the points of failure for its privacy by a factor of at least 3 (one being the master datasource of the enterprise and, for the sake of decentralization, at least two more being the nodes it replicates to).

But let's not be nihilists.
In theory, the marketplace can sit on a decentralized network that acts as a discovery system and/or a dataset access-authority system.

In practice, however, this has nothing to do with accessing the data in a decentralized manner.
Datasets owned by an entity are that entity's data alone; any dispute should be treated as a duplication, requiring proof of identity to be provided rather than relying on a first-come-first-served approach.

As for the access system, the decentralized approach mainly refers to the geo-availability of the system rather than to the location of the storage or the code it relies on.

So how?

  • The marketplace stays centralized.
  • The data set stays centralized.
  • The communication becomes distributed.

There is no need for the service holding the data to be spread over a network. Communication between users (data providers, data consumers, third-party services) can happen through an API-like interface that facilitates all the needs, such as registration, account/dataset governance, and distribution.

In essence, the marketplace will be a broker. Users will use the broker for all the actions defined within the scope of the marketplace.
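A minimal in-memory sketch of what such a broker's surface could look like, assuming three hypothetical operations: provider registration, dataset registration, and dataset discovery. All names here are illustrative, not a real API.

```python
class MarketplaceBroker:
    """Central broker: the single point users talk to."""

    def __init__(self):
        self.providers = {}  # provider_id -> connection details (broker-only)
        self.datasets = {}   # dataset_id -> provider_id

    def register_provider(self, provider_id, connection_info):
        # Connection details never leave the broker.
        self.providers[provider_id] = connection_info

    def register_dataset(self, dataset_id, provider_id):
        if provider_id not in self.providers:
            raise KeyError(f"unknown provider: {provider_id}")
        self.datasets[dataset_id] = provider_id

    def list_datasets(self):
        # Consumers can discover datasets without learning who provides them.
        return sorted(self.datasets)
```

Because every action goes through one class, governance rules (who may register, who may consume) can be enforced in a single place.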

Data Access

Data access will be done through the broker. To make this possible, the broker will know how to connect to the provider's datasource (which could be a SQL database, an FTP server, or even an API).

When another entity requests the dataset, the marketplace will be responsible for acquiring it from provider A and passing it on to consumer B.

In this approach, the source is completely hidden from the consumer, which dramatically increases privacy.
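The relay step can be sketched as follows, where `fetch_from_provider` is a hypothetical stand-in for a real SQL/FTP/API connector and `registry` is the broker's private mapping of datasets to connection details:

```python
def fetch_from_provider(connection_info, dataset_id):
    # Hypothetical stand-in for a real SQL/FTP/API connector.
    return [{"dataset": dataset_id, "row": 1}]

def serve_dataset(registry, dataset_id):
    """Relay a dataset to a consumer without exposing its source."""
    connection_info = registry[dataset_id]  # known only to the broker
    return fetch_from_provider(connection_info, dataset_id)
```

The consumer receives only rows; the connection details stay inside the broker.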

Opportunities through data flow

In addition, this data flow creates the opportunity for computation to be done on the data in transit.
Such customization should only involve structural, cosmetic, or volume changes that do not alter the substance of the data.

Data Sanitization/Anonymization

A crucial point, especially after the new European privacy law (the GDPR), is that smart anonymization of datasets can be done on the fly, removing any need for filtering on the data provider's side.
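One way this could look in the broker, as a sketch: PII fields are hashed or dropped as each row streams through. The field lists here are a hypothetical policy, not part of any standard.

```python
import hashlib

PII_FIELDS = {"name", "email"}   # hypothetical policy: pseudonymize these
DROP_FIELDS = {"ssn"}            # hypothetical policy: remove these outright

def anonymize_row(row):
    """Apply the policy to a single row on the fly."""
    out = {}
    for key, value in row.items():
        if key in DROP_FIELDS:
            continue  # field is removed entirely
        if key in PII_FIELDS:
            # Replace the value with a truncated SHA-256 digest.
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out
```

Because the transformation happens in transit, the provider's source data stays untouched.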

Data set format customization

The datasource format might not be suitable for the consumer. Here is the opportunity to eliminate the need for custom parsers caused by incompatible data structures or file formats, by letting the consumer select the output format.
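A sketch of such a conversion step, assuming the provider's native format is CSV and the consumer selected JSON; the `convert` helper and its format names are illustrative:

```python
import csv
import io
import json

def convert(csv_text, output_format="json"):
    """Convert a CSV payload into the consumer's selected format."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if output_format == "json":
        return json.dumps(rows)
    raise ValueError(f"unsupported format: {output_format}")
```

The same hook could fan out to other formats (XML, Parquet, and so on) without the consumer ever writing a parser.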

Data set row level alteration

One last example may seem of very low importance, but its applications are endless: date formats.
Datasets originating from US regions will most likely use a different date format than EU ones.
This might appear minor, but it can cause serious logistical or computational errors.
Allowing the consumer to alter the format for their own use case, while keeping the source intact, is gold.