Small-to-Medium Scale AI Data Brokerage - Gathering, Refining and Selling


Founder of Emerging.Technology
Jul 8, 2018

The gathering, refinement and sale of data by individuals and small groups is one of the most widely overlooked ways to generate an income online. This kind of activity is usually assumed to be the realm of large data brokers and the likes of Google and Facebook, or is simply unknown altogether.

It's common knowledge today that data is valuable - and not just a little valuable, but arguably one of the most valuable and abundant resources that can be monetised.

Particularly within the tech sector, data is the fuel behind many, if not almost all, of the amazing new advances we're hearing about all the time. This goes for all emerging technologies - blockchain and cryptocurrency, VR and augmented reality, the Internet of Things - but one industry in particular has an insatiable hunger for high-quality data: Artificial Intelligence.

AI is becoming ubiquitous, and this is a direct result of the oceans of data that can be mined, refined and fed into neural networks and other AI and machine learning models to create the kinds of applications that now outperform their human counterparts in a broader range of ways every day.

While this ETM (Emerging Tech Monetization) Method can be applied to many other technologies as well, it is particularly applicable to the field of AI because of the huge, ever-present demand for data.


The goal of this method is to find individuals, businesses and organisations that are prepared to pay in order to acquire a given dataset or type of data, and then to gather that data, refine it to the level that is required for the buyer's satisfaction, and then to sell it to them at a profit.

Step One: Identifying Demand

There are a wide range of different businesses, academic organisations and startups that are searching for the right kinds of data to use for their purposes.

These purposes may be to train a neural network at the core of an app that a startup is launching, to train a machine learning model for an academic paper by a team of grad students, or to create a tool that a business can use to better understand its customers by building a model based on data from similar kinds of consumers in different regions of the country.

While in many cases startup and academic teams will have the technical knowledge to gather and process their own data, there is still an opportunity to gather the required data on their behalf and make their project more efficient.

Beyond that, a team will often simply be unable to source the data they require themselves, and purchasing it from a smaller broker such as yourself may cost a fraction of what they would have to absorb by hiring a larger professional data broker.

The key is to:

  • identify places online where you can easily and quickly find requests for a given type of data​
  • understand if there is a specific type of data that is regularly in demand from a lot of different groups or projects​
  • have connections within organisations that regularly require data and that know to come to you in order to source it​

Data markets are a great way of finding demand for certain types of data, and here on Emerging Technology we provide a section in our data market specifically for finding requests for data at the Emerging Technology Data Requests section of the data market.

Here members are able to create listings for the kinds of data that they're searching for, and this provides a useful and fast way to be able to identify opportunities for profitable data collection.

Step Two: Assessing Sources

Once you know that there is an organisation or individual that will pay you if you can acquire a given type of data, the next step is to search for sources of that data and to ascertain whether it will be possible for you to gather it.

The complexity of this part of the process ranges from extremely easy all the way up to completely impossible, depending on the type of data needed, the level of refinement required, the cost of acquiring the data, and in some cases, the legality of acquiring it.

A big consideration at this point is whether you are able to acquire secondary data to supply to the client, or whether it will have to be primary data.

The difference between these two categorisations is that secondary data has already been gathered by someone else, and now just needs to be bought (or downloaded for free) and then sold. By contrast, primary data is data which you yourself will have to source, process, store and transfer in order to sell it.

In almost all situations, sourcing secondary data is faster, easier and less risky than having to acquire primary data, and often the value of this data can be comparable to primary data.

Where there can be a significant difference is when the secondary data is so trivial to acquire that its value is clearly less than other types of data - for example, selling someone the MNIST dataset would not be likely to generate much revenue, considering that it is widely available for free online. In fact, it is almost certain that this dataset has no monetary value (in its raw state at least).

The dataset that was gathered by Cambridge Analytica during their highly profitable scrape of Facebook user data contained up to 5,000 datapoints across millions of different users. While collecting this data turned out to be a huge mistake on the part of CA, the value of this data - especially to the political parties trying to use it to optimise their election marketing - would be significant, because of the rarity of being able to gather that kind of dataset combined with the potential value of using it.

These are the kinds of considerations that need to go into the process of planning how to source data after ascertaining demand for it.

Step Three: Data Gathering

Data gathering may either be the most complicated part of the process or the least, depending on the range of different factors mentioned in the previous section.

Once you know what you are trying to acquire, data gathering can typically be achieved in one of a number of different ways:

  • (Secondary) Downloading a dataset that is provided freely online​
  • (Secondary) Purchasing a dataset from a broker or other source at a price that is less than you intend on selling it for​
  • (Secondary) Scraping a dataset from a public webpage (consider the legality of this first however)​
  • (Primary) Pulling data from an API or websocket​
  • (Primary) Purchasing data from a business or user(s) at a price that is less than you intend on selling it for​
  • (Primary) Scraping raw data from a public webpage (consider the legality of this first however)​

The upside of this part of the process is that it will most likely be very obvious which method will be needed in order to gather the data. This can, however, also be the most time-consuming step, especially if a lot of coding is required in order to be able to scrape the required data or to pull it from either a REST API or websocket.
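As a minimal sketch of the API route, the snippet below pulls a page of JSON records from a REST endpoint and saves the raw result to disk. The URL here is entirely hypothetical - substitute whichever API you actually have access to, and check its terms of service before collecting anything from it.

```python
import json
import urllib.request

# Hypothetical endpoint -- replace with the real API you have access to,
# and check its terms of service before pulling data from it.
API_URL = "https://api.example.com/v1/trades?symbol=BTC-USD&limit=100"

def fetch_records(url: str) -> list:
    """Pull one page of JSON records from a REST endpoint."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def save_dataset(records: list, path: str) -> None:
    """Persist the raw pull so it can be backed up, cleaned and packaged later."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

# records = fetch_records(API_URL)
# save_dataset(records, "raw_trades.json")
```

Saving the raw pull immediately (before any cleaning) also gives you the backup copy that Step Four recommends keeping.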

If you have completed a number of jobs already, you may well have scripts and bots coded up that require only minor adaptation before they can be reused to gather a new dataset. This is why meticulously recording the work you've done previously and the code you have on hand can dramatically increase your efficiency.

Step Four: Data Processing

Once the data that you need has been acquired, it may need to be refined depending on the state of it when you receive it, as well as the requirements that a potential buyer may have given you.

If the data that you now have is within the scope of what you will need in order to be able to complete a successful sale, then you can skip this section now.

If you need to refine your data, there are a number of considerations to address beforehand, the most important being that you have a backup of the raw data you acquired before starting any work on it.

A backup can be made easily by copying the spreadsheet or file and keeping it in a safe part of your system. You can even store a copy on an external hard drive for extra safety of your precious data.
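One simple way to script this - a sketch, with the function name and backup folder being my own choices - is to copy the raw file into a timestamped backup directory before touching it:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_raw_data(src: str, backup_dir: str = "backups") -> Path:
    """Copy the raw file into a timestamped backup before any refinement."""
    src_path = Path(src)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src_path.stem}-{stamp}{src_path.suffix}"
    shutil.copy2(src_path, dest)  # copy2 also preserves file metadata
    return dest
```

The timestamp in the filename means repeated runs never overwrite an earlier backup, so you can always roll back to any stage of your refinement.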

Next you need to identify what the state of your raw data is, and which manipulations of it need to take place in order for it to get to the desired state you'll need in order to sell it.

There are a range of ways of refining your data, and these typically fall into one of two categories: cleaning and processing.

Cleaning Your Data

Cleaning your data is the process of modifying it in ways that don't alter the values stored within it; rather, it means removing any anomalies that should not be in the data, or reorganising the dataset to improve its structure.

Common ways of cleaning data might be by removing all blank cells or "null" values, removing duplicates, checking the spelling of words within the dataset, normalising the data, de-capitalising or capitalising the data, and removing any characters that might be undesirable such as symbols or punctuation.
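A few of those cleaning operations can be sketched in plain Python over a list of record dicts - dropping rows with blanks or nulls, lowercasing and stripping punctuation from text fields, and removing exact duplicates. The function name and the choice of operations are illustrative, not a fixed recipe:

```python
import string

def clean_records(records: list[dict]) -> list[dict]:
    """Drop rows with missing values, normalise text fields,
    and remove exact duplicate rows."""
    cleaned = []
    seen = set()
    strip_punct = str.maketrans("", "", string.punctuation)
    for row in records:
        # Drop rows containing blank cells or null values.
        if any(v is None or v == "" for v in row.values()):
            continue
        # Lowercase strings and remove punctuation characters.
        norm = {
            k: v.translate(strip_punct).lower().strip() if isinstance(v, str) else v
            for k, v in row.items()
        }
        # Skip exact duplicates of rows we've already kept.
        key = tuple(sorted(norm.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return cleaned
```

For example, `clean_records([{"name": "Alice!", "city": "NYC"}, {"name": "alice", "city": "nyc"}, {"name": "", "city": "LA"}])` keeps only a single normalised row, since the second is a duplicate after cleaning and the third contains a blank.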

This is a process that must be done carefully: with large datasets especially, it is easy to make a mistake that destroys the value of the data, and such errors can be very hard to detect after the fact.

Processing Your Data

Unlike data cleaning, which removes things that should not be in the data, data processing takes raw data and performs one or more operations on it to produce new data that often provides deeper insights into its meaning.

Although buyers will often just want the raw data, some projects will either want data that requires processing, or will explicitly ask for the data to be processed.

Although the drawback of having to process data is the increased time and complexity required, it is reasonable to apply higher costs to the data that you sell if it has been processed, especially if the end product is particularly uncommon and can be used in an application that has a high financial potential.
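As a concrete (and deliberately simple) illustration of processing, the function below derives a new series from a raw one - a simple moving average, the kind of smoothed view a buyer of raw price data might ask for. It is just one example of an operation that turns raw values into derived data:

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Derive a simple moving average: a new, smoothed series
    computed from the raw one."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    for i in range(len(values) - window + 1):
        chunk = values[i : i + window]
        averages.append(sum(chunk) / window)
    return averages
```

Here `moving_average([1, 2, 3, 4], 2)` yields `[1.5, 2.5, 3.5]` - three derived datapoints that did not exist in the raw series.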

Step Five: Selling the Data

This is the good part - time for the payoff after going through the effort of finding the data and processing it.

One of two scenarios will apply at this point: either you've sourced the data with a specific buyer in mind, or you've sourced it with the intention of finding buyers afterwards.

If you already have an agreement with a buyer, then the process of packaging the data for transfer is typically fairly simple. It's your choice whether you ask for payment first or provide the data first, and using escrow is always an option, especially if the data is highly valuable.

Another option is to provide a few entries from the dataset as proof that you have the data, and as a sample for the buyer to assess. After you have sent that to them, they can then pay for the data, after which you can release the entire dataset to them.
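Packaging such a preview can be as simple as the sketch below: a few entries plus the total record count, so the buyer can verify the scope and quality of the dataset without receiving all of it. The function name and structure are my own choices for illustration:

```python
def export_sample(dataset: list, n: int = 5) -> dict:
    """Package a small preview: the first n entries plus the total
    record count, so a buyer can assess scope without the full data."""
    return {"total_records": len(dataset), "sample": dataset[:n]}
```

For instance, `export_sample(list(range(100)), 3)` returns a preview of three entries alongside a count of 100, which is enough for a buyer to gauge the dataset before paying.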

If you have sourced the data without a specific buyer in mind, you have the option of searching and outreaching to potential buyers, or you can list the data that you've collected in a data market.

As with the Data Requests section of our data market that I mentioned above, we also have a section to allow data brokers to list any data which they have for sale at the Emerging Technology Data for Sale section.

Here you can upload your dataset, which we will then virus check in order to ensure it is safe to be sold. We will also confirm the description of the data which you're selling, and then add your listing for members of the community and external data buyers to come and find.

Step Six: Process Optimization

The final optional step of the process is to look at how you can optimise your methodology in order to improve your efficiency and effectiveness, and to cut your costs.

For example, if you are collecting a lot of data from cryptocurrency exchange APIs and you know that you will be collecting more in the future, then designing the script you're using to pull data to be versatile and to be able to be used for a wide range of APIs with only minor changes will allow you to reuse the code you have and to save a lot of time.
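One way to get that versatility - sketched here with entirely hypothetical exchange endpoints - is to keep the per-source details in a small configuration table, so adding a new exchange is a one-line change rather than a new script:

```python
import json
import urllib.request

# Hypothetical endpoint templates -- one entry per exchange you pull from.
# Adding a new data source means adding one line here, not writing a new script.
ENDPOINTS = {
    "exchange_a": "https://api.exchange-a.example/v1/ohlc?pair={pair}",
    "exchange_b": "https://api.exchange-b.example/markets/{pair}/candles",
}

def build_url(source: str, pair: str) -> str:
    """Resolve a source name and trading pair into a request URL."""
    return ENDPOINTS[source].format(pair=pair)

def pull(source: str, pair: str) -> list:
    """Fetch JSON records from the configured source."""
    with urllib.request.urlopen(build_url(source, pair), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The same `pull` function then serves every exchange in the table, which is exactly the kind of code reuse this step is about.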

If there's a cleaning process you find yourself using repeatedly, turn it into a documented, step-by-step system, so that you know exactly what to do ahead of time and barely need to think about the process at all.

As with anything, optimisation can have a big impact on the way that you're operating and can be one of the easiest ways to improve your profitability in the long run.

Thank you for reading through this guide on the Emerging Tech Monetization (ETM) Method known as Small-to-Medium Scale AI Data Brokerage.

ETM is a brand new way of making money with emerging technologies like AI, VR, cryptocurrency, blockchain, 3D printing, the Internet of Things, and more. If you're interested in these technologies, or if you'd like to start making money online in this exciting and cutting-edge industry, why not Join our Community and start getting involved with ETM?

If you'd like to contact us directly, please use the Contact Link. If you'd like to learn more about ETM, check out The ETM Manifesto and The ETM Genesis Guide.