Why Data Is *Not* the New Oil and Data Marketplaces Have Failed Us

How a real-time programmatic data exchange would change everything

Clemens Mewald
Towards Data Science

--

The phrase “data is the new oil” was coined by Clive Humby in 2006 and has been widely parroted since. However, the analogy holds merit in only a few respects (e.g. the value of both usually increases with refinement), and data’s broader economic impact has been muted outside of a select few tech and finance companies. The actual differences between oil and data are fundamental.

Most notably, oil is a commodity. Its quality is standardized and measurable, which makes oil from different sources substitutable (in economic terms, it is a “homogeneous good”). It is ubiquitous and has a well-established price. Not least, if you have a barrel of oil, you can’t simply make a copy to produce another: oil is a limited resource that has to be pulled from the ground.

Data, on the other hand, is a heterogeneous good. It comes in unlimited variety and the value of each occurrence cannot be measured objectively. When two parties exchange a good, the seller has to set a price and the buyer has to establish their willingness to pay. This is complicated by two attributes of data:

The marginal cost of selling the same data to another buyer is zero. The cost of producing data is highly variable (sequencing a genome is more costly than taking your temperature), but once the data exists, that cost is sunk. Selling it to another buyer is the simple act of copying it, which, for all practical purposes, costs nothing.

It is hard to establish the value of data without “consuming” it. A database of sales leads is only valuable if it results in actual sales. To make things worse, the value of the exact same dataset is highly dependent on the buyer (or its intended use). In this regard, data is actually closer to “experience goods” like books or vacations.

In this post I will argue that data is one of the most underutilized and, as a result, undervalued goods. I outline a real-time programmatic data exchange that is at the heart of a new company I am advising and that could have a profound impact on the data economy.

“Data is one of the most underutilized and, as a result, undervalued goods.”

Why should we care about the economics of data?

Distinguishing between 1st and 3rd Party Data

No one I know argues against the importance of data. But even though the narrative of “data is an asset” has become quite common, data is probably one of the most underutilized and, as a result, undervalued goods.

When most businesses think about data, they think about data they own. This 1st party data (1PD) is usually collected from websites, CRM/ERP systems, correspondence with customers, etc. Some 1st party datasets are more valuable than others: Google’s trove of search and click history is part of their 1PD corpus.

Image by author

What should be obvious is that the amount of 3rd party data (3PD) in existence, which is data you don’t directly own, is several orders of magnitude larger than your 1PD. The argument I will make is that most people don’t realize the value of 3PD to their business. Let’s use an example to illustrate this point.

Detecting email spam (and why your 1PD alone may not be as valuable as you think)

What do you think is the most predictive signal in detecting email spam? The most common answers include typos, grammar, or the mention of specific keywords like v1agra. A slightly better answer is whether or not the sender is in your contacts. Not because it is particularly accurate (plenty of legitimate senders are not in your contacts), but because it considers a data source outside of the email itself: your contacts.

If only for the purpose of this anecdote, let’s say that the most important signal in detecting email spam is actually the age of the sender’s domain. Once stated, this seems intuitive: spammers frequently register new domains because existing ones quickly get blocked by email providers.

Why don’t most people think of this answer? Because the age of the sender’s domain is not part of your “1st party dataset”, which only contains things like the sender’s and recipient’s email addresses, the subject, and the email body. But anyone who knows something about domain names will tell you that this information is not only readily available but also free. Take the domain, look it up with a domain registrar, and you can find out when it was registered (e.g. gmail.com was registered on August 13th, 1995).
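
To make this concrete, here is a minimal sketch of that lookup as a feature-engineering step. It assumes the third-party python-whois package as the WHOIS client (any client that exposes the registration date would do), and the function name and feature dictionary are purely illustrative.

```python
# A minimal sketch of a domain-age feature for spam detection.
# Assumes the third-party `python-whois` package (pip install python-whois);
# any WHOIS client that exposes the registration date would work.
from datetime import datetime

import whois


def domain_age_days(sender_address):
    """Return the age of the sender's domain in days, or None if unknown."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    record = whois.whois(domain)
    created = record.creation_date
    # Some registrars return a list of dates; take the earliest one.
    if isinstance(created, list):
        created = min(created)
    if created is None:
        return None
    return (datetime.now() - created).days


# gmail.com (registered 1995-08-13) yields an age of roughly 30 years,
# while a freshly registered spam domain yields an age of a few days.
features = {"sender_domain_age_days": domain_age_days("someone@gmail.com")}
```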

As it turns out, the data you own (1PD) is probably much more valuable to you if it is augmented with data someone else owns (3PD).

Image by author

From email spam to quant trading (and beyond?)

Extrapolating from the idea that you can detect email spam better simply by augmenting your dataset with the age of the sender’s domain, you can imagine that there are countless ways to apply the same principle. Below is a simple example of the data you can find from an address alone (at least in the US).

Image by author

Of course, this is not a new idea. Hedge funds have been using “alternative data” for decades. Renaissance Technologies (RenTech) was one of the first firms to utilize alternative data like satellite imagery, web scraping, and other creatively sourced datasets to gain an edge in trading. UBS used satellite imagery to monitor the parking lots of big retailers and correlate car traffic with quarterly revenue, allowing more accurate predictions of earnings before they were released.

You can probably guess where this is going. There are over 300k data providers in the US alone, and likely billions of datasets. Many of them could give you a competitive advantage in whatever you are trying to predict or analyze. The only limit is your creativity.

The (subjective) value of using external data

While the value of external data to quant trading firms is immediate and significant, executives in other industries have been slow to come to the same realization. A thought experiment helps: Consider some of the most important predictive tasks for your business. For Amazon, that could be which product a given customer is most likely to purchase next. For an oil exploration company, it could be where to discover the next oil reservoir. For a grocery chain, it might be the demand for specific products at any given point in time.

Next, imagine you had a magic dial that you could turn to improve the performance of that predictive task and the resulting value to your business. Grocery chains lose approximately 10% of their food to spoilage. If only they could predict demand better, they could improve their supply chain and reduce that spoilage. At about 20% gross margin, every percentage point reduction in spoilage would improve their gross margin by 0.8pp. So, for a company like Albertsons, every percentage point of spoilage eliminated through better demand prediction could be worth an estimated $640M per year. Alternative data could help with that.
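
For readers who want to check the arithmetic, here is the back-of-the-envelope version. The 10% spoilage and 20% gross margin figures come from the paragraph above; the roughly $80B annual revenue figure is my assumption, chosen only so the numbers line up with the $640M estimate.

```python
# Back-of-the-envelope version of the spoilage arithmetic above.
# The ~$80B annual revenue figure is an assumption, chosen only to be
# consistent with the $640M estimate in the text; plug in your own numbers.
annual_revenue = 80e9            # assumed annual revenue (USD)
gross_margin = 0.20              # ~20% gross margin, so COGS is ~80% of revenue
cogs = annual_revenue * (1 - gross_margin)

# A one-percentage-point reduction in spoilage saves roughly 1% of COGS...
savings = 0.01 * cogs
# ...which lifts gross margin by savings / revenue, i.e. ~0.8 percentage points.
margin_lift_pp = 100 * savings / annual_revenue

print(f"savings = ${savings / 1e9:.2f}B, gross margin lift = {margin_lift_pp:.1f}pp")
# savings = $0.64B, gross margin lift = 0.8pp
```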

The same data that saves a grocery chain hundreds of millions of dollars may be worth even more to a commercial real estate developer. However, data marketplaces haven’t been able to extract that value (through price discrimination) because they are far away from the actual business application. They have to put a generic price on their inventory, independent of its eventual use.

Yet, external data has managed to become an estimated $5B market growing at 50% year-over-year, and the marketplaces that trade that data represent another $1B market. This is only a small fraction of the potential market size, for at least two reasons: (1) Although every single company should be able to benefit from 3PD, only the most analytically mature companies know how to leverage 3PD to their advantage. (2) Those who dare to try are slowed down by the antiquated process of discovering and purchasing 3PD. Let’s take a quick detour into the ad buying process to illustrate that point.

What programmatic ads can teach us about how to improve the data economy

The evolution of the ad buying process

Not too long ago, in 2014, programmatic ad buying represented less than half of digital ad spend. How did people buy ads? They told an agency what kind of audience they wanted to reach. The agency looked at the publishers it worked with and their “inventory” (magazine pages, billboards, TV ad slots, …), and put together a plan for where to run a campaign to meet those requirements. After some negotiation, the company and the agency eventually signed a contract. Ad creative would be developed, reviewed, and approved. Insertion orders would be submitted, and eventually the ad campaign would run. A few months later the company would get a report on how the agency thought it went (based on a small, sampled dataset).

Along came Google, which (among others) popularized what is known as programmatic ad buying. Google created its own ad exchange (AdX) that connected inventory from multiple publishers with different ad networks. As users performed searches or visited websites, it ran a real-time auction (yes, within the time it takes to load a webpage) that pitted all advertisers against each other and picked the highest bidder to display their ad (who, under the second-price rule, paid the runner-up’s bid).
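
For intuition, here is a minimal sketch of the sealed-bid second-price (Vickrey) rule described above. It is deliberately simplified: real exchanges layer quality scores and price floors on top, and many have since moved to first-price auctions.

```python
# Minimal sketch of a sealed-bid second-price (Vickrey) auction, the pricing
# rule described above. Real exchanges layer quality scores and price floors
# on top, and many have since moved to first-price auctions.
def run_auction(bids):
    """bids: {advertiser: bid}. Returns (winner, price_paid)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top_bid = ranked[0]
    # The winner pays the runner-up's bid (or their own bid if unopposed).
    price = ranked[1][1] if len(ranked) > 1 else top_bid
    return winner, price


winner, price = run_auction({"adv_a": 2.50, "adv_b": 1.75, "adv_c": 0.90})
# adv_a wins the impression but pays adv_b's bid of $1.75.
```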

And just like that, ad buying went from a months-long ordeal with lots of humans involved and very little transparency to a real-time transaction that both set prices (through the auction) AND gave instant measurement of impressions (and sometimes even conversions). This level of velocity, liquidity, and transparency led to an explosion in the online advertising market, and programmatic ad buying now represents close to 90% of digital advertising budgets.

The antiquated data buying process

As it turns out, buying data today is even more painful than buying ads 20 years ago.

Image by author

Discovery: First, you need to become aware of the fact that 3PD could be extremely valuable to you. Remember the email spam example? Next, you need the creativity to think of all of the possible 3PD you could use to augment your 1PD. Would you have considered satellite images of parking lots to predict retailers’ revenues? Then you have to go to all of the data providers and search for what you think you need. You will find that most “data marketplaces” are basically just free-text search over descriptions. Next you’ll have to look at the schema of the data to see if it contains what you are looking for, at the granularity you need (e.g. sometimes you need foot traffic minute-by-minute as opposed to just hourly), and with the right coverage (e.g. for the right date range or geo region).

Procurement: Once you find what you think you need, you have to figure out how to procure that data. You’ll be surprised to learn that it’s not always a simple “click-to-buy” affair. You have to talk to the data provider, learn about data licenses (can you even use this data for the intended purpose?), negotiate terms, and sign a contract. You repeat that process several times for different 3PD from different providers who all have different contracts, terms, and licenses. You wait to receive the data on floppy disks in your mailbox (just kidding).

Integration: Finally you have the data you wanted. You wait a couple of weeks while your data engineering teams join it with your 1PD, only to learn that it’s not actually as useful as you had hoped. The time and money you spent are wasted and you never try again. Or, even more agonizingly, you find out that the 3PD does give you a meaningful improvement and you go on to productionize your predictive models, only to find out that you need fresh data on an hourly basis and that one of the data sources you used is only updated weekly. If you ever try again, you now know that, in addition to checking granularity based on the schema, you have to consider refresh rates.

This process can take anywhere from several months to more than a year. In an attempt to build a faster horse, some consulting firms are suggesting that the solution is to hire entire “data sourcing teams” and create relationships with data aggregators.

The data economy needs a real-time programmatic data exchange

The reason I invoked the programmatic ad buying example is my strong conviction that the data economy can evolve in the same manner, which would result in a comparably profound economic impact.

Discovery and Procurement: Consider a data exchange that brought all of the data providers (the “inventory”) together and rationalized licenses so that it could facilitate transactions programmatically. Data consumers would provide their 1PD and express the task they are interested in (e.g. predict demand) as well as the value they put on each unit of improvement (remember that 1pp of improvement in demand prediction is worth $640M to Albertsons?). The data exchange would automatically identify which 3PD would provide a measurable improvement to that task, run a real-time auction based on the data consumer’s budget, and optimally choose the subset of 3PD that meets their requirements. This proximity to the actual task (and its associated value) would solve the discovery and value-extraction problems of existing data marketplaces, which have to treat data as a commodity rather than the experience good it is.
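
To make the mechanics more tangible, here is a highly simplified sketch of the selection step, assuming the exchange has already measured each candidate dataset’s incremental lift on the consumer’s task. All names and numbers are hypothetical, lifts are treated as additive for simplicity, and the pricing auction itself is abstracted into a per-dataset price.

```python
# A highly simplified sketch of the selection step described above.
# Assumes the exchange has already measured each candidate 3PD's incremental
# lift on the consumer's task. All names and numbers are hypothetical, lifts
# are treated as additive, and the auction is abstracted into a per-dataset price.
from dataclasses import dataclass


@dataclass
class Candidate:
    dataset: str
    lift_pp: float   # measured improvement on the task, in percentage points
    price: float     # clearing price for this transaction (USD)


def choose_datasets(candidates, value_per_pp, budget):
    """Greedily pick datasets with positive net value until the budget runs out."""
    chosen, spent = [], 0.0
    # Rank candidates by value produced per dollar spent.
    for c in sorted(candidates, key=lambda c: c.lift_pp * value_per_pp / c.price,
                    reverse=True):
        if c.lift_pp * value_per_pp > c.price and spent + c.price <= budget:
            chosen.append(c)
            spent += c.price
    return chosen


picks = choose_datasets(
    [Candidate("foot_traffic", lift_pp=0.4, price=1.2e6),
     Candidate("weather", lift_pp=0.1, price=2.0e6)],
    value_per_pp=640e6,  # 1pp of demand-prediction accuracy (the Albertsons example)
    budget=5e6,
)
# picks -> both datasets, since each returns far more value than it costs.
```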

Continuous Integration and Improvement: Because the most valuable predictive tasks are continuous in nature (e.g. you need to predict demand on a regular basis, not just once), the exchange would become the center of repeated transactions that provide more value over time as new data providers and consumers enter the ecosystem. Running the auction every time you perform a predictive task (and not just once, when you decide which data to buy) would ensure that new data providers reach distribution immediately, and that data consumers benefit from the most recent data inventory and price discovery. Just as ad buying evolved from an offline, manual process to a real-time, programmatic one, data transactions would become real-time, programmatic, and, most importantly, measurable.

Image by author

This “real-time programmatic data exchange” would provide economic incentives for all participants in the marketplace:

  • Both data providers and consumers would benefit from improved discoverability. Data marketplaces have a long-tail problem: There is a massive amount and variety of data, and it’s almost impossible using existing methods to discover the most relevant data for any given task/application.
  • Standardizing terms and licenses, so that transactions could happen programmatically, would improve the velocity and liquidity of the data economy, eliminating friction in the purchasing process and opening it up to a broader audience. As a result, the overall market would expand significantly.
  • By setting prices through an auction based on each data consumer’s subjective value, consumers get a better deal when multiple data providers offer comparable data, and providers can price-discriminate across consumers who value the same kind of data differently.
  • Aggregating demand from data consumers on one platform would provide invaluable insights for data providers. E.g., given all of the tasks and willingness-to-pay from the demand side, the data exchange could infer exactly which data is missing from the provider side, helping prioritize data acquisition and creation. Take note, synthetic data providers!

Hard problems that need to be solved

In addition to solving discoverability and pricing for data, much like what Google did for advertising, this programmatic data exchange also needs to tackle licensing and delivery, not unlike what Spotify did for music. But if there weren’t a number of hard problems to solve, it wouldn’t be as interesting or meaningful an endeavor.

Commercial

  • Data licensing is relatively new. From what I can tell there is not a lot of standardization in data licensing. Every data provider has their own special flavor of licenses that are incompatible with others. In order to facilitate an exchange, licensing needs to be streamlined.
  • Data marketplaces may fear disintermediation. The data ecosystem is complicated. For data providers, this would be an entirely new distribution channel. They are painfully aware of the discoverability problem, and this exchange could open up the market to millions of new consumers who wouldn’t otherwise have considered alternative data. Data marketplaces and aggregators, on the other hand, are the record-label equivalent and may want to block data providers’ direct access to a programmatic data exchange.
  • Introducing a new pricing model to an “old” industry is hard. The liquidity mechanisms of a programmatic exchange would significantly expand both the demand and supply side, and the pricing mechanism would optimize value capture. In aggregate, a programmatic data exchange would be a win for data providers.

Technical

  • Semantic type detection is stuck in the past. In order to automatically identify which datasets could be joined, you first need to understand the semantic type of the data. E.g. is something just a number, a zip code, or a currency amount? Most semantic type detection is heuristic-based, but there are more modern approaches.
  • You can’t brute-force data discovery. It turns out that there is a lot of data. The naive approach to finding out which 3PD most benefits your task would be to simply “try out” all of the data to identify which one provides most value. Thankfully there are modern breakthroughs in fields like information theory and data summarization that make this problem tractable.
  • Joining data is hard. Once you know the semantic types and you have a mechanism to identify which 3PD would provide a meaningful benefit, you have to join 1PD and 3PD in interesting ways. Weather data may come with the longitude and latitude of a weather station that doesn’t match the airport you want to predict flight delays for. Or foot-traffic data may come hourly, and you need to figure out whether to use an average, a max, or an nth percentile for your daily aggregate (see the sketch after this list).
  • Data security. Data providers don’t like giving away their data (because it can be replicated so easily). However, there are techniques (like federated learning) that allow predictions to be augmented while keeping the raw data under the provider’s control and preserving privacy.
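
As an illustration of the aggregation choice mentioned in the “joining data is hard” bullet, here is a minimal pandas sketch that rolls the same hourly foot-traffic series up to daily features in three different ways; all column and index names are hypothetical.

```python
# Minimal sketch of the aggregation choice from the "joining data is hard"
# bullet: the same hourly foot-traffic series rolled up to daily features in
# three different ways. All column and index names are hypothetical.
import numpy as np
import pandas as pd

hourly = pd.DataFrame(
    {"visits": np.random.default_rng(0).poisson(120, size=14 * 24)},
    index=pd.date_range("2024-01-01", periods=14 * 24, freq="h"),
)

daily = hourly["visits"].resample("D").agg(["mean", "max", lambda s: s.quantile(0.9)])
daily.columns = ["visits_mean", "visits_max", "visits_p90"]

# `daily` can now be joined onto a daily 1PD table (e.g. per-store sales) on the
# date index; which aggregate actually helps the model is an empirical question.
```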

I believe that the impact of a real-time programmatic data exchange will be profound, and thankfully, recent advancements in AI provide solutions to the challenges outlined above. I, for one, look forward to a future in which data is treated as the experience good it is, rather than as a commodity.

Opinions expressed in this post are my own and not the views of my employer.

--

Clemens is an entrepreneurial product leader who spent the last 8+ years bringing AI to developers and enterprises.