Expensive Data and Cannibalism

Or... Why some data products are priced so high

Aug 04, 2023

About Data Boutique

Data Boutique is a web-scraped data marketplace.

If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.

Join our Platform to learn and interact about this project:

Join Data Boutique

Why Are Some Data So Expensive?

Is the price of a dataset linked to real costs, or is it driven by other forces? Why are some datasets so expensive - even in web scraping?

Some proprietary datasets have artificially high prices because the buyers want to hold competitors off, and pay a hefty premium for exclusivity.

Other datasets instead are offered at high price points, limiting the number of buyers, even if:

There is no exclusivity (anyone can access it)
There are plenty of alternative sources (web-scraped data is the perfect example)
The ROI for the final user hardly justifies those price points

Until some time ago, I thought: They’re expensive to collect (large website, hard to access), so customers pay more to have them. As simple as that.

But that is only partially true. Data is an asset with unlimited inventory: You can copy a dataset as many times as you want. Pricing it high seemed myopic: You sell to the few willing to pay for it and give up the rest of the market, who will either start scraping it in-house or give up on the idea, and will never come back to you.

The Satellite Imagery Story

Then I understood it. It’s not a pricing choice made by providers. It’s a pricing trap they’re stuck in.

I attended an alternative data conference in New York in 2019. During a speech, the question arose: Why is satellite imagery still so expensive, even after all these years?

The answer is disarmingly simple: It is expensive because once priced the first time, it’s trapped by the product cannibalization risk.

Let me explain: Sending a satellite into orbit is expensive, even considering SpaceX’s contribution to lowering payload costs per kg. Government contractors were the first clients that could afford it, also given the military implications. And they did so at government-contract prices: High enough to repay the satellite costs and make a profit.

Now providers are trapped: No one other than the government and a few hedge funds would buy at those prices, but providers can’t drop the price, as the drop would have to be so significant that existing clients would switch to the new price.

Result: Immediate net revenue loss. The exact definition of product cannibalism.
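To put rough numbers on the trap, here is a minimal Python sketch. Every figure in it is invented for illustration (the buyer count and both prices are assumptions, not real market data), and it assumes the worst case described above: all existing clients switch to the new price as soon as it is available.

```python
import math


def breakeven_new_buyers(existing_buyers: int, old_price: float, new_price: float) -> int:
    """Smallest number of NEW buyers needed at new_price to earn back
    the revenue lost when every existing buyer switches from old_price."""
    if not 0 < new_price <= old_price:
        raise ValueError("expected 0 < new_price <= old_price")
    lost_revenue = existing_buyers * (old_price - new_price)
    return math.ceil(lost_revenue / new_price)


# Hypothetical figures, for illustration only.
existing_buyers = 10
old_price = 100_000  # price per buyer before the cut
new_price = 5_000    # price per buyer after the cut

revenue_before = existing_buyers * old_price
revenue_right_after = existing_buyers * new_price
needed = breakeven_new_buyers(existing_buyers, old_price, new_price)

print(f"Revenue before the cut:      {revenue_before:,}")
print(f"Revenue right after the cut: {revenue_right_after:,}")
print(f"New buyers needed to break even: {needed}")
```

With these made-up numbers, the price cut wipes out 950,000 in revenue overnight, and it takes 190 new buyers just to get back to where the provider started.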

The 5 factors that drive pricing

This happens in web data too. I often find datasets priced very high, even when the underlying websites are very popular (to see how popular a website is, visit The Web Scraping Club’s Discord channels to see all the usual suspects).

Product Cannibalism is among the 5 factors of pricing a data product.

The history of pre-scraped datasets is not so different from that of satellite imagery. The first massive data collections of large-scale websites, such as Amazon, were (and are) quite expensive to perform, considering the entire proxy infrastructure. Web scraping companies custom-built these extractions for large corporations at a high price point that few others could afford.

As time passed, more customers asked for similar extractions: Selling pre-scraped datasets started making sense. But the cannibalization risk posed a serious threat: If pre-scraped datasets were priced too low, existing clients would switch to them. Even if, in the long term, you’d get more sales with the increased customer base, the short term definitely looks like a revenue loss.
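The short-term versus long-term tension can be sketched the same way. The snippet below is only a toy model under stated assumptions: a hypothetical monthly price for each scenario, a fixed number of new buyers per month at the lower price, no churn, and no new buyers at all at the higher price. None of these figures come from real data.

```python
def cumulative_revenue(price: int, start_buyers: int,
                       new_buyers_per_month: int, months: int) -> list[int]:
    """Cumulative revenue at the end of each month, assuming buyers pay
    `price` every month and the buyer base grows by a fixed amount."""
    total = 0
    buyers = start_buyers
    history = []
    for _ in range(months):
        total += buyers * price
        history.append(total)
        buyers += new_buyers_per_month
    return history


MONTHS = 24  # hypothetical horizon

# Scenario A: keep the high price, keep the same 10 clients.
high = cumulative_revenue(price=10_000, start_buyers=10,
                          new_buyers_per_month=0, months=MONTHS)

# Scenario B: cut the price by 90%, existing clients switch,
# and 20 new buyers join every month (an assumption).
low = cumulative_revenue(price=1_000, start_buyers=10,
                         new_buyers_per_month=20, months=MONTHS)

# First month in which the low-price scenario has caught up overall.
crossover = next(
    (month for month, (h, l) in enumerate(zip(high, low), start=1) if l >= h),
    None,
)

print(f"Month 1 cumulative revenue: high={high[0]:,} low={low[0]:,}")
print(f"Low-price scenario catches up in month: {crossover}")
```

Under these assumptions, the lower price earns a tenth of the revenue in month 1 and only catches up on a cumulative basis in month 10. That gap is what makes the move look like a pure loss in the short term, even when the larger customer base wins later.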

This way, prices stay high, and the data remains a privilege of a few large corporations.

On Revenue Loss

If the data is proprietary, staying on the high end of the price spectrum may cost you nothing worse than long-term revenue stagnation: Few or no new customers and little revenue growth. If your data is relevant to the buyer and no alternative comes out, you can keep the price high and live on.

But… If the data is not proprietary or is replaceable by someone else… it’s a conversation providers need to have. Quoting Steve Jobs, “If you don’t cannibalize yourself, someone else will.”

Web-scraped data, especially from public websites, is a commodity: Anyone can hire a freelancer on Upwork and assign the web scraping.

Providers are already losing more revenue to the in-house alternative than they’d lose to product cannibalism.

Exiting the Trap

Exiting the price trap takes two actions. On the custom-extraction end, ensure custom features add real value, provide a higher level of service, and enrich data more: There must be an undeniable advantage in that extra cost when compared to standard datasets.

On the other end, provide datasets - raw, unenriched, as they are off the exhaust of the scraper - at fair market prices.

With the right timing, this move will ensure the right unit volumes for these datasets and a bigger share of the new, larger audience. When we started the marketplace, we were not looking for lower prices. We were looking for a bigger market.

Join the Project

That was it for this week!

Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list will allow sellers to correctly size the demand for datasets and onboard the platform.

More on this project can be found on our Discord channels.

Thanks for reading and sharing this.