Web Data Is A Commodity And Needs A Marketplace To Grow
The Advantages Of Central Markets
About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our platform to learn about and discuss this project.
Web Data Is A Commodity
Information on public[1] web pages is accessible at no cost to anyone with an internet connection, anywhere in the world.
Some people pay others to collect this information and deliver it in a convenient format. That service is called web scraping, and it is becoming a commodity.
Being a commodity means that if we ask 100 web scrapers to collect the price of this product page on the exact same day (June 19th, 2023), we obtain the same answer 100 times: 3,350 USD. There is no difference among those 100 scrapers.
This information has been publicly disclosed by the website louisvuitton.com, and they made it accessible to anyone.
For those who paid for the scraping, the value of this information does not change based on who collected/scraped it.
Even if louisvuitton.com itself were to offer a file with this data, it would carry no additional value (i.e., it could not command a higher price) than independent web scrapes under like-for-like conditions.
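To make the fungibility point concrete, here is a minimal sketch. The HTML snippet and the `price` selector are made up, standing in for a real product page: two independently written scrapers extract the same field from the same page snapshot, so their outputs are interchangeable and neither can charge a premium.

```python
# Sketch: fungibility of scraped data. Two differently implemented "scrapers"
# extract the price from the same (hypothetical) product-page HTML.
from html.parser import HTMLParser
import re

PAGE = '<div class="product"><span class="price">3,350 USD</span></div>'

def scraper_a(html: str) -> str:
    """Scraper A: quick regex-based extraction."""
    return re.search(r'class="price">([^<]+)<', html).group(1)

class _PriceParser(HTMLParser):
    """Scraper B's helper: proper HTML parsing with the stdlib parser."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self.price is None:
            self.price = data

def scraper_b(html: str) -> str:
    """Scraper B: parser-based extraction."""
    parser = _PriceParser()
    parser.feed(html)
    return parser.price

# Any two correct scrapers agree: the product is fungible.
assert scraper_a(PAGE) == scraper_b(PAGE)
```

The implementations differ completely, but the buyer cannot tell the outputs apart, which is exactly the Wikipedia definition of a commodity quoted below.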
According to Wikipedia:
[…] a commodity is an economic good, usually a resource, that has full or substantial fungibility: that is, the market treats instances of the good as equivalent or nearly so with no regard to who produced them […].
So yes, web data has the features of a commodity.
The Need For A Market
OK, it's a commodity. Why does this matter?
It matters because this technology is:
Growing (thanks for the heads-up on the influence of AI as an accelerator for web data)
Still in its Wild West phase: cheap technology and widespread internet access have spread web scraping on a global scale, and almost everyone is doing it, planning to do it, or failing while trying to do it.
So, while current web scraping solutions focus on the technical side, a common marketplace is still in its early days.
Let’s see why we are encouraging this change in the industry and what the benefits of a data-as-a-commodity market are:
The enormous fragmentation of web data (everyone can access it, and the tools for collecting it are available to anyone) causes great heterogeneity in formats, quality, types of data, and much more.
An independent entity acting as a regulator benefits buyers by setting quality standards for data, collection processes, and monetary transactions. To unlock the global adoption of web data, the industry needs to instill trust among all actors.
Without a marketplace, all due diligence and quality controls have to be conducted at every purchase, as happens today.
Rules and regulations serve only one purpose: attracting a larger audience and creating a bigger market. In other words, liquidity:
Liquidity for buyers means they can participate in the market because they find what they need at the price they are willing to pay.
Without a marketplace, every buyer has to negotiate acquisition prices individually with providers (often internal teams).
Liquidity for the sellers means they can find multiple buyers for the same dataset they are collecting.
Without a marketplace, extraction costs cannot be spread across multiple customers, and no productization is possible.
Liquidity (which only regulation can enable) is the prime and most powerful factor in ensuring buyers and sellers trade at Fair Market Value for whatever goods or services they exchange, and web data is no exception.
A data pipeline can be complex. While our first example of the Louis Vuitton bag was simple, things get ugly pretty fast: cross-website terminology, language differences, and changes over time, just to name a few, make data usage very painful.
Raw[2] web data is nice, but more needs to be done to make it useful.
One great advantage of having an independent marketplace for data is that you can have sellers offering raw data and other sellers deriving products based on the original raw data.
Without a marketplace, every transformation performed on raw data has to be commissioned and negotiated case by case. A market for derived works enables competition and raises the quality of those works.
Let's give an example: we have raw product-review data from an electronics e-commerce store. Each record contains the review comment, the user, the time, and the final rating.
A derived work would be a sentiment score for each review. It would be provided by an independent seller, in line with the marketplace's regulations, so that the buyer of this derivative work knows how the information was treated (e.g., whether it complies with the European GDPR or the CCPA).
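The raw-to-derived pipeline above can be sketched as follows. The review rows and the tiny word-list scorer are hypothetical (a real seller would use a proper sentiment model), but the shape is the point: raw rows in, the same rows plus a derived sentiment column out.

```python
# Sketch: a derived-work product built on top of raw review data.
import re
from dataclasses import dataclass

@dataclass
class Review:  # the raw dataset as listed on the marketplace
    user: str
    time: str
    rating: int
    comment: str

# Hypothetical mini-lexicon; a real derived product would use a trained model.
POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"broken", "poor", "refund"}

def sentiment_score(comment: str) -> float:
    """Naive lexicon score in [-1, 1]: (pos - neg) / matched words."""
    words = re.findall(r"[a-z']+", comment.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def derive(raw: list[Review]) -> list[dict]:
    """The derivative dataset: raw fields plus the added sentiment column."""
    return [
        {"user": r.user, "rating": r.rating, "sentiment": sentiment_score(r.comment)}
        for r in raw
    ]

rows = derive([
    Review("u1", "2023-06-19", 5, "Great screen, love it"),
    Review("u2", "2023-06-20", 1, "Arrived broken, poor support"),
])
```

Because the derivation is a separate product, different sellers can compete on the quality of the scoring step while the raw dataset underneath stays a commodity.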
Web Data: Buying Instead Of Scraping It
The core implication of having a marketplace is that businesses can buy web data instead of scraping it (or paying for a scraping project), and scrapers can scale their operations and efficiency.
The advent of AI has pushed the need for fast, large-scale, high-quality data, and internal production cannot keep pace at this scale. Wild West mode is no longer fit for current data needs.
A well-functioning marketplace with transparent, independent regulations and quality standards will enable faster, safer exchanges of web data, promote ethical data collection, and create a new ecosystem for data treatment, which today exists only within large corporations.
We are building this at Data Boutique: An independent place where web data is handled and traded like a commodity. Follow our project and share this post if you want to learn more.
Join the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add a request if a website is not listed. Saving datasets to your interest list lets sellers correctly size the demand for datasets and onboard the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.
[1] Data displayed on web pages accessible without credentials, in anonymous mode, and not blocked by clickwrap terms of service.
[2] "Raw" here does not mean the raw HTML code; it means items collected as shown on the website, with no further transformation.