Should Websites Open Access to Data?
Looking at web data from the website perspective
About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our platform to learn more about this project and discuss it with us.
Why Does Web Scraping Exist?
Web scraping - a series of techniques aimed at collecting data from websites - exists because… there is no easier way to get data.
Websites expose information publicly because their business requires it (showing the price of a product they want to sell, or publishing a job opening when they need to recruit).
Other actors - not directly interested in buying a product or applying for a job - find this information useful and want to collect it. Maybe they’re competitors, suppliers, business partners, market analysts, AI developers, or smart users wanting to outwit the algorithm.
Since they need this information, and it is publicly available, technically feasible to collect, and in many cases legal to collect, they go ahead and collect it.
In the absence of a dedicated place to get this data, collection happens through the same interface designed for selling a product or promoting a job post, which was certainly not built for bulk data access.
This causes pain on both sides: websites get overloaded with traffic they neither need nor want, and web scrapers struggle to hammer something designed for a completely different purpose into a usable data format.
It would be so much easier for everyone if websites built a separate place to get this data.
Should Websites Actively Give Away Data?
To a certain extent, no, they shouldn’t. Information is power, and obstructing access to data helps preserve it. But blocking access has its own problems:
Sometimes they need to open access: They want Google, other search engines, and product directories to reach their data freely and easily so that their products get ranked.
Blocking is extremely difficult to enforce: You can make web scraping more expensive, but you cannot stop it.
The harder you try, the worse the experience gets for the intended user (hello, Captcha!).
So maybe the answer should be “yes”: websites can’t prevent it from happening anyway, and opening access could bring more advantages than refusing to.
Why it Makes Sense to Open Data Access
What would a website gain from offering a structured outlet for its own public data?
It happens anyway: Refusing to open access does not stop data from spreading, so data availability stays the same either way. The data exchange business exists whether or not websites take part in it. If your data is being traded, you might as well be part of the trade.
Monetization opportunity: Opening data access does not necessarily mean giving it away for free. Web scraping costs time and money, so charging a fee for the data makes sense. If the website’s data is in high demand, this could become an additional revenue stream, especially with the volumes linked to the rise of AI.
Bot reduction: When a website opens a dedicated data endpoint, fewer bots need to fight their way past anti-bot measures, which relieves pressure on the website. This wouldn’t remove the need for anti-bot measures, since the website still has to protect itself from malicious and fraudulent traffic, but it would have a serious impact.
Transparency: An e-commerce operator is perceived as more transparent when it shares first-hand information with business partners, suppliers, investors, and customers. One advantage is that you know which numbers your stakeholders are looking at (and are not surprised by investors’ independent research). Again, this might seem scary, but the outside world, competitors included, has this data anyway (think of that dynamic price management tool you are using yourself).
At Data Boutique, we have registered interest from some websites that are open to sharing first-hand information on the marketplace, and we found that they largely share our view on these 4 reasons.
While we do not expect the majority of websites to embrace this philosophy, we still promote an open discussion about it.
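To make the idea of a structured, paid data outlet more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any real website or Data Boutique does it: the catalog, the API key, the `DataOutlet` class, and the per-record fee are all hypothetical names and values invented for this example. It shows a website serving its already-public data in pages to registered clients and metering how many records each client pulled, so a fee could be charged.

```python
import json
from dataclasses import dataclass, field

# Hypothetical catalog: the same public data a shop already shows on its pages.
PRODUCTS = [
    {"sku": "A-100", "name": "Canvas tote", "price": 19.90, "currency": "EUR"},
    {"sku": "B-200", "name": "Leather belt", "price": 34.50, "currency": "EUR"},
    {"sku": "C-300", "name": "Wool scarf", "price": 27.00, "currency": "EUR"},
]

FEE_PER_RECORD = 0.001  # assumed price in EUR per record served


@dataclass
class DataOutlet:
    """Serves public catalog data to registered clients and meters usage."""

    api_keys: set
    usage: dict = field(default_factory=dict)  # api_key -> records served so far

    def export(self, api_key, offset=0, limit=100):
        """Return (status_code, json_body) for one page of the catalog."""
        if api_key not in self.api_keys:
            return 401, json.dumps({"error": "unknown API key"})
        page = PRODUCTS[offset:offset + limit]
        # Meter usage so the website can bill for the data it serves.
        self.usage[api_key] = self.usage.get(api_key, 0) + len(page)
        body = {
            "offset": offset,
            "count": len(page),
            "total": len(PRODUCTS),
            "records": page,
        }
        return 200, json.dumps(body)

    def amount_due(self, api_key):
        """Fee owed by a client, based on the records served to it."""
        return round(self.usage.get(api_key, 0) * FEE_PER_RECORD, 6)


# Example usage with a made-up client key.
outlet = DataOutlet(api_keys={"partner-123"})
status, body = outlet.export("partner-123", offset=0, limit=2)
print(status, json.loads(body)["count"], outlet.amount_due("partner-123"))
```

In a real deployment this logic would sit behind an HTTP endpoint separate from the storefront, which is what would take bulk collection traffic off the pages meant for shoppers.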
Join the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add a request if a website is not listed. Saving datasets to your interest list helps sellers size the demand for them correctly and decide to onboard onto the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.
This post leaves aside all implications of nonpublic, personally identifiable, or IP-protected data.