AI Depends Massively on Web Scraping

And web data sourcing is now a critical problem

Jul 07, 2023

About Data Boutique

Data Boutique is a web-scraped data marketplace.

If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.

Join our platform to learn more about this project and interact with the community:

Join Data Boutique

AI Depends Massively on Web Scraping

Three major events occurred recently: Reddit’s API pricing change, Twitter’s API access change, and the lawsuits filed against OpenAI.

All of them have one common theme: The huge dependency of AI on web-scraped data.

And they also reveal a deeper truth: The sourcing of web-scraped data is currently a wild west, with unmeasured and unmanaged risks that range from legal to operational.

Once again, it is a reminder of how critical it is to establish a reliable and safe data market on which future applications can be built.

Let’s take a look at the current risks.

Operational and Financial Risks: The Reddit Case

It is not the purpose of this blog to discuss news, so I’ll link to articles where you can get more in-depth analysis, like these articles in the NYT or on TechCrunch.

In a nutshell, Reddit, which used to give free API access to its content, changed the conditions of the free plan, forcing current users to pay. The main reason is that Reddit’s data was heavily used to train AI, but Reddit received no compensation for it.

Long story short: Either you stop receiving all of that data, or you need to pay.

Operational Risks: The Twitter Case

Twitter reacted in a somewhat similar way, limiting how its data can be used to train AI. Here is more on the news and here on its developments. In this case, the risk seems to be more operational, since Twitter does not want you to have bulk access.

https://business.twitter.com/en/blog/update-on-twitters-limited-usage.html

Legal Risks: The OpenAI case

What happened to OpenAI, instead, raises warnings about the legal risks linked to widespread, unmonitored web scraping. The cases will be interesting to follow, but from what is contained in the papers, the claims appear to concern a violation of privacy laws in one case and of copyright law in the other.

What Does This Tell Us

Web scraping is a common but risky practice, currently performed with little or no risk management.

At Data Boutique, we believe web-scraped data is a commodity and should be treated as such. We promote transparency and openness in data sourcing. This will bring a safer use of data, safer applications based on data, and a functioning data market, just like any other commodity.

Join the Project

That was it for this week!

Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list helps sellers correctly size the demand for datasets and decide to join the platform.

More on this project can be found on our Discord channels.

Thanks for reading and sharing this.

Data Boutique
