The Impact of ChatGPT on the Web Scraping Industry
Access to live web data is just the tip of the iceberg: advanced analytics and enterprise models push web data adoption
About Data Boutique
Data Boutique is a marketplace for web-scraped data, because the smartest way to get this data is to ask those who already collect it.
November 2023 marks one year since the public release of ChatGPT. The model is trained on large-scale web-scraped data and can now access the internet in real time.
Given the latest trends, what are the impacts on this data economy?
tl;dr: ChatGPT doesn’t get rid of the need for web data, but it speeds up both how you get it and how you get insights out of it.
Does ChatGPT solve the web data need?
Do we still need to extract data from the web when we can ask ChatGPT?
Unfortunately, while ChatGPT has access to a lot of data, it doesn’t have all the data you need, doesn’t hold it in the format you need, and can’t deliver it the way a large-scale web-scraping use case requires.
Think of applications like revenue optimization for hotels: features like Browse with Bing or the Kayak plugin offer some relief here, but every effort hits a wall when you try to build something intrinsically more robust than search-engine results.
When researching the fashion industry, the Shein discovery ChatGPT plugin offers some basic-level results, but its primary purpose is to make you shop on Shein, not to help craft your brand strategy.
When building your own LLM, or a market-intelligence SaaS, you need data extraction from the web - and ChatGPT’s help can go a long way:
It can speed up the extraction problem (data supply)
It can help us make sense of the data once we have it (data demand)
Supplying the data
Without going into detail, systematically extracting data from websites burns resources in two ways:
Time - writing the code, executing it, and checking the results for completeness absorbs FTEs heavily, and the job is considerably more complicated than it was just five years ago. The most innovative applications use generative AI to write the extraction code faster, to flag quality issues in the retrieved data for easier code maintenance, and to cut the software’s time to market while keeping reliability high (see the sketch after this list). A private version of the model (such as the ChatGPT Enterprise plan) can help build proprietary assets.
Money - websites often have costly anti-scrape measures that require costly anti-anti-scrape countermeasures. Little can be achieved on this side of the problem with generative AI. It can teach you how to fish, but it still remains a damn expensive sport.
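To make the time-saving point concrete, here is a minimal sketch of LLM-assisted extraction. It assumes the OpenAI Python SDK (v1.x); the model name, prompt, and product fields are illustrative choices, not a prescribed setup. Instead of maintaining brittle hand-written selectors, the scraper hands raw HTML to a model and asks for structured fields back, so small layout changes break less code.

```python
# Minimal sketch: LLM-assisted field extraction inside a scraper.
# Assumes the OpenAI Python SDK (v1.x); model, prompt, and fields are
# illustrative. Reads OPENAI_API_KEY from the environment.
import json
from openai import OpenAI

client = OpenAI()

def extract_product(html: str) -> dict:
    """Ask the model for structured fields instead of hand-coded selectors."""
    prompt = (
        "From the HTML below, extract the product name, price (number only), "
        "and currency. Reply with a single JSON object and nothing else.\n\n"
        + html
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative; any capable chat model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output stable enough to parse
    )
    return json.loads(resp.choices[0].message.content)
```

In practice you would batch requests and validate the returned JSON, and note that this pattern trades selector maintenance (time) for per-page inference costs (money) - the same trade-off described above.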
Using the data (demand)
The use of AI in business intelligence has long been explored by many platforms, yet with results that were rarely satisfying.
Today, ChatGPT’s Advanced Data Analysis feature (formerly Code Interpreter) and plugins like Noteable let you get a lot done on this side:
You can upload your data (the web data you just collected, or had someone collect for you) or connect to your database (often the better choice, since the upload limits fall well short of anything resembling “big data”)
You can interact directly with the data through textual prompts, which is very useful, especially when first onboarding a dataset and when designing quality-assurance and ingestion processes (see the sketch after this list)
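As a hedged illustration of that onboarding and QA step, the sketch below shows the kind of pandas code Advanced Data Analysis typically writes and runs behind the scenes when asked to sanity-check an uploaded file; the file name and columns are hypothetical.

```python
# Illustrative first-pass QA on a freshly scraped dataset - the sort of
# pandas code Advanced Data Analysis generates from a textual prompt.
# The file name and column names are hypothetical.
import pandas as pd

df = pd.read_csv("scraped_prices.csv")

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "nulls_per_column": df.isna().sum().to_dict(),
    "negative_prices": int((df["price"] < 0).sum()),
    "distinct_currencies": int(df["currency"].nunique()),
}
print(report)
```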
In their current state, these technologies work best alongside established data lake/data warehouse stacks and business intelligence tools like Power BI or Tableau, and will likely serve as co-pilots rather than replacements.
The main advantages we have seen are:
Boost in analytics capacity for one-off analyses, like M&A due diligence - traditionally incompatible with the timeline of building a custom BI project - allowing larger amounts of data to be processed far faster
Faster data onboarding on long-term projects, which, combined with private versions of the model like ChatGPT Enterprise, allows the use of internal data and the building of proprietary models.
Using the data to train LLMs (more demand)
Finally, a word on the most data-hungry applications in town: LLMs. Perhaps the strongest effect we have seen on web scraping has been the rise of new LLMs inspired by ChatGPT.
Training models is a data-devouring activity, which for many models means scraping, scraping, and more scraping. Once trained, the models still need data to process, to be refined with, and to be trained on (again).
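As a rough illustration of what “data-devouring” means in practice, training pipelines typically stream web-scraped corpora rather than downloading them whole. The sketch below assumes the Hugging Face `datasets` library, and the corpus is just one public example of web-scraped training data.

```python
# Minimal sketch: streaming a large web-scraped corpus for training,
# instead of materializing billions of records on disk first.
# Assumes the Hugging Face `datasets` library; the corpus is one
# public example of web-scraped training data.
from datasets import load_dataset

stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])  # each record is text scraped from one web page
    if i == 2:  # peek at a few records only
        break
```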
Since the launch of our marketplace, the share of AI applications in our user base has kept growing: from early-stage startups to established SaaS companies, requests for high-frequency, high-volume, billion-row pricing/product/URL data feeds have only increased.
The need for data is so intense that, had they not sourced it elsewhere, these teams would spend more time fixing scrapers than working on their AI. As one user told me, “We are an AI company, not a web scraping company!”
Conclusions
We have seen growth in both demand and supply, triggered by the advent of large language models as well as by the rising stakes of the digital economy.
But web scraping is often kept under cover (“The first rule of web scraping is: you do not talk about web scraping” is the opening line of the webscraping subreddit), and the industry is still living its wild-west moment, where everyone is out on their own - which makes it difficult to measure and compare.
A lot has been done on the software side, but too little on its fuel (data). Join our project if you want to make web scraping more efficient.
About the Project
Data Boutique is a community for sustainable, ethical, high-quality web data exchange. You can browse the current catalog and add a request if a website is not listed. Saving datasets to your interest list helps sellers correctly size demand and decide to onboard the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.