5 Reasons You Should Stop Web Scraping
An approach to web data as a service
About Data Boutique
Web data is an enabler, not the core.
Recent decades have seen the creation of countless applications that use web data, from generative AI to market intelligence, from search to dynamic pricing. Data is the new oil (they said 10+ years ago).
But let’s not be confused: Web data is the enabler for these apps; it is not the apps themselves.
Data is an enabler, just like servers are. Yet nobody owns servers anymore: We access them “as a service”.
Data is an enabler, just like business software is, from CRM to analytics tools. 20 years ago, companies would write their own. Today, accessing them “as a service” is the best practice.
Web data is an enabler. Once, it made sense to collect it internally. Now, web scraping has become a resource-eating, risk-exposing activity. The challenges a scraping developer faces, from keeping up with constantly changing website code to circumventing anti-bot software, have significantly raised the bar.
5 reasons to ditch web scraping and start buying it as a service (DaaS)
Should a company switch from in-house scraping for public web data to Data as a Service (DaaS)? Let’s see when it makes sense.
I use the term “public web data” as it implies that it is accessible and collectible by anyone with the proper tools, and it is not something exclusive to the company that will be using it.
1. Cost
Reason number one for switching to DaaS: Cold, hard cash.
Think of all the cost components of in-house web scraping - people involved, proxy providers, hardware, tools, or the annual fee from data farms when outsourced.
These costs are:
Very inelastic - once engaged, they barely drop when you scale down the frequency of the scraping
Hard to benchmark - there is no reference price for this, so you can’t tell if you’re paying too much
By switching to DaaS, these costs get SUPER elastic, and you know you are paying fair market prices - the two things combined often end in a 100X cost reduction for low-frequency data refreshes.
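To make the elasticity argument concrete, here is a minimal sketch of the arithmetic. Every number in it is a made-up placeholder, not a real price from Data Boutique or any other provider, and the names (`InHouseCosts`, `daas_yearly_total`) are hypothetical. It assumes in-house scraping is mostly a fixed yearly cost, while DaaS is paid per dataset purchased.

```python
from dataclasses import dataclass


@dataclass
class InHouseCosts:
    """Hypothetical in-house scraping budget: mostly fixed, barely elastic."""
    fixed_per_year: float        # people, hardware, tools, proxy contracts (placeholder)
    variable_per_refresh: float  # marginal cost of one refresh of one website (placeholder)

    def yearly_total(self, websites: int, refreshes_per_year: int) -> float:
        # The fixed part is paid no matter how rarely the data is refreshed.
        return self.fixed_per_year + self.variable_per_refresh * websites * refreshes_per_year


def daas_yearly_total(price_per_dataset: float, websites: int, refreshes_per_year: int) -> float:
    """Pay-per-dataset pricing (an assumption): cost scales with what you buy."""
    return price_per_dataset * websites * refreshes_per_year


# Placeholder figures, chosen only to show the shape of the comparison.
in_house = InHouseCosts(fixed_per_year=60_000.0, variable_per_refresh=2.0)
PRICE_PER_DATASET = 50.0
WEBSITES = 10

for label, refreshes in [("quarterly", 4), ("monthly", 12), ("weekly", 52)]:
    internal = in_house.yearly_total(WEBSITES, refreshes)
    external = daas_yearly_total(PRICE_PER_DATASET, WEBSITES, refreshes)
    print(f"{label:>9}: in-house {internal:>9.0f} | DaaS {external:>9.0f} | ratio {internal / external:.1f}x")
```

With these placeholder inputs, the gap is widest at low refresh frequencies, because the fixed in-house cost is spread over very few datasets. Plug in your own figures to see where your break-even sits.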
2. Scalability
How fast can an in-house team add 10 websites? And how fast can they stop those websites and start doing 10 others instead?
This is maybe the largest advantage of DaaS.
You can scale in frequency: Going from monthly data collection to weekly or daily can be handled on the fly and changed (scaled up or scaled down) anytime, as many times as necessary. There are no commitments.
You can scale in scope: If we want to add (or remove) 10 websites similar to the one we are collecting already, this also can be handled on the fly, with no technical knowledge or execution delay.
As a consequence, you can build a PoC (Proof of Concept) very fast and at very little cost, and then scale it up to production right when needed.
3. Talent allocation
Spending time figuring out what line of code broke in your scraper might not be the best use of your talent’s time. Why? Because there are hundreds of other talents out there doing the same thing, for that exact same website.
It would be much more efficient to spend that time on how to use this data: how to transform it, structure it, change domains, convert, and look up values. These activities were always there, but they were understaffed.
If you free your talent’s time from activities that can be bought on the market, you get all hands on what’s really critical and differentiating.
4. Compliance
Keeping web scraping internal? Be prepared to go through due diligence, maintain policies, adhere to regulations, and disclose procedures and logs. For those selling data to hedge funds and other regulated entities, this may not come as a surprise, but copyright and privacy lawsuits over web scraping activities are becoming frequent for AI and SaaS companies too.
Again: Procedures, logs, disclosure. Is this the best way to allocate your talent’s time?
Or, if you think this is not core to you, buy from a platform instead and have this sorted.
5. Strategic Positioning
All this finally leads to the core question: Is your company a SaaS/AI company, or is it a web scraping company?
Aligning all activities with your strategic positioning helps you focus on what’s differentiating about your company. Just as it is not core to hold and maintain servers, or to develop and maintain CRM software, maybe web scraping is not core either when the data can be found on the market.
Is web scraping core instead? Then monetize it.
This industry has a lot of talented players. Many of them, having worked for years in web scraping, are just not yet ready to jump to DaaS. I understand that.
Are you more confident in your own data? Is your cost base competitive with what you find on Data Boutique? Is your data acquisition pipeline so strong you’d trust it more than what you’d find on a marketplace?
Fantastic. You should then monetize this capability, and sell on Data Boutique. It’s a win-win.
About the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list will allow sellers to correctly size the demand for datasets and join the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.