About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our platform to learn more and interact with us about this project.
Scrape or Buy? The Zalando Case Study
TL;DR
[Too Long, Didn’t Read]
Buying is the best choice, and the reason is mainly the labor cost linked to web scraping.
If you want the details, you can read further :)
Why Zalando
Data Boutique recently released the Zalando prices dataset. Let’s take the chance to run some maths on a real-world case study and see when it would make sense for a company to scrape or buy data.
Zalando is a very large website (1M+ products listed in each country): A goldmine of data for brands, retailers, investors (it’s listed in Germany under ZAL.DE), AI developers (that’s a huge real-life training set), and data scientists in general.
But it’s also a challenging site to scrape, as explained in the post we draw our scraping-cost figures from.
Assumptions and Deal-Breakers
Since many moving parts are at play, we need to set unbiased assumptions, comparing the most efficient in-house web scraping setup to date against the current market offer at Data Boutique, and to ask the deal-breaker questions that would make this entire discussion moot.
Scrape scenario assumptions:
In-house web scraping can be performed on existing infrastructure, and that infrastructure has already been paid off. No additional costs or capex for this;
Proxy / scraping-aid tool costs, regardless of which tools are used, are set as per the public exercise in the post mentioned above, and they are equal to 3 EUR for a full one-time scrape;
Hourly rates for web scraping professionals are set to an average of 30 EUR/hour. We know this is debatable, as it varies according to country, level of experience, and more.
Coding hours spent consider an average of 8 hours per month. This is also debatable: the same task can take one developer twice as long as another, depending on their expertise (which would be counterbalanced by the hourly rate). We estimated 8 hours of effort per month, on average, between first-time writing and code maintenance, including manual tests to ensure the scrape was complete as well as writing the consistency-test routines;
Coding hours are the same whether we run the scrape daily, weekly, or once. This might not always be true, since for daily scrapes the manual checks would increase, but as a general example we can use this assumption.
Since many of you might have different situations, we run a variance analysis on these figures to see the impact.
Buy scenario assumptions:
We will purchase the Zalando Germany prices dataset on sale at Data Boutique at today’s price.
Deal-breakers
Before we dive into the numbers, we need to ask some deal-breaking questions and check whether we have a real make-or-buy choice at all:
Utility: Does the ready-to-buy dataset contain all the information we need? If the answer is no... well, we don’t have a choice and must custom-build the scraper.
In this case, it’s a product-list-page price collection: product-level granularity, with product category, title, price, discount, and image.
Time: Do we have enough time to search for someone to scrape it, and time for them to write the code, test it, run it, send it over to us, find data-quality issues, iterate again, and hopefully deliver the result? If the answer is no… we don’t have a choice either, and we will have to buy.
Capacity: Do we have people with time to look into the scraping, or to hire someone and write the specs? If the answer is no, again, we will have to buy.
Cost Structure for the Scrape Scenario
Fixed Costs: Per our assumptions, all infrastructure is paid off, so we’re left with the monthly hours multiplied by the hourly rate.
8 hrs × 30 EUR/hour = 240 EUR/month
Variable Costs: Proxy / scraping-aid tool consumption: 3 EUR per run × runs per month (according to the real-life experiment in the post mentioned above)
3 × 1 = 3 EUR/month in case of a one-off data collection (or once per month, since we’re computing monthly costs)
3 × 4 = 12 EUR/month in case of a weekly collection (assuming no re-runs for bad quality need to be considered)
3 × 30 = 90 EUR/month in case of a daily collection
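If you want to replay this arithmetic with your own figures, here is a minimal sketch of the Scrape-scenario cost model in Python, assuming the numbers above (8 hours/month, 30 EUR/hour, 3 EUR per run); the function and variable names are purely illustrative.

```python
# Minimal sketch of the Scrape-scenario monthly cost, using the assumptions above.
HOURS_PER_MONTH = 8          # coding, maintenance, and manual checks (assumption)
HOURLY_RATE_EUR = 30         # average web-scraping professional rate (assumption)
PROXY_COST_PER_RUN_EUR = 3   # proxy / scraping-aid cost for one full scrape (assumption)

def scrape_monthly_cost(runs_per_month: int) -> float:
    """Fixed labor cost plus variable proxy cost for a given scrape frequency."""
    fixed = HOURS_PER_MONTH * HOURLY_RATE_EUR          # 240 EUR/month
    variable = PROXY_COST_PER_RUN_EUR * runs_per_month
    return fixed + variable

for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
    print(f"{label}: {scrape_monthly_cost(runs):.2f} EUR/month")
# one-off: 243.00, weekly: 252.00, daily: 330.00
```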
Cost Structure for the Buy Scenario
Fixed Costs: No fixed costs
Variable Costs: Pay-per-download price. As of today, 9.60 EUR per download
9.60 × 1 = 9.60 EUR/month for a one-off or once-per-month download
9.60 × 4 = 38.40 EUR/month for a weekly download (there are no failed downloads in this case, as the data is only delivered after passing quality checks)
9.60 × 30 = 288 EUR/month for a daily download
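To put the two scenarios side by side, the same kind of sketch can compute the Scrape/Buy ratio directly, again assuming the figures above (240 EUR/month of labor, 3 EUR per run, 9.60 EUR per download); names are illustrative.

```python
# Minimal side-by-side comparison of the Scrape and Buy scenarios (assumed figures).
FIXED_LABOR_EUR = 8 * 30           # 240 EUR/month of coding and maintenance
PROXY_PER_RUN_EUR = 3              # variable scraping cost per run
PRICE_PER_DOWNLOAD_EUR = 9.60      # today's pay-per-download price

for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
    scrape = FIXED_LABOR_EUR + PROXY_PER_RUN_EUR * runs
    buy = PRICE_PER_DOWNLOAD_EUR * runs
    print(f"{label}: scrape {scrape:.2f} EUR vs buy {buy:.2f} EUR "
          f"(scrape/buy = {scrape / buy:.1f}x)")
# one-off: 243.00 vs 9.60 (25.3x); weekly: 252.00 vs 38.40 (6.6x);
# daily: 330.00 vs 288.00 (1.1x)
```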
Results
As expected, running the web scraping internally shows two cost components (fixed and variable), while the Buy scenario only shows one.
Interestingly, with this cost profile, the main component for in-house scraping is labor cost.
Even with daily scrapes, where variable costs are more present, labor still accounts for more than 70% of the total (240 of the 330 EUR/month).
When we compare the two scenarios, we have a clear winner: Even in the most challenging conditions (daily scrapes), buying is more convenient. In a one-off scenario, Buy beats Scrape by more than 25X.
Variance Analysis
There are a lot of assumptions here, and we acknowledge that many things may vary, so let’s play around with the numbers.
The Buy scenario remains the same since it involves a given price.
The Scrape scenario could vary in its variable or fixed costs. Variable costs account for a limited part, and if your spend were way higher than 3 EUR per run (let’s say 100 EUR/run), we would recommend hiring the author of the post mentioned above rather than varying that figure in this exercise. The fixed costs are those that have more impact and are more discretionary, so let’s try halving them (VA1) and increasing them (VA2).
Variance Analysis 1
In-house scraping is more efficient and burns half the monthly coding hours (1 hour per week) at the same hourly rate of 30 EUR / hour. Or, if you prefer, half of the hourly rate (15 EUR / hour) at the same number of hours (8).
Scraping becomes cheaper for a daily collection, but as soon as we slow the frequency down, buying is still (economically speaking) the best option. In the one-off case, by more than 12X.
Variance Analysis 2
Now let’s play the other way around: we were wrong, it took the developer longer, or we needed to hire a more experienced one at a higher hourly rate. Let’s use an hourly rate of 60 EUR/hour instead of 30 EUR/hour.
Now doing a one-off scrape is 50X more expensive than buying.
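If your situation matches neither variance analysis exactly, the same formula can be swept over your own hours and rates; the sketch below just re-runs the comparison for the base case, VA1 (half the hours), and VA2 (double the rate).

```python
# Sweep the Scrape/Buy ratio across the three labor assumptions used above.
def scrape_cost(hours: float, rate: float, runs: int, proxy_per_run: float = 3) -> float:
    return hours * rate + proxy_per_run * runs

def buy_cost(runs: int, price_per_download: float = 9.60) -> float:
    return price_per_download * runs

scenarios = {"base": (8, 30), "VA1": (4, 30), "VA2": (8, 60)}  # (hours/month, EUR/hour)

for name, (hours, rate) in scenarios.items():
    for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
        ratio = scrape_cost(hours, rate, runs) / buy_cost(runs)
        print(f"{name} {label}: scrape/buy = {ratio:.1f}x")
# e.g. VA1 one-off ≈ 12.8x, VA1 daily ≈ 0.7x (scraping wins), VA2 one-off ≈ 50.3x
```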
Join the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list helps sellers correctly size demand and onboard the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.