About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our platform to learn more and interact with us about this project.
Scrape or Buy? The Zalando Case Study
TL;DR
[Too Long, Didn’t Read]
Buying is the best choice, and the reason is mainly the labor cost linked to web scraping.
If you want the details, you can read further :)
Why Zalando
Data Boutique recently released the Zalando prices dataset. Let’s take the chance to run some maths on a real-world case study and see when it would make sense for a company to scrape or buy data.
Zalando is a very large website (1M+ products listed in each country): A goldmine of data for brands, retailers, investors (it’s listed in Germany under ZAL.DE), AI developers (that’s a huge real-life training set), and data scientists in general.
But it’s also a challenging site to scrape, as explained in the post we draw our scraping-cost figures from.
Assumptions and Deal-Breakers
Since many moving parts are at play, we need to set unbiased assumptions, comparing the most efficient in-house web scraping setup to date against the current market offer at Data Boutique, and to ask the deal-breaker questions that would make this entire discussion moot.
Scrape scenario assumptions:
In-house web scraping can be performed on existing infrastructure, and that infrastructure has already been paid off. No additional costs or capex for this;
Proxy / scraping-aid tool costs, regardless of which tools are used, are set as per the public exercise in the post mentioned above, and they are equal to 3 EUR for a full one-time scrape;
Hourly rates for web scraping professionals are set to an average of 30 EUR/hour. We know this is debatable, as it varies according to country, level of experience, and more.
Coding hours spent consider an average of 8 hours per month. This is also debatable: the same task can take one developer twice as long as another, depending on their expertise (which would be counterbalanced by the hourly rate). We estimated 8 hours of effort per month, on average, between first-time writing and code maintenance, including manual tests to ensure the scrape was complete as well as writing the consistency-test routines;
Coding hours are the same whether we run the scrape daily, weekly, or once. This might not always be true, since for daily scrapes the manual checks would increase, but as a general example we can use this assumption.
Since many of you might have different situations, we run a variance analysis on these figures to see the impact.
Buy scenario assumptions:
We will purchase the Zalando Germany prices dataset on sale at Data Boutique at today’s price.
Deal-breakers
Before we dive into the numbers, we need to ask some deal-breaking questions and check whether we have a real make-or-buy choice at all:
Utility: Does the ready-to-buy dataset contain all the information we need? If the answer is no... well, we don’t have a choice and must custom-build the scraper.
In this case, it’s a product-list-page price collection: product-level granularity, with product category, title, price, discount, and image.
Time: Do we have enough time to search for someone to scrape it, and time for them to write the code, test it, run it, send it over to us, find data-quality issues, iterate again, and hopefully deliver the result? If the answer is no… we don’t have a choice either, and we will have to buy.
Capacity: Do we have people with time to look into the scraping, or to hire someone and write the specs? If the answer is no, again, we will have to buy.
Cost Structure for the Scrape Scenario
Fixed Costs: Per our assumptions, all infrastructure is paid off, so we’re left with the monthly hours multiplied by the hourly rate.
8 hrs × 30 EUR/hour = 240 EUR/month
Variable Costs: Proxy / scraping-aid tool consumption: 3 EUR per run × runs per month (according to the real-life experiment in the post mentioned above)
3 × 1 = 3 EUR/month in case of a one-off data collection (or once per month, since we’re computing monthly costs)
3 × 4 = 12 EUR/month in case of a weekly collection (assuming no re-runs for bad quality need to be considered)
3 × 30 = 90 EUR/month in case of a daily collection
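If you want to replay this arithmetic with your own figures, here is a minimal sketch of the Scrape-scenario cost model in Python, assuming the numbers above (8 hours/month, 30 EUR/hour, 3 EUR per run); the function and variable names are purely illustrative.

```python
# Minimal sketch of the Scrape-scenario monthly cost, using the assumptions above.
HOURS_PER_MONTH = 8          # coding, maintenance, and manual checks (assumption)
HOURLY_RATE_EUR = 30         # average web-scraping professional rate (assumption)
PROXY_COST_PER_RUN_EUR = 3   # proxy / scraping-aid cost for one full scrape (assumption)

def scrape_monthly_cost(runs_per_month: int) -> float:
    """Fixed labor cost plus variable proxy cost for a given scrape frequency."""
    fixed = HOURS_PER_MONTH * HOURLY_RATE_EUR          # 240 EUR/month
    variable = PROXY_COST_PER_RUN_EUR * runs_per_month
    return fixed + variable

for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
    print(f"{label}: {scrape_monthly_cost(runs):.2f} EUR/month")
# one-off: 243.00, weekly: 252.00, daily: 330.00
```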
Cost Structure for the Buy Scenario
Fixed Costs: No fixed costs
Variable Costs: Pay-per-download price. As of today, 9.60 EUR per download
9.60 × 1 = 9.60 EUR/month for a one-off or once-per-month download
9.60 × 4 = 38.40 EUR/month for a weekly download (there are no failed downloads in this case, as the data is only delivered after passing quality checks)
9.60 × 30 = 288 EUR/month for a daily download
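To put the two scenarios side by side, the same kind of sketch can compute the Scrape/Buy ratio directly, again assuming the figures above (240 EUR/month of labor, 3 EUR per run, 9.60 EUR per download); names are illustrative.

```python
# Minimal side-by-side comparison of the Scrape and Buy scenarios (assumed figures).
FIXED_LABOR_EUR = 8 * 30           # 240 EUR/month of coding and maintenance
PROXY_PER_RUN_EUR = 3              # variable scraping cost per run
PRICE_PER_DOWNLOAD_EUR = 9.60      # today's pay-per-download price

for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
    scrape = FIXED_LABOR_EUR + PROXY_PER_RUN_EUR * runs
    buy = PRICE_PER_DOWNLOAD_EUR * runs
    print(f"{label}: scrape {scrape:.2f} EUR vs buy {buy:.2f} EUR "
          f"(scrape/buy = {scrape / buy:.1f}x)")
# one-off: 243.00 vs 9.60 (25.3x); weekly: 252.00 vs 38.40 (6.6x);
# daily: 330.00 vs 288.00 (1.1x)
```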
Results
As expected, running the web scraping internally shows two cost components (fixed and variable), while the Buy scenario only shows one.
Interestingly, with this cost profile, the main component for in-house scraping is labor cost.
Even with daily scrapes, where variable costs are more present, labor still accounts for more than 70% of the total (240 of the 330 EUR/month).
When we compare the two scenarios, we have a clear winner: Even in the most challenging conditions (daily scrapes), buying is more convenient. In a one-off scenario, Buy beats Scrape by more than 25X.
Variance Analysis
There are a lot of assumptions here, and we acknowledge that many things may vary, so let’s play around with the numbers.
The Buy scenario remains the same since it involves a given price.
The Scrape scenario could vary in its variable or fixed costs. Variable costs account for a limited part, and if your spend were way higher than 3 EUR per run (let’s say 100 EUR/run), we would recommend hiring the author of the post mentioned above rather than varying that figure in this exercise. The fixed costs are those that have more impact and are more discretionary, so let’s try halving them (VA1) and increasing them (VA2).
Variance Analysis 1
In-house scraping is more efficient and burns half the monthly coding hours (1 hour per week) at the same hourly rate of 30 EUR / hour. Or, if you prefer, half of the hourly rate (15 EUR / hour) at the same number of hours (8).
Scraping becomes cheaper for a daily collection, but as soon as we slow the frequency down, buying is still (economically speaking) the best option. In the one-off case, by more than 12X.
Variance Analysis 2
Now let’s play the other way around: we were wrong, it took the developer longer, or we needed to hire a more experienced one at a higher hourly rate. Let’s use an hourly rate of 60 EUR/hour instead of 30 EUR/hour.
Now doing a one-off scrape is 50X more expensive than buying.
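If your situation matches neither variance analysis exactly, the same formula can be swept over your own hours and rates; the sketch below just re-runs the comparison for the base case, VA1 (half the hours), and VA2 (double the rate).

```python
# Sweep the Scrape/Buy ratio across the three labor assumptions used above.
def scrape_cost(hours: float, rate: float, runs: int, proxy_per_run: float = 3) -> float:
    return hours * rate + proxy_per_run * runs

def buy_cost(runs: int, price_per_download: float = 9.60) -> float:
    return price_per_download * runs

scenarios = {"base": (8, 30), "VA1": (4, 30), "VA2": (8, 60)}  # (hours/month, EUR/hour)

for name, (hours, rate) in scenarios.items():
    for label, runs in [("one-off", 1), ("weekly", 4), ("daily", 30)]:
        ratio = scrape_cost(hours, rate, runs) / buy_cost(runs)
        print(f"{name} {label}: scrape/buy = {ratio:.1f}x")
# e.g. VA1 one-off ≈ 12.8x, VA1 daily ≈ 0.7x (scraping wins), VA2 one-off ≈ 50.3x
```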
Join the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list helps sellers correctly size demand and onboard the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.