Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our Discord channels to learn about and discuss this project.
E-commerce Price Data
E-commerce price collection is among the most popular use cases of web scraping. Brands, retailers, advisors, and investors research products, prices, and discounts to analyze the market.
But web scraping is time-consuming: There are just too many sites. It would be extremely challenging, if not impossible, to scrape all the interesting ones while keeping up with their changes and anti-bot systems over time.
The good news is we are not alone. Many are looking at the same websites, and many scrape the exact same ones. The efficient way to do this is to share the scraping results in a common platform. That is what Data Boutique does: Providing a platform to exchange data.
In e-commerce, we have different data structures since prices can vary depending on industry, distribution, and geography.
Let’s see what data can be exchanged.
Granular Data
Granular data are the most detailed. They represent the output of the scraping activity after being cleaned and structured, with no aggregation.
These data have the same number of rows as the items collected. If a website lists 100,000 products on a specific date, the file will have 100,000 rows.
The fields available in granular data depend on where on the website the data are captured.
Prices on Product-List Page (PLP)
Granular data on prices collected from the product-list page (PLP) are very common. They can cover an entire website, even the largest, and offer an enormous cost advantage compared to the alternative of doing the scraping internally.
If the information listed on the product-list page (PLP) is sufficient, we recommend using this format.
The extraction cost of the PLP is significantly lower than that of the product-detail page (PDP). Especially for large multi-brand websites, it will be easier to find a seller providing this dataset.
Note: If you need product-detail information that is not in the PLP, but only for a limited number of products (e.g., only for a specific brand), it is still convenient to buy a PLP dataset and scrape the last mile internally when needed.
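As a rough illustration of the last-mile approach, the sketch below filters a purchased PLP dataset down to the product URLs of one brand, which could then be scraped internally for detail-page fields. The field names (brand, product_url, price), the example URLs, and the helper last_mile_urls are assumptions made for this example, not the actual Data Boutique schema.

```python
# Hypothetical PLP rows as they might appear in a purchased dataset.
# Field names ("brand", "product_url", "price") are assumptions.
plp_rows = [
    {"brand": "Nike", "product_url": "https://example.com/p/1", "price": 80.0},
    {"brand": "Adidas", "product_url": "https://example.com/p/2", "price": 90.0},
    {"brand": "Nike", "product_url": "https://example.com/p/3", "price": 120.0},
    {"brand": "nike", "product_url": "https://example.com/p/1", "price": 80.0},
]


def last_mile_urls(rows, brand):
    """Return de-duplicated product URLs for one brand, preserving order."""
    seen = set()
    urls = []
    for row in rows:
        url = row["product_url"]
        # Case-insensitive brand match; skip URLs already collected.
        if row["brand"].lower() == brand.lower() and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Only these pages would need an internal PDP scrape.
urls_to_scrape = last_mile_urls(plp_rows, "Nike")
```

Only the selected URLs need to be visited internally, so the detail-page scrape stays small even when the PLP dataset covers the whole website.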
Prices on Product-Detail Page (PDP)
When product detail information is crucial, this dataset provides additional fields on product description, including product features and other properties, that will help analyze the data.
PDP scrapes of large websites (e.g., Amazon) will be rare, given the high cost involved. Requesting this level of detail for a restricted subset (e.g., Nike products on Amazon) will be easier.
Prices on Product-Detail Page (PDP) with Variants
For some product categories, the price varies by size or other features. Fragrances are an example: the size of the bottle is chosen only on the product-detail page (PDP). For these cases we have added the PDP with Variants, where data collection iterates over product variants to capture the different prices.
In certain industries, like cosmetics, this is unavoidable.
Note: If the website lists variants as separate products, a PLP or a PDP dataset may be enough, without needing a PDP with Variants, but only if the GTIN code of the product is visible in the PLP or the PDP, respectively.
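To make the difference concrete, here is a minimal sketch of how a single product-detail page with several sizes could become several rows, one per variant. The payload shape, the field names, and the explode_variants helper are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass


@dataclass
class VariantRow:
    product_id: str
    name: str
    variant: str
    price: float


# Hypothetical result of scraping one fragrance PDP (assumed structure).
product = {
    "product_id": "frag-001",
    "name": "Example Eau de Parfum",
    "variants": [
        {"size": "30 ml", "price": 45.0},
        {"size": "50 ml", "price": 65.0},
        {"size": "100 ml", "price": 95.0},
    ],
}


def explode_variants(product):
    """Turn one scraped product into one row per variant."""
    return [
        VariantRow(
            product_id=product["product_id"],
            name=product["name"],
            variant=variant["size"],
            price=variant["price"],
        )
        for variant in product["variants"]
    ]


variant_rows = explode_variants(product)
```

A plain PDP dataset would hold one row for this product, while the PDP with Variants holds three, each with its own price.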
Derived Data
Unlike granular data, derived data already embed a form of transformation in the file. This makes life easier if you only need an aggregate figure (like the total number of products by Nike on Amazon vs. Adidas). A file with a few hundred rows is much easier to work with than one with hundreds of thousands or millions of rows, which can make an Excel spreadsheet extremely slow or even crash it.
Purchasing derived data is very useful and time-saving for most market analysis use cases.
Simple Aggregates
Simple aggregation of products by brand or category. These files provide fast access to product count, average and quartile distribution of prices, and average and quartile distribution of markdowns.
For example, with a Farfetch dataset, a full PLP file would have half a million rows, while a brand aggregate would have only about 5,000. That’s two orders of magnitude smaller. This might be a good alternative if you only plan to run an aggregated analysis.
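As an illustration of what a brand aggregate contains, the sketch below collapses granular rows into one record per brand with product count, average price, price quartiles, and average markdown. The sample rows, the field names, and the definition of markdown as 1 - price / full_price are assumptions for this example; actual datasets may define these fields differently.

```python
from collections import defaultdict
from statistics import mean, quantiles

# Hypothetical granular PLP rows: one row per product listed on the site.
rows = [
    {"brand": "Nike", "full_price": 100.0, "price": 80.0},
    {"brand": "Nike", "full_price": 120.0, "price": 120.0},
    {"brand": "Nike", "full_price": 200.0, "price": 150.0},
    {"brand": "Adidas", "full_price": 90.0, "price": 90.0},
    {"brand": "Adidas", "full_price": 110.0, "price": 55.0},
]


def aggregate_by_brand(rows):
    """Collapse granular rows into one summary record per brand."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["brand"]].append(row)

    summary = {}
    for brand, items in grouped.items():
        prices = [r["price"] for r in items]
        # Assumed markdown definition: share of the full price discounted.
        markdowns = [1 - r["price"] / r["full_price"] for r in items]
        # Quartiles are only computed when a brand has at least two products.
        if len(prices) > 1:
            price_quartiles = quantiles(prices, n=4, method="inclusive")
        else:
            price_quartiles = [prices[0]] * 3
        summary[brand] = {
            "product_count": len(items),
            "avg_price": mean(prices),
            "price_quartiles": price_quartiles,
            "avg_markdown": mean(markdowns),
        }
    return summary


brand_summary = aggregate_by_brand(rows)
```

The output has one entry per brand regardless of how many products the input holds, which is why the aggregate file is so much smaller than the granular one.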
Time Series Aggregates
Time Series Aggregates add more value than simple aggregates: A single file already contains the changes between two or more points in time, such as the growth of product count over a year, quarter, or month, the average price change, or the difference in discount. Depending on the specific dataset, these can be available as like-for-like or total metrics.
These are powerful datasets because some computations, like inflation calculations, are not easy to perform in a standard Excel spreadsheet and require data analytics skills.
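The distinction between total and like-for-like metrics can be sketched as follows. This example assumes that "like-for-like" means comparing only the products present in both snapshots, while "total" compares the full assortment at each date; the snapshots, product ids, and the price_change helper are hypothetical.

```python
from statistics import mean

# Hypothetical snapshots: product id -> price, captured at two dates.
snapshot_start = {"sku-1": 100.0, "sku-2": 50.0, "sku-3": 30.0}
snapshot_end = {"sku-1": 110.0, "sku-2": 55.0, "sku-4": 200.0}


def price_change(start, end):
    """Compare two snapshots on total and like-for-like bases."""
    # Total: average price of everything listed at each date.
    total = mean(end.values()) / mean(start.values()) - 1

    # Like-for-like (assumed definition): only products in both snapshots.
    common = sorted(start.keys() & end.keys())
    if not common:
        raise ValueError("no products in common between the two snapshots")
    like_for_like = (
        mean(end[k] for k in common) / mean(start[k] for k in common) - 1
    )

    # Growth of the number of listed products between the two dates.
    count_growth = len(end) / len(start) - 1

    return {
        "total_price_change": total,
        "like_for_like_price_change": like_for_like,
        "product_count_growth": count_growth,
    }


changes = price_change(snapshot_start, snapshot_end)
```

In this toy data the total average price roughly doubles because an expensive product replaced a cheap one, while the like-for-like change is about 10%. The gap between the two figures shows why a change in assortment can be mistaken for a change in prices.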
Cross-Dimensional Aggregates
These are among the most advanced datasets: They allow comparison between different dimensions, like geography, websites, or merchants within a marketplace.
Since these require a dedicated post, we will discuss them separately.
Join the Project
Data Boutique is in pre-order mode: You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list helps sellers correctly size the demand for datasets and onboard onto the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.