Using Historical Data: Basic Knowledge
Elements to consider when using historical data from web scraping
The Context
DataBoutique.com recently enabled historical data access, a long-awaited feature that unlocks value for both data buyers and sellers.
Simply put, it allows one-click access to past data, with different options in terms of cost and granularity, and, for the benefit of many, it keeps the same data schemas adopted for current data collections.
When to use historical data
See the past (to predict the future)
The most common questions a historical dataset can answer are “How were things at a specific point in time?” and “When did this start?”, ideally capturing trends that help us forecast the future more clearly. “Is the price of a product growing?” “Is the number of hotels on that reservation website rising?”
A variation on this is to simply check what happened on a particular date or who did something first. “Was this product cheaper on that day?” “Which website started the discounts first?” “Was this retailer selling this product before that date as well?” And so on.
Tools like the Wayback Machine are inadequate for properly time-tracking fast-moving websites such as e-commerce or reservation platforms, so web-scraped historical data is the only way to answer these questions.
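As an illustration, here is a minimal pandas sketch of how these point-in-time questions translate into queries once the snapshots are in hand. The file name and the schema (snapshot_date, sku, price) are hypothetical, not Data Boutique’s actual data schema.

```python
import pandas as pd

# Hypothetical schema: one row per product per snapshot date. The file
# name and columns (snapshot_date, sku, price) are illustrative only.
df = pd.read_csv("retailer_history.csv", parse_dates=["snapshot_date"])

# "Was this product cheaper on that day?"
sku = "ABC-123"
then = df[(df.sku == sku) & (df.snapshot_date == "2023-06-30")]["price"].iloc[0]
now = df[(df.sku == sku) & (df.snapshot_date == "2024-06-30")]["price"].iloc[0]
print(f"{sku}: {then} then vs. {now} now")

# "Was this retailer selling this product before that date?"
print("first seen:", df[df.sku == sku]["snapshot_date"].min().date())
```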
Back-test hypotheses
We can take it one step further and look at the past to test hypotheses. “Is a heavy-discount practice on an e-commerce site a leading indicator of poor website performance?” “Was the competing retailer running discounts when my website suffered a decline in sales?” “Are those two brands syncing their price-change strategies?”
This approach is more complex than just “looking at the past”: it often involves searching for a statistical correlation between two phenomena, and it requires a more granular dataset and longer timeframes for testing.
Longer timeframes imply that either you or someone in your organization has had enough foresight to start scraping that website two years before you needed it, or you find someone who is already (consistently and continuously) doing it (that’s what Data Boutique is for).
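A minimal back-testing sketch along these lines, with random data standing in for real scraped snapshots and sales records (every name and number here is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly series: competitor discounts would come from scraped
# snapshots, own sales from internal records. Random data keeps this runnable.
rng = np.random.default_rng(0)
weeks = pd.date_range("2023-01-08", periods=52, freq="W")
competitor_discount = pd.Series(rng.uniform(0, 40, 52), index=weeks)
own_sales = pd.Series(rng.normal(1000, 100, 52), index=weeks)

# "Was the competitor running discounts when my sales declined?"
# Correlate this week's competitor discount with next week's sales change.
next_week_sales_change = own_sales.pct_change().shift(-1)
print(f"lagged correlation: {competitor_discount.corr(next_week_sales_change):.2f}")
```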
Train A.I. (and other fun stuff)
An even more advanced use case for historical data is training automations: once you have seen the past and tested your hypotheses, you can train algorithms, AI, or other decision-making processes to act when specific conditions occur. Stop discounting as soon as the competitors stop, dynamically adjust prices, automatically buy an item when price conditions are met, or raise a red flag when a distributor does something they are not supposed to.
Whatever you want to build, test it on the past, or (recommended) train AI to do that for you.
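As a toy example of the “red flag” case, here is a sketch of a hand-coded rule that flags a distributor selling below a minimum advertised price (MAP). The thresholds, schema, and SKUs are hypothetical, and a real system might learn such rules from historical snapshots rather than hard-code them.

```python
import pandas as pd

# Hypothetical minimum advertised prices (MAP) per SKU.
MAP = {"ABC-123": 99.0, "XYZ-789": 49.0}

def map_violations(snapshot: pd.DataFrame) -> pd.DataFrame:
    """Return rows of today's scrape priced below their MAP threshold."""
    # Unknown SKUs map to NaN, which compares as False and is never flagged.
    below = snapshot["sku"].map(MAP).gt(snapshot["price"])
    return snapshot[below]

snapshot = pd.DataFrame({"sku": ["ABC-123", "XYZ-789"], "price": [89.0, 52.0]})
print(map_violations(snapshot))  # flags ABC-123, priced below its 99.0 MAP
```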
Things to consider
Here’s a list of essential elements to remember when approaching historical data. The topic is more complex than this, but let’s start with the building blocks:
History length
Arguably, the most basic element to consider is when the collection started. With traditional datasets (financial transactions, stock prices, air temperature, etc.), we are used to seeing very long time series dating back decades.
In web scraping, the situation is different, to say the least. With few notable exceptions (data providers targeting a single website for years), it is not uncommon to find datasets with less than a year, or just a few months, of history. This is the nature of web scraping: it costs money to keep collecting data from a website, and there are simply too many websites to choose from. While scraping today’s website content can be considered a commodity (the result doesn’t really depend on who does it, as long as quality standards are met), historical data is a differentiating factor.
Frequency
Historical data can be offered at different levels of granularity. One scan per month can be enough in some cases, while others require higher frequencies (weekly or daily).
The finer the granularity, the more information (and noise) can be found, but the dataset is also heavier to manage. On Data Boutique, we offer three levels (the sketch after this list shows how they relate):
Monthly: One snapshot per month (12 a year). Being the cheapest, it is ideal for simple research, long-term trend analysis, and initial exploration before moving to finer frequencies.
Weekly: One snapshot per week (52 a year), approximately four times larger than the monthly dataset. It’s quite bulky but perfectly suitable for most back-testing and training purposes. Given its higher cost, we recommend a cheaper run on the monthly dataset first, as an effective way to assess its potential.
Daily: The most granular historical dataset we allow on the platform. This is for heavy-duty usage (like some revenue-estimating projects).
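To see how the three levels relate, here is a sketch that derives weekly and monthly series from a synthetic daily one by taking the last observation of each period, mirroring the date-picking method described below. The data is random and purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily price series; a real one would come from daily scrapes.
rng = np.random.default_rng(1)
days = pd.date_range("2024-01-01", "2024-12-31", freq="D")
daily = pd.Series(100 + rng.normal(0, 1, len(days)).cumsum(), index=days)

# Last observation of each week/month, echoing the "last day of the period"
# convention. ("ME" is the month-end alias in recent pandas; older use "M".)
weekly = daily.resample("W").last()
monthly = daily.resample("ME").last()
print(len(daily), len(weekly), len(monthly))  # 366, 53, 12
```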
Quality Factors
Which quality elements impact a time series? Here are a few:
Completeness: Are data points missing? Are there significant gaps in the collection? A continuous collection is preferable, but as experience tells us, web scraping has quite a bumpy pipeline (we all hope AI-aided scrapers will fix that). Gaps in the collection, unfortunately, do happen. Data Boutique provides a completeness indicator designed precisely for this (a minimal version of such a check is sketched after this list).
Point-in-time and gap-filling: Gap-filling is the technique of “filling the gaps” in a history by interpolating between two data points. The opposite (leaving the gap as it is and not altering the content afterward) is point-in-time data. Data Boutique is committed to delivering data as close to the original format as possible, i.e., point-in-time data. This ensures that buying historical data yields the same result as buying the data as it was originally published.
Date-picking method: When creating a monthly or weekly time series, data providers often pick one specific collection date (e.g., the first or last day of the month/week). On Data Boutique, these dates are uniform: the last day of the month/week (or the closest available).
Quality Assurance: Quality procedures (domain, completeness, consistency, and ground-truth checks) on Data Boutique are the same as those applied to current data. Again, this provides a uniform service whether buyers get historical data or receive the data over time as it gets published.
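A minimal sketch of the completeness check mentioned above, together with the gap-filling technique it contrasts with. The dates, the month-end convention, and the completeness formula are illustrative assumptions, not Data Boutique’s actual indicator.

```python
import pandas as pd

# Snapshot dates actually received vs. the expected monthly schedule.
received = pd.to_datetime(["2024-01-31", "2024-02-29", "2024-04-30", "2024-05-31"])
expected = pd.date_range("2024-01-31", "2024-05-31", freq="ME")  # "M" in older pandas

missing = expected.difference(received)
print("missing snapshots:", list(missing.date))                 # [2024-03-31]
print(f"completeness: {1 - len(missing) / len(expected):.0%}")  # 80%

# Gap-filling would interpolate the missing point; point-in-time data
# leaves the gap visible instead.
prices = pd.Series([10.0, 11.0, None, 12.0], index=expected[:4])
print(prices.interpolate())  # the missing March value becomes 11.5
```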
Final remarks
Using historical data can be enormously powerful, but it carries a heavier workload. The elements listed here are just some of those to consider.
As always, we encourage the community of developers and data providers to join the conversation on our channels (we have a friendly Discord server). We are happy that some data sellers have activated this option (historical data is an opt-in feature) and that some users have already experimented with historical data purchases.
Historical data has also been included in data bundles (our cost estimation tool for large data packages), and we’ll soon add more features to play with.
That was all for this edition.
Thanks for reading,
Andrea
About Data Boutique
Data Boutique is the data marketplace for web scraping. We make buying and selling data faster and safer for everyone.