From Zero to a Fully Automated Web Data Pipeline
Step-by-step guide
About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Zero to One
Data Boutique is about lowering barriers to data.
Our previous posts discussed the advantages of using it to start a project, build a Proof of Concept (POC), or get to an MVP.
Today we show how to get from Zero (one-time data access) to One (automating it to a continuous data flow for your application).
The latter part of this mail is a bit technical. You can skip it if it’s not your area of expertise. Just remember you can access data automatically if you are building large-scale data projects.
Zero: Your First Data Access
Data Boutique’s website allows users fast and cost-efficient access to data:
Free samples: each LIVE dataset provides a link to 10 records of the actual file that users can buy (it’s really ten records of that data, not random stuff that just looks like it);
Low marginal costs: Each file - a full collection of items from a website - is billed at a fraction of what it would cost to collect it yourself. This keeps first-time data access cheap and makes costs easy to estimate if you need regular updates (weekly or daily refreshes of the same data);
Instant access: The data is available as soon as it is purchased. You can download it to your computer and start using it right away.
With simple steps, your data is ready to use.
If this single piece of data fits your need, you have no further obligation. We are happy to have backed your one-time project.
But if you are building a Data Lake, a Data Warehouse, a Business Intelligence solution, or an App, you will need to have the same data refreshed regularly.
Here’s what you need to do to achieve it.
One: Automating the process
To download the same data more than once, let’s say on a daily or weekly frequency, there are two parts of the process you need to automate:
Generating files automatically: Instead of visiting the Data Boutique website every week and purchasing a file each time, you can tell Data Boutique to automatically deliver a new file at the desired frequency;
Copying those files to where you need them: your server, Google Drive, BigQuery, Amazon AWS S3, MS Azure, or wherever suits you best.
1. Generating files automatically
This is one of the most powerful features of Data Boutique: You have the freedom to start, stop and change the frequency of the file delivery in total autonomy.
You can start with a slow frequency - let’s say monthly - while you’re not yet sure of your needs or have budget constraints, and then switch to weekly or even daily when the road ahead is clearer.
You can stop and change your mind at any time, reframe, and set a different refresh rate.
To automate file production, go to the BUY DATA > My Files menu, locate the data you want to automate, and click either UPDATE or REFRESH (UPDATE is for data that was purchased today, REFRESH is for older orders).
The same button allows you to change the frequency, or stop it, in case you change your mind. You will be billed only for the files that are delivered.
A pop-up menu is shown: You can change frequency, set an expiration date (“valid until” date), or turn the order off by changing it from “Active” to “Complete”.
The “valid until” date is useful because it will make the order automatically turn off when that date is reached, so you don’t have to remember to go to the website and turn it off.
2. Copy those files where you need them
When you have a recurring file delivery, you can either download each file manually (BUY DATA > My Files menu > Show Files, then download), or you can automate all this and avoid visiting the Data Boutique website altogether.
We cover four methods to achieve this: the AWS CLI, the AWS S3 API, Make.com, and Zapier. Pick whichever works best for you.
Getting your AWS Credentials
Data Boutique files are on AWS S3. Each user is assigned their own AWS credentials.
Anyone holding these credentials can access your files, so don’t share them if you don’t want others to use your data.
Locating the credentials:
Go to BUY DATA > My Files and scroll to the bottom of the page.
There you can see your AWS Access Key. Click “Show Secret Key” to reveal the AWS Secret Key, which you’ll need to connect to S3.
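Since the secret key should stay private, one option is to keep both keys out of your scripts and read them from environment variables. The sketch below is only an illustration: it assumes the keys are stored under the standard AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and the helper names `load_credentials` and `buyer_prefix` are hypothetical. The `buyers/<access key>/` folder layout is taken from the listing command used in this guide.

```python
import os

BUCKET = "databoutique.com"  # bucket name given in this guide


def load_credentials(env=None):
    """Return (access_key, secret_key) read from environment variables.

    Raises a clear error instead of continuing with missing keys.
    """
    env = os.environ if env is None else env
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        name = missing.args[0]
        raise RuntimeError(f"Missing environment variable: {name}") from None


def buyer_prefix(access_key):
    """Build the per-user folder prefix inside the bucket: buyers/<access key>/"""
    return f"buyers/{access_key}/"
```

Calling `load_credentials()` with no argument reads the real environment; passing a dictionary is handy for trying it out without touching your settings.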
AWS CLI command line
AWS CLI is a good fit if you are looking for a command-line solution.
How to get started:
Locate your Data Boutique AWS S3 credentials (see above)
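Download and install the AWS CLI from the official AWS website
Configure AWS CLI with this command (enter your AWS Access Key and AWS Secret Key when prompted, and eu-central-1 as the default region):
aws configure
List your files:
aws s3 ls --recursive s3://databoutique.com/buyers/YourAWSAccessKey/
Copy a file to the current folder (fullfilepath is the full S3 path of the file, that is, s3://databoutique.com/ followed by a path from the listing):
aws s3 cp fullfilepath .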
That’s it. You can now play around with it.
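If you want to run the CLI on a schedule rather than by hand, a small wrapper script can do it. The following is a minimal sketch, not an official Data Boutique tool: it assumes the AWS CLI is installed and already configured with your keys, and it swaps the single-file `aws s3 cp` for `aws s3 sync`, which copies only files that are new or changed. The function names and the destination folder are hypothetical.

```python
import subprocess

BUCKET = "databoutique.com"  # bucket name given in this guide


def build_sync_command(access_key, dest_dir, region="eu-central-1"):
    """Return the argument list for `aws s3 sync` from your buyer folder to dest_dir."""
    source = f"s3://{BUCKET}/buyers/{access_key}/"
    return ["aws", "s3", "sync", source, dest_dir, "--region", region]


def sync_files(access_key, dest_dir):
    """Run the sync; raises CalledProcessError if the AWS CLI reports an error."""
    subprocess.run(build_sync_command(access_key, dest_dir), check=True)
```

A call such as `sync_files("YourAWSAccessKey", "./databoutique")` could then be triggered by cron or any other scheduler at the same frequency as your file delivery.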
AWS S3 API
Do you want to use an API? Get started with the AWS S3 Get Object API (use the above credentials to see all your files).
The process changes slightly depending on the programming language you choose, but the key elements to input remain the same:
AWS Access Key (see above for how to get it)
AWS Secret Key (see above for how to get it)
AWS S3 Region: eu-central-1 (EU Frankfurt)
AWS S3 bucket: databoutique.com
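As an illustration, here is how those four inputs could fit together in Python using boto3, the AWS SDK for Python. Treat it as a sketch under assumptions: boto3 has to be installed separately, the function names are hypothetical, the `buyers/<access key>/` prefix is taken from the CLI listing command in this guide, and "latest" simply means the object with the most recent `LastModified` timestamp.

```python
import shutil

BUCKET = "databoutique.com"  # from this guide
REGION = "eu-central-1"      # EU Frankfurt, from this guide


def make_client(access_key, secret_key):
    """Create an S3 client with your Data Boutique keys."""
    import boto3  # third-party package: pip install boto3

    return boto3.client(
        "s3",
        region_name=REGION,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def list_files(client, access_key):
    """Return (key, last_modified) for every object under buyers/<access key>/."""
    paginator = client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"buyers/{access_key}/"):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["LastModified"]))
    return found


def download_latest(client, access_key, dest_path):
    """Fetch the most recently modified file with Get Object.

    Returns the key that was downloaded, or None if the folder is empty.
    """
    files = list_files(client, access_key)
    if not files:
        return None
    key, _ = max(files, key=lambda item: item[1])
    response = client.get_object(Bucket=BUCKET, Key=key)
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response["Body"], out)  # stream to disk
    return key
```

With real keys, `download_latest(make_client(access_key, secret_key), access_key, "latest.csv")` would save the newest delivered file locally; the file name here is only a placeholder.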
Make.com (formerly Integromat)
No-code fan? So are we! Here’s a step-by-step guide on how to use make.com:
Create a scenario
Add an AWS S3 module
Use a List Files or Get File action, or an API call
Configure an AWS S3 connection using your Data Boutique credentials (see above for where to find them). Remember the region: EU Frankfurt (eu-central-1).
And you’re good to integrate this with the rest of your workflow.
Zapier
Zapier fan? Configure an Amazon S3 connection using your Data Boutique AWS keys (remember the region: EU Central 1).
You can build your Zap to move the S3 file to where you need it (for example, to Google Drive whenever a new file is added or updated).
When configuring, pick the databoutique.com bucket.
And you’re all set.
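If you would rather script the "whenever a new file is added" trigger yourself, the core of it is remembering which files you have already handled. Below is a minimal sketch of that idea, assuming you already have a list of file paths from S3 (for example, from the `aws s3 ls` listing). The state file name and the helper names are hypothetical, and this is not how Zapier or Make.com work internally.

```python
import json
from pathlib import Path


def load_seen(state_file):
    """Return the set of keys recorded in state_file (empty if it does not exist yet)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_seen(state_file, seen):
    """Store the handled keys as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(seen)))


def pick_new(remote_keys, seen):
    """Return the keys not handled yet, in the order they were listed."""
    return [key for key in remote_keys if key not in seen]
```

On each run you would call `pick_new`, copy those files, then add them to the set and call `save_seen`. Note that this tracks file names only, so a file that is overwritten under the same name would not be picked up again; comparing modification dates as well would cover that case.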
Happy data crunching.
Join the Project
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add your request if a website is not listed. Saving datasets to your interest list helps sellers gauge the demand for those datasets and decide to join the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.