Problems When Disclosing How Data Is Collected
Recent news highlights how in-house data collection might not be taken well by users.
About Data Boutique
Data Boutique is a web-scraped data marketplace.
If you’re looking for web data, there is a high chance someone is already collecting it. Data Boutique makes it easier to buy web data from them.
Join our platform to learn more about and discuss this project:
Zoom, GPTBot, and Google's "fair use" of copyrighted data
Three pieces of news relevant to what we do at Data Boutique came out recently:
Updates to videoconferencing app Zoom's Terms of Service caused resentment among users, who feared their personal conversations might be (or might already have been) used to train AI. Zoom later said users must opt in, but the change nonetheless caused a heated debate;
OpenAI released the specifics of GPTBot, a web crawler aimed at collecting data for LLMs. OpenAI states that websites can opt out of being scraped (a robots.txt example follows this list), but the question remains what incentive websites have to stay in, since they would be feeding a paid, Microsoft-backed service for free, with no reward;
In Australia, Google is arguing for a "fair use" approach to collecting copyrighted data. In essence, Google acknowledges crawling copyrighted content and argues that this should not be considered unlawful. As in the OpenAI case, Google says websites can opt out, which raises the same questions as GPTBot (provided Google can separate this crawler from the Googlebot that feeds its search engine).
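For reference, opting out of GPTBot is a small robots.txt change. The sketch below shows the two lines OpenAI's documentation describes for blocking the crawler site-wide (the user-agent name is OpenAI's; whether and where to apply the rule is up to each site owner):

    # Block OpenAI's GPTBot from crawling any part of the site
    User-agent: GPTBot
    Disallow: /

The open question raised above is not how to opt out, but why a website would choose not to.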
While this is not the place to discuss these stories in detail, I am leaving a link to a podcast I recommend by Nathaniel Whittemore covering the Zoom and OpenAI cases (I look forward to a new episode on Google).
Why this is relevant to us
What is interesting for us is the growing disclosure of data collection methods for AI, driven by public debate, regulators, and court filings. In summary:
Public opinion and regulators are paying more attention to what data is used in AI and are increasingly skeptical about providing their own data for free;
Companies are forced to move out of stealth and disclose their data collection operations;
Current solutions seem to converge on adding an opt-in/opt-out feature, which suggests a transition in sourcing models from "complete but unethical" to "clean but partial", as we can expect many websites and users to opt out.
The Data Market Question
The Zoom case concerns data on how users interact with Zoom's own application, which cannot be collected in any other way (though that does not mean users are willing to consent). The OpenAI and Google cases, by contrast, concern the collection of general-purpose data, and they represent just a tiny portion of an AI ecosystem that feeds its business models with web-scraped data and finds itself in the same situation.
In the eyes of consumers, these companies are not being questioned on the quality and innovation of their algorithms and models, but on the way they source their data.
There is an obvious decoupling of the business of collecting data from the business of using it, especially now that technological progress has made the tools for using this data available to everyone.
In our vision, this separation between data collection and data usage, and the increasing attention from public opinion and regulators on the former, sets the stage for the general adoption of data markets: markets for data that exists regardless of its use cases, whose collection methods adhere to accepted rules, and whose quality is assured to prevent selection bias in the applications built on top of it.
A data market is where data collection is brought into the open, follows common rules, is performed professionally and reliably, and where the resulting data is offered at fair market prices.
The Value of Accountability
When AI projects keep data collection in-house, they take on a commoditized portion of the value chain along with an embedded responsibility for a question that is critical in today's society: where and how did you get the data? Users, public opinion, regulators, and investors will sooner or later ask it.
Data marketplaces address this. By taking on that accountability, they relieve data and AI projects of the burden of a time- and resource-consuming activity loaded with reputational and operational risks.
Although not every data need can be addressed by marketplaces, why on earth would you embark on something that burns cash, takes time, carries risk, and raises eyebrows when you could have found it in a store? Because data is critical to your project? That is exactly why you should not do it in-house.
Join the Project
That was it for this week!
Data Boutique is a community for sustainable, ethical, high-quality web data exchanges. You can browse the current catalog and add a request if a website is not listed. Saving datasets to your interest list helps sellers correctly size demand and decide to onboard the platform.
More on this project can be found on our Discord channels.
Thanks for reading and sharing this.