Daily Management Review

Inside The Clandestine Big Tech Battle To Purchase AI Training Data


At its peak in the early 2000s, Photobucket was the most popular image-hosting website in the world. With 70 million users, it accounted for over half of the online photo market in the United States and served as the media backbone for once-popular sites like Myspace and Friendster.
According to analytics tracker Similarweb, Photobucket now has just 2 million users. However, the emergence of generative AI could revitalise it.
In an interview with Reuters, Ted Leonard, CEO of the 40-person company based in Edwards, Colorado, said that he is in negotiations with several tech firms to license Photobucket's 13 billion images and videos for use in training generative AI models, which can generate new content in response to text prompts.
He said he has discussed rates ranging from 5 cents to $1 per photo and more than $1 per video, with prices varying widely depending on the buyer and the kind of material sought.
"We've spoken to companies that say, 'we need way more,'" Leonard continued. One prospective buyer told him it wanted over a billion videos, more than his platform has.
"You scratch your head and say, where do you get that?"
Citing commercial confidentiality, Photobucket declined to reveal the identities of its potential buyers. The continuing talks, which have not been made public before, suggest that the company may be sitting on content worth billions of dollars, and they offer a window into the thriving data market that is emerging as generative AI technology becomes increasingly dominant.
Tech behemoths like Google, Meta, and Microsoft-backed OpenAI initially used massive amounts of data scraped for free from the internet to train generative AI models like ChatGPT, which can mimic human creativity. They have argued that this is both legal and ethical, though they face lawsuits over the practice from several copyright holders.
At the same time, these tech giants are quietly paying for content that sits behind paywalls and login screens, giving rise to a hidden market for everything from chat logs to long-forgotten personal photos from defunct social networking apps.
Edward Klaris of the law firm Klaris Law said there is currently a rush to go after copyright holders who have private collections of content that cannot be scraped. The firm advises content owners on agreements worth tens of millions of dollars each to license archives of images, videos, and books for artificial intelligence training.
In order to provide the first in-depth analysis of this emerging market, Reuters spoke with more than 30 people who were knowledgeable about AI data deals. These people included lawyers, consultants, and current and former executives at the companies involved. They also provided details about the kinds of content being purchased, the prices that were being offered, and the growing concerns about the possibility that personal data would be incorporated into AI models without the knowledge or consent of individuals.
OpenAI, Google, Meta, Apple, and Amazon all declined to comment on particular data deals and talks for this piece, although Microsoft and Google referred Reuters to supplier codes of conduct that contain data-privacy provisions.
Google said it would "take immediate action, up to and including termination" of its agreement with a supplier if it found a violation.
Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, in which companies often do not disclose agreements. Those that have, like Business Research Insights, put the market at about $2.5 billion at present and predict it could reach as much as $30 billion within the next ten years.
The data grab coincides with growing pressure on makers of large-scale generative AI "foundation" models to account for the vast volumes of content they feed into their systems, a process known as "training" that requires intense processing power and frequently takes months to finish.
Tech companies say the technology would be prohibitively expensive without the ability to use massive archives of freely scraped web page data, such as those made available by the non-profit repository Common Crawl, which they describe as "publicly available."
Even so, their approach has sparked a wave of copyright litigation and regulatory scrutiny, and has prompted publishers to add technology to their websites that blocks scraping.
In response, developers of AI models have begun to manage risks and safeguard data supply chains by entering into agreements with content owners and by utilising the growing number of data brokers that have emerged to meet demand.
For example, companies including Meta, Google, Amazon, and Apple reached agreements with stock image provider Shutterstock in the months following ChatGPT's launch in late 2022 to use hundreds of millions of images, videos, and music files from its library for training, according to a person familiar with the agreements.
According to Shutterstock's Chief Financial Officer Jarrod Yahes, the deals with Big Tech companies initially ranged from $25 million to $50 million each, though most of them were later expanded.
He said that in the last two months there has been a new "flurry of activity" as smaller tech players have followed suit.
Yahes declined to discuss specific contracts. The terms of the other agreements, as well as the one with Apple, were not previously disclosed.
Freepik, a rival to Shutterstock, told Reuters that it had reached deals with two major tech firms to license the bulk of its 200 million-image archive for two to four cents per image. CEO Joaquin Cuenca Abela said that five more deals of a similar nature were in the works, but he would not name the buyers.
In addition, at least four news organisations, including The Associated Press and Axel Springer, have signed licensing agreements with OpenAI, an early Shutterstock customer. Thomson Reuters, the owner of Reuters News, said separately that it had agreements in place to license news content for training AI large language models, but it did not provide any specifics.
A new sector of AI data firms is also developing, securing rights to real-world content such as podcasts, short-form videos, and interactions with digital assistants. These firms are also building networks of short-term contract workers who produce custom visuals and audio samples, akin to an Uber-style gig economy for data.
CEO Daniela Braga of Seattle-based Defined.ai told Reuters that the company licenses data to a variety of businesses, including Google, Meta, Apple, Amazon, and Microsoft.
Companies are typically prepared to pay $1 to $2 per photograph, $2 to $4 per short-form video, and $100 to $300 per hour of longer films, according to Braga. Rates vary depending on the buyer and type of content. She also said that the market rate for text is $0.001 per word.
Nude photos, which need to be handled with extreme care, sell for $5 to $7, according to her.
According to Braga, Defined.ai splits those earnings with content suppliers. She said the company markets its datasets as "ethically sourced," meaning that personally identifying information is removed and consent is obtained from the individuals whose data is used.
One of the company's suppliers, an entrepreneur from Brazil, said he pays the owners of the images, podcasts, and medical data he sources between 20% and 30% of the total deal amount.
The supplier, who spoke on condition that his company's identity be withheld due to commercial sensitivity, said that the most expensive photographs in his portfolio are those used to train AI systems that block content, such as graphic violence, that is prohibited by the tech giants.
To meet that demand, he sources photos of crime scenes, violent conflicts, and surgical procedures primarily from law enforcement, independent photojournalists, and medical students, respectively. These photos are frequently taken in South America and Africa, where it is more customary to distribute graphic photographs, he said.
He said he has received photos from independent photographers in Gaza since the conflict there began in October, along with some from Israel when hostilities first broke out.
He said his company hires nurses accustomed to seeing violent injuries to anonymise and annotate the images, which are disturbing to untrained eyes.
Many of the industry players interviewed said that while licensing could address some legal and ethical concerns, resurrecting the archives of long-gone internet brands like Photobucket to feed the newest AI models presents other challenges, especially with regard to user privacy.
AI systems have been caught reproducing exact copies of their training material, spitting out text from New York Times stories, photographs of real individuals, and the Getty Images watermark, to name a few examples. As a result, a person's private images or intimate thoughts from decades ago could end up in generative AI outputs without notification or express consent.
Citing an October update to the terms of service that gives the company the "unrestricted right" to sell any uploaded content for the purpose of training AI systems, Photobucket CEO Leonard argues that he is on firm legal ground. He sees licensing data as an alternative to selling advertisements.
"We need to pay our bills, and this could give us the ability to continue to support free accounts," he stated.
Braga of Defined.ai said she prefers to source social media photographs from the influencers who create them, since they have a stronger claim to licensing rights than "platform" firms like Photobucket.
"I would find it very risky," Braga said of platform content. "If there's some AI that generates something that resembles a picture of someone who never approved that, that's a problem."
Photobucket is not the only platform embracing licensing. Last month, Automattic, the parent company of Tumblr, said that it was sharing content with "selected AI companies."
Reuters reported in February that Reddit had reached an agreement with Google to make its content available for training Google's AI models.
Reddit disclosed ahead of its March IPO that the Federal Trade Commission is looking into its data-licensing business and that the business could fall foul of evolving laws on intellectual property and privacy.
The FTC declined to comment on the Reddit inquiry or to say whether it was investigating other deals involving training data. In February, the FTC warned businesses against retroactively altering terms of service for AI usage.