It’s well known that OpenAI uses vast amounts of data, some copyrighted, from across the internet to create the remarkably human-like responses of ChatGPT. The legality of this practice continues to be debated, as evidenced by lawsuits from the New York Times and others. But how did OpenAI train Sora, its new video AI program?
YouTube CEO Neal Mohan told Bloomberg that if Sora used YouTube content, it would be a clear violation of the platform’s terms of service. While not accusing OpenAI of doing so, Mohan said scraping or downloading YouTube videos or transcripts would violate those terms, which prohibit unauthorized use of its content.
Mohan was referring to longstanding questions about what data AI companies use to develop their models. Creators who upload to YouTube expect its terms of service to be honored, including the bar on third parties taking their content.
A YouTube spokesperson confirmed that its terms prohibit scraping or downloading its content, without elaborating on Mohan’s comments. OpenAI did not immediately respond when asked whether it used YouTube videos for Sora.
Previously, OpenAI admitted that some copyrighted data was necessary to build its AI models. More recently, when asked, OpenAI CTO Mira Murati could not confirm what specific content was used for Sora, including whether any of it came from YouTube. She said any data used was either publicly available or licensed.
With Google developing its own AI tools, parent company Alphabet has an added incentive to ensure rivals don’t tap into YouTube’s trove of data. “Google wants that data for its own models,” said Igor Jablokov, CEO of AI startup Pryon. He expects more “walled gardens” around data access.
For example, Reddit has a $60 million annual deal that lets Google use its content. Media firms like the Associated Press and Axel Springer also have licensing deals with OpenAI that give it access to their content. As the AI race accelerates, data itself is becoming a highly valued asset.