AI company Runway has reportedly scraped thousands of YouTube videos and pirated films without permission. 404 Media obtained internal spreadsheets suggesting the AI video generation startup used YouTube content from channels run by Disney, Netflix, and Pixar, as well as popular media outlets, to train its Gen-3 model.
A purported former Runway employee told the outlet that the company used spreadsheets to flag lists of wanted videos in its database, then used open-source proxy software to download them undetected and cover its tracks. One sheet listed simple keywords like "astronaut," "fairy," and "rainbow," with notes indicating whether the company had found corresponding high-quality videos for training. The entry for "superhero," for example, carries the note "lots of movie clips."
Other notes show Runway flagging the Unreal Engine YouTube channel, filmmaker Josh Neuman, and Call of Duty fan pages as good sources of “high-motion” training videos.
"That spreadsheet was a company-wide effort to find good videos to use in building models," the former employee told 404 Media. "This was then used as input to a large-scale web crawler, which downloaded every video from every channel, using proxies to avoid being blocked by Google."
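The mechanism the former employee describes, routing bulk downloads through rotating proxies so no single IP address draws a block, is a standard scraping pattern. A minimal, purely illustrative sketch (the proxy addresses and the use of the open-source `yt-dlp` downloader are assumptions for the example, not details from the report):

```python
import itertools
import subprocess

# Hypothetical proxy pool; a real scraping operation would rotate
# through many such endpoints to spread requests across addresses.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

def rotate_proxies(urls, proxies):
    """Pair each video URL with the next proxy in a round-robin cycle."""
    pool = itertools.cycle(proxies)
    return [(url, next(pool)) for url in urls]

def download_all(urls):
    # Each download goes out through a different proxy, so no single
    # address issues enough requests to be rate-limited or blocked.
    for url, proxy in rotate_proxies(urls, PROXIES):
        subprocess.run(["yt-dlp", "--proxy", proxy, url], check=False)
```

The round-robin pairing is the whole trick: from the server's perspective, the traffic looks like many unrelated clients each fetching a few videos.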
One spreadsheet listed nearly 4,000 YouTube channels and showed “recommended channels” from CBS New York, AMC Theatres, Pixar, Disney Plus, Disney CDs, and the Monterey Bay Aquarium (no AI model would be complete without an otter).
Runway also reportedly created a separate list of videos from piracy sites. The spreadsheet, titled “Non-YouTube Sources,” contains 14 links to sources such as an unauthorized online archive of Studio Ghibli films, anime and movie piracy sites, a fan site that displays Xbox game videos, and the anime streaming site kisscartoon.sh.
When 404 Media fed the names of popular YouTubers from the spreadsheet into the video generator, it spit out strikingly similar results, which the outlet argued was strong evidence that the company had trained on those videos. Crucially, feeding the same names into Runway's older Gen-2 model, trained before the spreadsheet's suspect data was collected, produced "irrelevant" results, such as generic men in suits. What's more, after 404 Media contacted Runway to ask why caricatures of YouTubers were appearing in the output, the AI tool stopped generating them altogether.
“We hope that by sharing this information, people will have a better understanding of the scale of these companies and what they do to make ‘cool’ videos,” a former employee told 404 Media.
Asked for comment, a YouTube representative referred Engadget to an April Bloomberg interview in which CEO Neal Mohan said training AI models on YouTube videos was a "clear violation" of the company's terms of service. "Our previous comments on this matter remain valid," YouTube spokesperson Jack Mason wrote to Engadget.
Runway did not respond to a request for comment at the time of publication.
At least some AI companies appear to be racing to entrench their tools and establish market leadership before users and courts catch up with how they were built. Permission-based training through licensing agreements, an approach companies like OpenAI have recently adopted, is one path; treating the entire internet, copyrighted material included, as fair game in a cutthroat contest for profit and control is a far more dubious (if not outright illegal) one.
404 Media’s excellent coverage is well worth a read.