AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability
Yesterday, Meta’s AI Research Team announced Make-A-Video, a “state-of-the-art AI system that generates videos from text.”
Like he did for the Stable Diffusion data, Simon Willison created a Datasette browser to explore WebVid-10M, one of the two datasets used to train the video generation model, and quickly learned that all 10.7 million video clips were scraped from Shutterstock, watermarks and all.
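Willison’s Datasette instance makes that easy to browse, but the claim is also simple to sanity-check with a few lines of Python. Here’s a minimal sketch, assuming you’ve downloaded the WebVid-10M metadata as a CSV; the filename and the URL column name are guesses on my part, so adjust them to match the actual file:

```python
import csv
from collections import Counter
from urllib.parse import urlparse

# Hypothetical local copy of the WebVid-10M metadata.
# Both the filename and the "contentUrl" column name are assumptions.
hosts = Counter()
with open("webvid_10m_metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        hosts[urlparse(row["contentUrl"]).netloc] += 1

# If every clip really was scraped from Shutterstock, the hostnames
# should all be variants of *.shutterstock.com.
for host, count in hosts.most_common(10):
    print(f"{host}: {count:,} clips")
```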
In addition to the Shutterstock clips, Meta also used 10 million video clips from this 100M video dataset from Microsoft Research Asia. It’s not mentioned on their GitHub, but if you dig into the paper, you learn that the clips were taken from over 3 million YouTube videos.
So, in addition to a massive chunk of Shutterstock’s video collection, Meta is also using millions of YouTube videos collected by Microsoft to make its text-to-video AI.
Non-Commercial Use
The academic researchers who compiled the Shutterstock dataset acknowledged the copyright implications in their paper, writing, “The use of data collected for this study is authorised via the Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study.”
But then Meta is using those academic non-commercial datasets to train a model, presumably for future commercial use in their products. Weird, right?
Not really. It’s become standard practice for technology companies working with AI to commercially use datasets and models collected and trained by non-commercial research entities like universities or non-profits.
In some cases, they’re directly funding that research.
For example, many people believe that Stability AI created the popular text-to-image AI generator Stable Diffusion, but the model was actually built by the Machine Vision & Learning research group at the Ludwig Maximilian University of Munich, with Stability AI funding its development. In their repo for the project, the LMU researchers thank Stability AI for the “generous compute donation” that made it possible.
The massive image-text caption datasets used to train Stable Diffusion, Google’s Imagen, and the text-to-image component of Make-A-Video weren’t made by Stability AI either. They all came from LAION, a small nonprofit organization registered in Germany. Stability AI directly funds LAION’s compute resources, as well.
Shifting Accountability
Why does this matter? Outsourcing the heavy lifting of data collection and model training to non-commercial entities allows corporations to avoid accountability and potential legal liability.
It’s currently unclear if training deep learning models on copyrighted material is a form of infringement, but it’s a harder case to make if the data was collected and the model trained in a non-commercial setting. One of the four factors of the “fair use” exception in U.S. copyright law is the purpose or character of the use. In their Fair Use Index, the U.S. Copyright Office writes:
“Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair.”
A federal court could find that the data collection and model training would otherwise infringe copyright but, because it was conducted by a university and a nonprofit, falls under fair use.
Meanwhile, a company like Stability AI would be free to commercialize that research in their own DreamStudio product, or however else they choose, taking credit for its success to raise a rumored $100M funding round at a valuation upwards of $1 billion, while shifting any questions around privacy or copyright onto the academic/nonprofit entities they funded.
This academic-to-commercial pipeline abstracts away ownership of the data and models from their practical applications, a kind of data laundering where vast amounts of information are ingested, manipulated, and frequently relicensed under an open-source license for commercial use.
Unforeseen Consequences
Years ago, like many people, I used to upload my photos to Flickr with a Creative Commons license that required attribution and allowed non-commercial use. Yahoo released a database of 100 million of those Creative Commons-licensed images for academic research, to help the burgeoning field of AI. Researchers at the University of Washington took 3.5 million of the Flickr photos containing faces, representing over 670,000 people (including me), and released them as the MegaFace dataset, part of a research competition sponsored by Google and Intel.
I was happy to let people remix and reuse my photos for non-commercial use with attribution, but that’s not how they were used. Instead, academic researchers took the work of millions of people, stripped it of attribution against its license terms, and redistributed it to thousands of groups, including corporations, military agencies, and law enforcement.
In their analysis for Exposing.ai, Adam Harvey and Jules LaPlace summarized the impact of the project:
[The] MegaFace face recognition dataset exploited the good intentions of Flickr users and the Creative Commons license system to advance facial recognition technologies around the world by companies including Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung, SenseTime, Sogou, Tencent, and Vision Semantics to name only a few. According to the press release from the University of Washington, “more than 300 research groups [were] working with MegaFace” as of 2016, including multiple law enforcement agencies.
That dataset was used to build the facial recognition AI models that now power surveillance tech companies like Clearview AI, in use by law enforcement agencies around the world, as well as the U.S. Army. The Chinese government has used it to train their surveillance systems. As the New York Times reported last year:
MegaFace has been downloaded more than 6,000 times by companies and government agencies around the world, according to a New York Times public records request. They included the U.S. defense contractor Northrop Grumman; In-Q-Tel, the investment arm of the Central Intelligence Agency; ByteDance, the parent company of the Chinese social media app TikTok; and the Chinese surveillance company Megvii.
The University of Washington eventually decommissioned the dataset and no longer distributes it. I don’t think any of those researchers, or even the people at Yahoo who decided to release the photos in the first place, ever foresaw how it would later be used.

They were motivated to push AI forward and didn’t consider the possible repercussions. They could have made inclusion in the dataset opt-in, but they didn’t, probably because it would’ve been complicated and the data wouldn’t have been nearly as useful. They could have enforced the license and restricted commercial use of the dataset, but they didn’t, probably because it would have been a lot of work and probably because it would have impacted their funding.
Asking for permission slows technological progress, but it’s hard to take back something you’ve unconditionally released into the world.
As I wrote about last month, I’m incredibly excited about these new AI art systems. The rate of progress is staggering, with three stunning announcements yesterday alone: aside from Meta’s Make-A-Video, there was also DreamFusion for text-to-3D synthesis and Phenaki, another text-to-video model capable of making long videos with prompts that change over time.
But I grapple with the ethics of how they were made and the lack of consent, attribution, or even an opt-out for their training data. Some are working on this, but I’m skeptical: once a model is trained on something, it’s nearly impossible for it to forget. (At least for now.)
Like with the artists, photographers, and other creators found in the 2.3 billion images that trained Stable Diffusion, I can’t help but wonder how the creators of those 3 million YouTube videos feel about Meta using their work to train its new model.
A mysterious voice is haunting American Airlines’ in-flight announcements and nobody knows how
Here’s a little mystery for you: there are multiple reports of a mysterious voice grunting, moaning, and groaning on American Airlines’ in-flight announcement systems, sometimes lasting the duration of the flight — and nobody knows who’s responsible or how they did it.
Actor/producer Emerson Collins was the first to post video, from his flight on September 6:
Here’s an MP3 of just the groans, moans, and grunts, with some of the background noise filtered out.
This is the only video evidence so far, but Emerson is one of several people who have experienced this on multiple different American Airlines flights. This thread from JonNYC collected several different reports from airline employees and insiders, on both Airbus A321 and Boeing 737-800 planes.
Other people have reported similar experiences, always on American Airlines, going as far back as July. Every known incident has gone through the greater Los Angeles area (including Santa Ana) or Dallas-Fort Worth. Here are all the incidents I’ve seen so far, in chronological order:
- July – American Airlines, JFK to LAX. Bradley P. Allen wrote, “My wife and I experienced this during an AA flight in July. To be clear, it was just sounds like the moans and groans of someone in extreme pain. The crew said that it had happened before, and had no explanation. Occurred briefly 3 or 4 times early in the flight, then stopped.” (Additional flight details via the LA Times.)
- August 5 – American Airlines 117. JFK to LAX. Wendy Wanderman wrote, “It happened on my flight August 5 from JFK to LAX and it was an older A321 that I was on. It was Flight 117. There was flight crew that was on the same plane a couple days earlier and the same thing happened. It was funny and unsettling.”
- September 6 – American Airlines. Santa Ana, CA to Dallas-Fort Worth. Emerson Collins’ flight. “These sounds started over the intercom before takeoff and continued throughout the flight. They couldn’t stop it, and after landing still had no idea what it was… I filmed about fifteen minutes, then again during service. It was calmer for a while mid flight.”
- Mid-September – American Airlines, Airbus A320. Orlando, FL to Dallas-Fort Worth. Doug Boehner wrote, “This happened to me last week. It wasn’t the whole flight, but periodically weird phrases and sounds. Then a huge ‘oh yeah’ when we landed. We thought the pilot left his mic open.”
- September 18 – American Airlines 1631, Santa Ana, CA to Dallas-Fort Worth. Boeing 737-800. An anonymous report passed on by JonNYC, “Currently on AA1631 and someone keeps hacking into the PA and making moaning and screaming sounds 😨 the flight attendants are standing by their phones because it isn’t them and the captain just came on and told us they don’t think the flight systems are compromised so we will finish the flight to DFW. Sounded like a male voice and wouldn’t last more than 5-10 seconds before stopping. And has [intermittently] happened on and off all flight long.” (And here’s a second person on the same flight.)
Interestingly, JonNYC followed up with the person who reported the September 18 incident and asked if it sounded like the same voice as in the video. “Very very similar. Same voice! But ours was less aggressive. Although their volume might have been turned up more making it sound more aggressive. 100% positive same voice.”
Official Response
View from the Wing’s Gary Leff asked American Airlines about the issue, and their official response is that it’s a mechanical issue with the PA amplifier. The LA Times followed up on Saturday, with slightly more information:
“Our maintenance team thoroughly inspected the aircraft and the PA system and determined the sounds were caused by a mechanical issue with the PA amplifier, which raises the volume of the PA system when the engines are running,” said Sarah Jantz, a spokesperson for American.
Jantz said the PA systems are hardwired with no external access and no Wi-Fi component. The airline’s maintenance team is reviewing the additional reports. Jantz did not respond to questions about how many reports the airline has received and whether the reports are from different aircraft.
This explanation feels incomplete to me. How can an amplifier malfunction broadcast what sounds like a human voice without external access? On multiple flights and aircraft? They seem to be saying the source is artificial, but has anyone heard artificial noise that sounds this human?
Why This Is So Bizarre
By nature, passenger announcement systems on planes are hardwired, closed systems, making them incredibly difficult to hack. Professional reverse engineer/hardware hacker/security analyst Andrew Tierney (aka Cybergibbons) dug up the Airbus A321 documents in this thread.
“And on the A321 documents we have, the passenger announcement system and interphone even have their own handsets. Can’t see how IFE or WiFi would bridge,” Tierney wrote. “Also struggling to see how anyone could pull a prank like this.”
This report found by aviation watchdog JonNYC, posted by a flight attendant on an internal American Airlines message board, points to some sort of as-yet-undiscovered remote exploit.
— 🇺🇦 JonNYC 🇺🇦 (@xJonNYC) September 20, 2022
We also know that, at least on Emerson Collins’ flight, there was no in-seat entertainment, eliminating that as a possible exploit vector.
Theories
So, how could this happen? There are a handful of theories, but they’re very speculative.
Medical Intercom
The first theory to emerge, since debunked, came from “a former avionics guy” posting in r/aviation on Reddit:
The most likely culprit IMHO is the medical intercom. There are jacks mounted in the overhead bins at intervals down the full length of the airplane that have both receive, transmit and key controls. All somebody would need to do is plug a homemade dongle with a Bluetooth receiver into one of those, take a trip to the lav and start making noises into a paired mic.
The fact that the captain’s announcements are overriding (ducking) it but the flight attendants aren’t is also an indication it’s coming from that system.
If this was how it was done, there’s no reason the prankster would need to hide in the bathrooms: they could trigger a soundboard or prerecorded audio track from their seat.
However, this theory is likely a dead end. JonNYC reports that an anonymous insider confirmed these jacks no longer exist on American Airlines planes. And even when they did, the medical intercoms didn’t patch into the announcement system: they only allowed flight crew to talk to medical staff on the ground.
someone says: “These don’t exist on AA. We use an app on our iPhone to contact medical personnel on the ground. No such port exists, not since the super80 and they were inop’d.”
— 🇺🇦 JonNYC 🇺🇦 (@xJonNYC) September 24, 2022
Pre-Recorded Audio Message Bug
Another theory, also courtesy of JonNYC, is that there’s an issue with the pre-recorded announcement machine (“PRAM”), which was replaced in the last 60 days, within the timeframe of all these incidents. Perhaps some test audio was added to the end of a message, maybe by an engineer who worked on it, and it’s accidentally playing that extra audio?
It's probably the PRAM… Pre-Recorded Announcement Machine.
— Mɪᴄʜᴀᴇʟ Tᴏᴇᴄᴋᴇʀ (@mtoecker) September 23, 2022
These have solid state storage, techs just load files they get from *somewhere*, test procedure for audio is less than 20 minutes to check it out, and it can be interrupted by inflight announcements.
Artificial Noise
Finally, some firmly believe that it’s not a human voice at all, but artificial noise or audio feedback filtered through the announcement system.
Nick Anderegg, an engineer with a background in linguistics and phonology, says it’s the result of “random signal passed through a system that extracts human voices.”
An amp malfunction that inputs the signal through algorithms meant to isolate the human voice. All the non-human aspects of the random signal will be stripped out, and the result will appear human. https://t.co/2bXYVCFs2l
— Nick Anderegg loudly supports human rights (@NickAnderegg) September 26, 2022
Anderegg points to a sound heard at the 1:20 mark in Emerson’s video, a “sweep across every frequency,” as evidence that American Airlines’ explanation is accurate.
The tone sweep is just a sign that it’s artificial. Random signals (i.e. interference), when passed through systems designed to isolate the human voice, will make them sound human. It’s attempting to extract a coherent signal where there is none, so it’s approximating one
— Nick Anderegg loudly supports human rights (@NickAnderegg) September 26, 2022
Personally, I struggle with this explanation. The utterances heard during Emerson’s three-hour flight are so wildly different, from groans and grunts to moans and shouts, that it’s difficult to imagine them as anything but human. It’s far from impossible, but I’d love to see anyone try to recreate these sounds with random noise or feedback.
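For anyone who wants to take up that challenge, here’s a rough sketch of the kind of processing Anderegg describes: start with pure random noise, band-pass it to the frequency range where speech energy lives, and pulse it with a slow amplitude envelope so it comes out in syllable-length bursts. This is purely illustrative Python using a generic voice-band filter, not whatever processing actually runs on American’s PA hardware:

```python
import numpy as np
from scipy.signal import butter, lfilter
from scipy.io import wavfile

fs = 16_000                    # sample rate in Hz
t = np.arange(0, 5.0, 1 / fs)  # five seconds of audio

# Start from pure random noise: there is no human voice in this signal.
noise = np.random.randn(len(t))

# Band-pass the noise to roughly the classic telephone voice band
# (about 300-3400 Hz), where most speech energy sits.
b, a = butter(4, [300, 3400], btype="bandpass", fs=fs)
voice_band = lfilter(b, a, noise)

# Apply a slow (~3 Hz), slightly irregular amplitude envelope so the
# noise pulses in bursts roughly the length of syllables or groans.
wobble = np.cumsum(np.random.randn(len(t))) * 0.001
envelope = 0.5 * (1 + np.sin(2 * np.pi * 3 * t + wobble))
shaped = voice_band * envelope

# Normalize and write out a WAV file to listen to the result.
shaped /= np.max(np.abs(shaped))
wavfile.write("voice_like_noise.wav", fs, (shaped * 32767).astype(np.int16))
```

A real voice-isolation or noise-suppression algorithm is far more aggressive than a simple band-pass filter, so treat this as a crude approximation of the theory rather than an attempt to recreate the actual sounds.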
Any Other Ideas?
Any other theories about how this might be possible? I’d love to hear them, and I’ll keep this post updated. My favorite theory so far:
Flying hurts the clouds and their screams are picked by the PA system. Seems pretty obvious 🙄
— Horus First (@HorusFirst) September 23, 2022

Online Art Communities Begin Banning AI-Generated Images
As AI-generated art platforms like DALL-E 2, Midjourney, and Stable Diffusion explode in popularity, online communities devoted to sharing human-generated art are forced to make a decision: should AI art be allowed?

On Sunday, popular furry art community Fur Affinity announced that AI-generated art was not allowed because it “lacked artistic merit.” (In July, one AI furry porn generator was uploading one image every 40 seconds before it was banned.) Their new guidelines are very clear:
Content created by artificial intelligence is not allowed on Fur Affinity.
AI and machine learning applications (DALL-E, Craiyon) sample other artists’ work to create content. That content generated can reference hundreds, even thousands of pieces of work from other artists to create derivative images.
Our goal is to support artists and their content. We don’t believe it’s in our community’s best interests to allow AI generated content on the site.
Last year, the 27-year-old art/animation portal Newgrounds banned images made with Artbreeder, a tool for “breeding” GAN-generated art. Late last month, Newgrounds rewrote their guidelines to explicitly disallow images generated by the new generation of AI art platforms:
AI-generated art is not allowed in the Art Portal. This includes using tools such as Midjourney, Dall-E, and Craiyon, in addition to fractal generators and websites like ArtBreeder, where the user selects two images and they are combined into a new image via machine learning.
There are cases where some use of AI is ok, for example if you are primarily showcasing your character art but use an AI-generated background. In these cases, please note any elements where AI was used so that it is clear to users and moderators.
Tracing and coloring over AI-generated art is something best shared on your blog, as it is much like tracing over someone else’s art.
Bottom line: We want to keep the focus on art made by people and not have the Art Portal flooded with computer-generated art.
It’s not just long-running online communities: InkBlot is a budding art platform, funded on Kickstarter in 2021, that went into open beta just this week. They’ve already adopted a “no tolerance” policy against AI art and are updating their terms of service to exclude it.
Platforms that haven’t taken a stand are now facing public pressure to clarify their policies.
DeviantArt is one of the most popular online art communities, and increasingly, members are complaining that their feeds are getting flooded with AI-generated art. One of the most popular threads in their forums right now asks the staff to “combat AI art,” whether by limiting daily uploads, segregating it under a special category, or banning it entirely.
ArtStation has also been quiet as AI-generated images grow in popularity there. “Trending on ArtStation” is one of the most popular prompts for AI art because of the particular aesthetic and quality of work found there, which nudges the AI toward the style of the work scraped from it, leading to a future ouroboros where AI models are trained on AI-generated art found there.
However you feel about the ethics of AI art, online art communities are facing a very real problem of scale: AI art can be created orders of magnitude faster than traditional human-made art. A powerful GPU can generate thousands of images an hour, even while you sleep.
Lexica, a search engine that indexes only images from Stable Diffusion’s beta tests in Discord, already holds over 10 million images. It would take a lifetime to explore them all, a corpus made by a relatively small group of beta testers in a few weeks.
Left unchecked, it’s not hard to imagine AI art crowding out illustrations that took days or weeks for someone to make.
To keep their communities active, community admins and moderators will have to decide what to do with AI art: allow it, segregate it, or ban it entirely.