The Bitcoin Whitepaper Is Hidden in Every Modern Copy of macOS

While trying to fix my printer today, I discovered that a PDF copy of Satoshi Nakamoto’s Bitcoin whitepaper apparently shipped with every copy of macOS since Mojave in 2018.

I’ve asked over a dozen Mac-using friends to confirm, and it was there for every one of them. The file is found in every version of macOS from Mojave (10.14.0) to the current version, Ventura (13.3), but isn’t in High Sierra (10.13) or earlier.

See for Yourself

If you’re on a Mac, open a Terminal and type the following command:

open /System/Library/Image\ Capture/Devices/VirtualScanner.app/Contents/Resources/simpledoc.pdf

If you’re on macOS 10.14 or later, the Bitcoin PDF should immediately open in Preview.

(If you’re not comfortable with Terminal, open Finder and click on Macintosh HD, then open the System→Library→Image Capture→Devices folder. Control-click on VirtualScanner.app and Show Package Contents, open the Contents→Resources folder inside, then open simpledoc.pdf.)
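
If you want to check that the bundled file really is the Bitcoin whitepaper and not just a lookalike, one simple test is comparing its SHA-256 hash against the reference copy hosted at bitcoin.org. Here’s a minimal Python sketch of that comparison; it assumes the file is still present on your macOS version and that you’re comfortable fetching the reference PDF over the network:

  import hashlib
  import urllib.request

  # Path documented above; present on macOS 10.14 through 13.3
  LOCAL_PDF = ("/System/Library/Image Capture/Devices/"
               "VirtualScanner.app/Contents/Resources/simpledoc.pdf")

  def sha256_hex(data: bytes) -> str:
      return hashlib.sha256(data).hexdigest()

  with open(LOCAL_PDF, "rb") as f:
      local_hash = sha256_hex(f.read())

  # Reference copy of the whitepaper hosted at bitcoin.org
  remote_pdf = urllib.request.urlopen("https://bitcoin.org/bitcoin.pdf").read()
  remote_hash = sha256_hex(remote_pdf)

  print("local: ", local_hash)
  print("remote:", remote_hash)
  print("identical" if local_hash == remote_hash else "different")

Matching hashes would mean it’s a byte-for-byte copy; differing hashes would only mean the bundled PDF was re-saved at some point, not that it’s a different document.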

In the Image Capture utility, the Bitcoin whitepaper is used as a sample document for a device called “Virtual Scanner II,” which is either hidden or not installed for everyone by default. It’s not clear why it’s hidden for some or what exactly it’s used for, but Reid Beels suggested it may power the “Import from iPhone” feature.

In Image Capture, select the “Virtual Scanner II” device if it exists, and in the Details, set the Media to “Document” and Media DPI to “72 DPI.” You should see the preview of the first page of the Bitcoin paper.

Screenshot of Image Capture utility with the Virtual Scanner II device selected, previewing the first page of the Bitcoin whitepaper

But Why

Of all the documents in the world, why was the Bitcoin whitepaper chosen? Is there a secret Bitcoin maxi working at Apple? The filename is “simpledoc.pdf” and it’s only 184 KB. Maybe it was just a convenient, lightweight multipage PDF for testing purposes, never meant to be seen by end users.

There’s virtually nothing about this online. As of this moment, there are only a couple of references to “Virtual Scanner II” or the whitepaper file in Google results: namely, this Twitter thread from designer Joshua Dickens in November 2020, who also spotted the whitepaper PDF, inspiring this Apple Community post in April 2021. And that’s it!

One other oddity: there’s a file called cover.jpg in the Resources folder used for testing the Photo media type, a 2,634×3,916 JPEG photo of a sign taken on Treasure Island in the San Francisco Bay. There’s no EXIF metadata in the file, but photographer Thomas Hawk identified it as the location of a nearly identical photo he shot in 2008.

Photo of a weathered hand-painted "6" sign and a sticker reading "Warning! Alarm System" on blue painted wood

If you know anything more — about how or why the Bitcoin paper ended up in macOS or what Virtual Scanner II is for — get in touch or leave a comment. (Anonymity guaranteed!)

Update: A little bird tells me that someone internally filed it as an issue nearly a year ago, assigned to the same engineer who put the PDF there in the first place, and that person hasn’t taken action or commented on the issue since. They’ve indicated it will likely be removed in future versions.

Update (April 26): As confirmed by 9to5Mac, it was removed in macOS Ventura 13.4 beta 3.

Pirating the Oscars 2023: The Final Curtain Call

It’s Oscar night! Which means I’m curled up on my couch, watching the ceremony and doing data entry, updating my spreadsheet tracking the illicit distribution of Oscar-nominated films online.

The results are in, and once again, nearly every nominee leaked online in HD quality before the broadcast. All but one of this year’s 30 nominated films leaked online — everything except Avatar: The Way of Water.

But not a single screener for a nominated film leaked by Oscar night — for the first time in the 20 years I’ve been tracking it.

What Happened?

For the first five years of the project, every year from 2003 to 2007, over 80% of screeners for nominated films made their way online. And now, not one screener leaked.

If you’ve read my past reports, you’ll know this is the culmination of a long-standing trend.

Oscar voters still get access to screeners for every nominated film, now entirely via streaming. But they typically get access to screeners after other high-quality sources for the films have already appeared online, usually from other streaming services or on-demand rentals.

This is a huge difference from 20 years ago. Back then, screeners were highly prized because they were often the only way to watch Oscar-nominated films outside of a theater. Theatrical release windows were longer, and it could take months for nominees to get a retail release.

But over time, things changed. The MPAA, often at the behest of Academy voters, stayed committed to the DVD format well into the 2010s, and those 480p discs became increasingly undesirable as 1080p and 4K sources grew far more valuable.

A shift from theaters to streaming meant more audiences expected to watch movies at home, shrinking the window from theatrical release to on-demand streaming and rentals. Then the pandemic put the nail in the screener’s coffin, as people stayed home.

You can see this trend play out in the chart below, which shows the percentage of nominated films that leaked online as screeners, compared to the percentage that leaked in any other high-quality format.

In last year’s analysis, I wondered if the time between theatrical release and the first high-quality leak online would start to increase again, as more movies returned to theaters and studios experimented with longer windows. That appears to have happened, as the chart below shows, but there may be another contributing factor.

Last December, TorrentFreak reported on the notable lack of screener leaks, mentioning rumors of a bust that may have taken down EVO, the scene release group responsible for the majority of screener leaks in recent years. (Update: Three days after the Oscars aired, those rumors were confirmed. Portuguese authorities arrested EVO’s leaders in November 2022.)

Regardless of the reasons, it seems clear that no release group got access to the Academy Screening Room, where voters can access every screener for streaming, or perhaps the risk of getting caught outweighed the possible return.

Closing the Curtain

In 2004, I started this project to demonstrate how screener piracy was far more widespread than the Academy believed, and I kept tracking it to see if anything the Academy did would ever stop scene release groups from leaking screeners.

In the process, this data ended up being a reflection of changes in how we consume movies: changing media formats and increasing resolution, the shift to streaming, and shrinking release windows from theaters to streaming.

I didn’t think there was anything the MPAA could do to stop screener leaks, and ultimately, there wasn’t. The world changed around them and made screeners largely worthless. The Oscar screener appears to be dead and buried for good, but the piracy scene lives on.

And with that, it seems like a good place to wrap this project up. The spreadsheet has all the source data, 21 years of it, with multiple sheets for statistics, charts, and methodology. Let me know if you make any interesting visualizations with it.

Thanks for following along over the years. Ahoy! 🏴‍☠️🍿

Lost Media: Finding Bill Clinton’s “Boxers or Briefs” MTV Moment

Yesterday, I saw a tweet that seemed so obviously wrong, it made me wonder if it was just clickbait.

But after digging through Google, YouTube, Vimeo, DailyMotion, C-SPAN, and the Twitter archives myself, it seemed to be true: an iconic moment of ’90s political pop culture appeared to be completely missing from the internet.

Boxers or Briefs

If you were alive in the early ’90s, there’s a good chance you remember this moment.

During MTV’s “Choose or Lose” campaign coverage in the early ’90s, Bill Clinton promised to return to MTV if elected by the channel’s young voters. As promised, a little over a year into his first term, he appeared on MTV’s Enough is Enough on April 19, 1994, a town hall-style forum with 200 16- to 20-year-olds focused on violence in America, and particularly the 1994 crime bill being debated at the time.

Toward the end of the 90-minute broadcast, during a series of rapid-fire audience questions, 17-year-old Tisha Thompson asked a question that seemed to surprise and embarrass Clinton:

Q. Mr. President, the world’s dying to know, is it boxers or briefs? [Laughter] [Applause]

Clinton: Usually briefs. [Laughter] I can’t believe she did that.

That question got a ridiculously outsized amount of attention at the time. The Washington Post called him the “Commander In Briefs.” It was covered in the New York Times, Baltimore Sun, and countless others. It was the subject of late-night talk show monologues, and Clinton himself joked about it at the White House Correspondents’ Dinner later that week.

Over the following years, the “boxers or briefs” question became the “Free Bird” of the campaign trail, posed to Newt Gingrich (“that is a very stupid question, and it’s stupid for you to ask that question”), Bernie Sanders (“briefs”), and then-candidate Barack Obama: “I don’t answer those humiliating questions. But whichever one it is, I look good in ’em!”

Nearly 30 years later, the original clip is shockingly hard to find online. Someone on Reddit linked to a version of the video on YouTube, but the account was terminated. C-SPAN has a different clip from the show, as well as a searchable transcript, but not the clip itself.

As of right now, before I publish this post, it’s extremely hard to find online — but not impossible, because I found it, and here it is.

How I Found It

Among its voluminous archives of web pages, books, and other media, the Internet Archive stores a huge number of U.S. TV news videos, clips from over 2,470,000 shows since 2009. You can search the closed captions, and view short video clips from the results.

Their search engine is a little quirky, but more than good enough to find several news talk shows that rebroadcast the clip over the last few years, typically to poke fun at Bill Clinton. I searched for the exact quoted phrases from the original interview and found 14 clips that mentioned it from shows like Hardball with Chris Matthews, Tucker Carlson Tonight, and The Situation Room with Wolf Blitzer.

Only one of the clips, from an episode of Up w/Steve Kornacki on March 15, 2014, included the full question and answer without overlaying any graphics on the video.

The Internet Archive will let you stream the short clips, but there’s no option to download the video itself, and unfortunately, it frequently won’t show exactly the moment you’re searching for. (This is probably an issue with alignment of closed captions and source videos.)

That said, you can edit the timestamp in the video URL. Every video result page has a URL with something like “/start/2190/end/2250” in it. These are the start and end timestamps in seconds, which you can adjust manually. (There appears to be a three-minute limit to clip length, so if you get an error, make sure you’re not requesting a clip longer than that.)
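
Here’s a tiny Python sketch of that URL surgery; the show identifier in the example URL is a made-up placeholder, so substitute the clip URL you actually found:

  import re

  def retime(url: str, start: int, end: int) -> str:
      """Rewrite the /start/N/end/N segment of an Internet Archive TV clip URL."""
      if end - start > 180:
          raise ValueError("clips appear to be capped at three minutes")
      return re.sub(r"/start/\d+/end/\d+", f"/start/{start}/end/{end}", url)

  # Placeholder identifier for illustration only
  url = "https://archive.org/details/SOME_SHOW_20140315/start/2190/end/2250"
  print(retime(url, 2160, 2280))  # widen the clip by 30 seconds on each side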

Once you’ve found what you like, you can download it using Chrome Dev Tools:

  1. First, start and then pause the video.
  2. Use Option-Command-C to open the element inspector in Chrome Dev Tools.
  3. Highlight and click on the paused video.
  4. In the Chrome Dev Tools window, right-click on the <video> tag and choose Open In New Tab.
  5. The video will now be in a dedicated tab. Hit Ctrl-S (Command-S on Mac) to save the video, or select “Save Page As…” from the File menu.

Or in Firefox, it’s much easier: just hold shift and right-click to “Save Video.”
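
And once you have the direct video URL from the <video> tag, any HTTP client can save it; here’s a quick Python sketch with a placeholder URL:

  import urllib.request

  # Paste the src attribute copied from the <video> tag here (placeholder shown)
  clip_url = "https://archive.org/download/SOME_SHOW_20140315/clip.mp4"
  urllib.request.urlretrieve(clip_url, "boxers-or-briefs.mp4")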

After that, I just cropped the clip and uploaded it to my site, and YouTube for good measure. If you have a better quality version, please send it to me.

That’s it! The Internet Archive is an amazing and under-utilized resource, poorly indexed in Google, but absolutely filled with incredible things.

Screenshot of Up w/Steve Kornacki broadcast from March 15, 2014

Updates

On May 9, 2023, Paramount announced it was shutting down MTV News permanently, 36 years after it was started. This led to a bunch of tributes and retrospectives, and many of them link to this blog post or the copy of the video I put up on YouTube.

The Hollywood Reporter’s oral history of MTV News talked to Doug Herzog, who was MTV’s first News Director and went on to become president of Viacom Music and Entertainment Group, overseeing all of MTV, VH1, and Comedy Central, among others. He casually dropped this bomb, which I don’t think has ever been reported anywhere:

HERZOG: It’s Choose or Lose — which won a Peabody in 1992 — that ultimately led to, you know, Bill Clinton coming on MTV and talking about “boxers or briefs.” That question was planted by MTV, by the way.

The young woman who asked the question, Tisha Thompson, worked at MTV News, so that tracks! I asked her if she was already an intern or employee at MTV News when she asked the question, which seems likely. I’ll update this post if I get a response.

Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model

Last weekend, Hollie Mengert woke up to an email pointing her to a Reddit thread, the first of several messages from friends and fans, informing the Los Angeles-based illustrator and character designer that she was now an AI model.

The day before, a Redditor named MysteryInc152 posted on the Stable Diffusion subreddit, “2D illustration Styles are scarce on Stable Diffusion, so I created a DreamBooth model inspired by Hollie Mengert’s work.”

Using 32 of her illustrations, MysteryInc152 fine-tuned Stable Diffusion to recreate Hollie Mengert’s style. He then released the checkpoint under an open license for anyone to use. The model uses her name as the identifier for prompts: “illustration of a princess in the forest, holliemengert artstyle,” for example.

Artwork by Hollie Mengert (left) vs. images generated with Stable Diffusion DreamBooth in her style (right)
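
For context, here’s roughly what generating with a DreamBooth checkpoint like this looks like using Hugging Face’s diffusers library. It’s a sketch, not his exact setup: the model path is a placeholder, and it assumes a diffusers-format checkpoint and a CUDA GPU:

  import torch
  from diffusers import StableDiffusionPipeline

  # Placeholder path: point this at a diffusers-format DreamBooth checkpoint
  pipe = StableDiffusionPipeline.from_pretrained(
      "./path-to-dreambooth-checkpoint", torch_dtype=torch.float16
  ).to("cuda")

  # The artist's name serves as the instance token the model was fine-tuned on
  prompt = "illustration of a princess in the forest, holliemengert artstyle"
  image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
  image.save("princess.png")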

The post sparked a debate in the comments about the ethics of fine-tuning an AI on the work of a specific living artist, even as new fine-tuned models are posted daily. The most-upvoted comment asked, “Whether it’s legal or not, how do you think this artist feels now that thousands of people can now copy her style of works almost exactly?”

Great question! How did Hollie Mengert feel about her art being used in this way, and what did MysteryInc152 think about the explosive reaction to it? I spoke to both of them to find out — but first, I wanted to understand more about how DreamBooth is changing generative image AI.


Since its release in late August, I’ve written about the creative potential and complex ethical and legal debates unleashed by the open-source release of Stable Diffusion, explored the billions of images it was trained on, and talked about the data laundering that shields corporations like Stability AI from accountability.

By now, we’ve all heard stories of artists unwillingly finding their work used to train generative AI models, the frustration of being turned into a popular prompt for others to mimic, or Stable Diffusion being used to generate pornographic images of celebrities.

But since its release, Stable Diffusion could really only depict the artists, celebrities, and other notable people who were popular enough to be well-represented in the model training data. Simply put, a diffusion model can’t generate images with subjects and styles that it hasn’t seen very much.


When Stable Diffusion was first released, I tried to generate images of myself, but even though there are a bunch of photos of me online, there weren’t enough for the model to understand what I looked like.

Real photos of me (left) vs. Stable Diffusion output for the prompt “portrait of andy baio” (right)

That’s true of even some famous actors and characters: while it can make a spot-on Mickey Mouse or Charlize Theron, it really struggles with Garfield and Danny DeVito. It knows that Garfield is an orange cartoon cat, and it knows Danny DeVito’s general features and body shape, but not well enough to recognizably render either of them.

On August 26, Google AI announced DreamBooth, a technique for introducing new subjects to a pretrained text-to-image diffusion model, training it with as few as 3-5 images of a person, object, or style.

Google’s researchers didn’t release any code, citing the potential “societal impact” risk that “malicious parties might try to use such images to mislead viewers.”

Nonetheless, 11 days later, an AWS AI engineer released the first public implementation of DreamBooth using Stable Diffusion, open-source and available to everyone. Since then, there have been several dramatic optimizations in speed, usability, and memory requirements, making it quick, easy, and cheap to fine-tune the model on multiple subjects.


Yesterday, I used a simple YouTube tutorial and a popular Google Colab notebook to fine-tune Stable Diffusion on 30 cropped 512×512 photos of me. The entire process, start to finish, took about 20 minutes and cost me about $0.40. (You can do it for free but it takes 2-3 times as long, so I paid for a faster Colab Pro GPU.)

The result felt like I opened a door to the multiverse, like remaking that scene from Everything Everywhere All at Once, but with me instead of Michelle Yeoh.

Sample generations of me as a viking, anime, stained glass, vaporwave, Pixar character, Dali/Magritte painting, Greek statue, muppet, and Captain America

Frankly, it was shocking how little effort it took, how cheap it was, and how immediately fun the results were to play with. Unsurprisingly, a bunch of startups have popped up to make it even easier to DreamBooth yourself, including Astria, Avatar AI, and ProfilePicture.ai.

But, of course, there’s nothing stopping you from using DreamBooth on someone, or something, else.


I talked to Hollie Mengert about her experience last week. “My initial reaction was that it felt invasive that my name was on this tool, I didn’t know anything about it and wasn’t asked about it,” she said. “If I had been asked if they could do this, I wouldn’t have said yes.”

She couldn’t have granted permission to use all the images, even if she wanted to. “I noticed a lot of images that were fed to the AI were things that I did for clients like Disney and Penguin Random House. They paid me to make those images for them and they now own those images. I never post those images without their permission, and nobody else should be able to use them without their permission either. So even if he had asked me and said, can I use these? I couldn’t have told him yes to those.”

She had concerns that the fine-tuned model was associated with her name, in part because it didn’t really represent what makes her work unique.

“What I pride myself on as an artist are authentic expressions, appealing design, and relatable characters. And I feel like that is something that I see AI, in general, struggle with most of all,” Hollie said.

Four of Hollie’s illustrations used to train the AI model (left) and sample AI output (right)

“I feel like AI can kind of mimic brush textures and rendering, and pick up on some colors and shapes, but that’s not necessarily what makes you really hireable as an illustrator or designer. If you think about it, the rendering, brushstrokes, and colors are the most surface-level area of art. I think what people will ultimately connect to in art is a lovable, relatable character. And I’m seeing AI struggling with that.”

“As far as the characters, I didn’t see myself in it. I didn’t personally see the AI making decisions that I would make, so I did feel distance from the results. Some of that frustrated me because it feels like it isn’t actually mimicking my style, and yet my name is still part of the tool.”

She wondered if the model’s creator simply didn’t think of her as a person. “I kind of feel like when they created the tool, they were thinking of me as more of a brand or something, rather than a person who worked on their art and tried to hone things, and that certain things that I illustrate are a reflection of my life and experiences that I’ve had. Because I don’t think if a person was thinking about it that way that they would have done it. I think it’s much easier to just convince yourself that you’re training it to be like an art style, but there’s like a person behind that art style.”

“For me, personally, it feels like someone’s taking work that I’ve done, you know, things that I’ve learned — I’ve been a working artist since I graduated art school in 2011 — and is using it to create art that I didn’t consent to and didn’t give permission for,” she said. “I think the biggest thing for me is just that my name is attached to it. Because it’s one thing to be like, this is a stylized image creator. Then if people make something weird with it, something that doesn’t look like me, then I have some distance from it. But to have my name on it is ultimately very uncomfortable and invasive for me.”


I reached out to MysteryInc152 on Reddit to see if they’d be willing to talk about their work, and we set up a call.

MysteryInc152 is Ogbogu Kalu, a Nigerian mechanical engineering student in New Brunswick, Canada. Ogbogu is a fan of fantasy novels and football, comics and animation, and now, generative AI.

His initial hope was to make a series of comic books, but he knew that doing it on his own would take years, even if he had the writing and drawing skills. When he first discovered Midjourney, he got excited and realized that it could work well for his project, and then Stable Diffusion dropped.

Unlike Midjourney, Stable Diffusion was entirely free, open-source, and supported powerful creative tools like img2img, inpainting, and outpainting. It was nearly perfect, but achieving a consistent 2D comic book style was still a struggle. He first tried hypernetwork style training, without much success, but DreamBooth finally gave him the results he was looking for.

Before publishing his model, Ogbogu wasn’t familiar with Hollie Mengert’s work at all. He was helping another Stable Diffusion user on Reddit who was struggling to fine-tune a model on Hollie’s work and getting lackluster results. He refined the image training set, got to work, and published the results the following day. He told me the training process took about 2.5 hours on a GPU at Vast.ai, and cost less than $2.

Reading the Reddit thread, his stance on the ethics seemed to border on fatalism: the technology is inevitable, everyone using it is equally culpable, and any moral line is completely arbitrary. In the Reddit thread, he debated with those pointing out a difference between using Stable Diffusion as-is and fine-tuning an AI on a single living artist:

There is no argument based on morality. That’s just an arbitrary line drawn on the sand. I don’t really care if you think this is right or wrong. You either use Stable Diffusion and contribute to the destruction of the current industry or you don’t. People who think they can use [Stable Diffusion] but are the ‘good guys’ because of some funny imaginary line they’ve drawn are deceiving themselves. There is no functional difference.

On our call, I asked him what he thought about the debate. His take was very practical: he thinks it’s legal to train and use, likely to be determined fair use in court, and you can’t copyright a style. Even though you can recreate subjects and styles with high fidelity, the original images themselves aren’t stored in the Stable Diffusion model, with over 100 terabytes of images used to create a tiny 4 GB model. He also thinks it’s inevitable: Adobe is adding generative AI tools to Photoshop, Microsoft is adding an image generator to their design suite. “The technology is here, like we’ve seen countless times throughout history.”

Toward the end of our conversation, I asked, “If it’s fair use, it doesn’t really matter in the eye of the law what the artist thinks. But do you think, having done this yourself and released a model, if they don’t find it flattering, should the artist have any say in how their work is used?”

He paused for a few seconds. “Yeah, that’s… that’s a different… I guess it all depends. This case is rather different in the sense that it directly uses the work of the artists themselves to replace them.” Ultimately, he thinks many of the objections to it are a misunderstanding of how it works: it’s not a form of collage, it’s creating new images and clearly transformative, more like “trying to recall a vivid memory from your past.”

“I personally think it’s transformative,” he concluded. “If it is, then I guess artists won’t really have a say in how these models get written or not.”


As I was playing around with the model trained on myself, I started thinking about how cheap and easy it was to make. In the short term, we’re going to see fine-tuned models for anything you can imagine: there are over 700 models in the Concepts Library on HuggingFace so far, and trending on Reddit in the last week alone are models based on classic Disney animated films, modern Disney animated films, Tron: Legacy, Cyberpunk: Edgerunners, K-pop singers, and Kurzgesagt videos.

Images generated using the “Classic Animation” DreamBooth model trained on Disney animated films

Aside from the IP issues, it’s absolutely going to be used by bad actors: models fine-tuned on images of exes, co-workers, and, of course, popular targets of online harassment campaigns. Combining those with any of the emerging NSFW models trained on large corpuses of porn is a disturbing inevitability.

DreamBooth, like most generative AI, has incredible creative potential, as well as incredible potential for harm. Missing in most of these conversations is any discussion of consent.


The day after we spoke, Ogbogu Kalu reached out to me through Reddit to see how things went with Hollie. I said she wasn’t happy about it, that it felt invasive and she had concerns about it being associated with her name. If asked for permission, she would have said no, but she also didn’t own the rights to several of the images and couldn’t have given permission even if she wanted to.

“I figured. That’s fair enough,” he responded. “I did think about using her name as a token or not, but I figured since it was a single artist, that would be best. Didn’t want it to seem like I was training on an artist and obscuring their influence, if that makes sense. Can’t change that now unfortunately but I can make it clear she’s not involved.”

Two minutes later, he renamed the Huggingface model from hollie-mengert-artstyle to the more generic Illustration-Diffusion, and added a line to the README, “Hollie is not affiliated with this.”

Two days later, he released a new model trained on 40 images by concept and comic book artist James Daly III.

Art by James Daly III (left) vs. images generated by Stable Diffusion fine-tuned on his work

Winding Down Skittish

After we cancelled XOXO in the early days of the pandemic, I spent much of 2020 wondering if there was any way to recreate the unique experience of a real-world festival like XOXO online: the serendipity of meeting new people while running between venues, getting drawn into conversations with strangers and old friends, stepping in and out of conversations as simply as walking away.

Those explorations ended up in Skittish, a colorful and customizable browser-based 3D world that let you talk to others nearby with your voice while controlling a goofy animal character. It was weird and silly and I loved it.

Sadly, for a number of reasons, I’ve decided to shut it down this December.

There isn’t enough demand to keep it going, likely a combination of the overall decline in virtual events and Skittish’s niche feature set, and it requires ongoing support and maintenance that’s hard for me to provide on my own.

An Exciting Beta

I’m exceedingly proud of what we built with such tight resources:

  • A 3D environment in the browser, fully functional on desktop and mobile browsers
  • High-quality spatial audio based on your location
  • Built-in editor for collaborative world editing
  • Public and private audio spaces
  • Embedded videos and livestreams
  • Roles for organizers, speakers, editors
  • Chat with moderation tools
  • Subscription payments integration with Stripe

Throughout most of last year, we ran events large and small as part of an invite-only beta. Festivals, conferences, workshops, a summer camp, and meetups of all sizes. In that first year, it got some great press from TechCrunch, The Verge, and others, and feedback was incredibly positive. It seemed like we built something that was niche, but unique and special.

A Rocky Launch

The problems started shortly after our public launch in November 2021, when we opened signups to everyone.

Two weeks in, our audio provider, High Fidelity, dealt us some devastating news: they were shutting down the spatial audio API that made Skittish possible — with only six weeks’ notice.

We scrambled to negotiate an extension with them through January 2022, eventually hosting the service on our own AWS infrastructure at great expense to prevent a disruption of service while we migrated to Dolby.io, the only other service that provided a spatial audio API.

As we were tied up with these infrastructure changes for over two months, the winds were shifting.

Waning Demand

Starting in late 2021, covid restrictions lifted virtually everywhere, and despite the pandemic raging on, event organizers were resuming in-person events.

Consequently, the demand for virtual events dropped off a cliff. This is an industry-wide trend: Hopin cut 30% of their staff, only four months after a hiring freeze and major round of layoffs. Gather laid off a third of staff and announced they were pivoting away from virtual events entirely, focusing solely on remote work collaboration. It seems like many platforms are doing the same, or quietly folding.

I still believe there are huge benefits to virtual events, especially for those devoting the time to building thoughtful social spaces like Roguelike Celebration, but the demand for virtual event platforms like ours feels very low right now, partly because we built something niche, but also simply because many people just want to meet in person now, regardless of health risk.

As our revenue dried up, we also ran out of cash. Skittish was initially funded from a grant provided by Grant for the Web, and I considered doing a small fundraise, but with the future looking so uncertain, I decided it wasn’t worth the risk.

Skittish doesn’t cost much to run without contractors, but it’s still losing money. And frankly, I’m not well-equipped to adequately support it and continue development entirely by myself.

Calling It Quits

So, as much as I love it, Skittish is winding down. I’ve already disabled upgrading to paid plans, and will disable signups on December 14. If you have unusual circumstances and need access to it longer, get in touch.

I always knew there would be risk building something like this during the pandemic. Fortunately, I built it in a way where nobody would be burned: we fully honored the terms of our grant funding, it never had investors, never took on debt, never had employees, and I’ve made sure no paying customer will be negatively affected.

I’m extremely grateful to Grant for the Web for the initial grant funding, all the events and organizations that used Skittish over the last two years, and everyone who worked on it — but especially Simon Hales for his herculean effort on every part of the platform.

Thanks for playing!

The very busy Flash Forward wrap party

AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability

Yesterday, Meta’s AI Research Team announced Make-A-Video, a “state-of-the-art AI system that generates videos from text.”

Like he did for the Stable Diffusion data, Simon Willison created a Datasette browser to explore WebVid-10M, one of the two datasets used to train the video generation model, and quickly learned that all 10.7 million video clips were scraped from Shutterstock, watermarks and all.

In addition to the Shutterstock clips, Meta also used 10 million video clips from this 100M video dataset from Microsoft Research Asia. It’s not mentioned on their GitHub, but if you dig into the paper, you learn that every clip came from over 3 million YouTube videos.

So, in addition to a massive chunk of Shutterstock’s video collection, Meta is also using millions of YouTube videos collected by Microsoft to make its text-to-video AI.

Non-Commercial Use

The academic researchers who compiled the Shutterstock dataset acknowledged the copyright implications in their paper, writing, “The use of data collected for this study is authorised via the Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study.”

But then Meta is using those academic non-commercial datasets to train a model, presumably for future commercial use in their products. Weird, right?

Not really. It’s become standard practice for technology companies working with AI to commercially use datasets and models collected and trained by non-commercial research entities like universities or non-profits.

In some cases, they’re directly funding that research.

For example, many people believe that Stability AI created the popular text-to-image AI generator Stable Diffusion, but they only funded its development by the Machine Vision & Learning research group at the Ludwig Maximilian University of Munich. In their repo for the project, the LMU researchers thank Stability AI for the “generous compute donation” that made it possible.

The massive image-text caption datasets used to train Stable Diffusion, Google’s Imagen, and the text-to-image component of Make-A-Video weren’t made by Stability AI either. They all came from LAION, a small nonprofit organization registered in Germany. Stability AI directly funds LAION’s compute resources, as well.

Shifting Accountability

Why does this matter? Outsourcing the heavy lifting of data collection and model training to non-commercial entities allows corporations to avoid accountability and potential legal liability.

It’s currently unclear if training deep learning models on copyrighted material is a form of infringement, but it’s a harder case to make if the data was collected and trained in a non-commercial setting. One of the four factors of the “fair use” exception in U.S. copyright law is the purpose or character of the use. In their Fair Use Index, the U.S. Copyright Office writes:

“Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair.”

A federal court could find that the data collection and model training copied copyrighted works, but falls under fair use because it was conducted by a university and a nonprofit.

Meanwhile, a company like Stability AI would be free to commercialize that research in their own DreamStudio product, or however else they choose, taking credit for its success to raise a rumored $100M funding round at a valuation upwards of $1 billion, while shifting any questions around privacy or copyright onto the academic/nonprofit entities they funded.

This academic-to-commercial pipeline abstracts away ownership of data models from their practical applications, a kind of data laundering where vast amounts of information are ingested, manipulated, and frequently relicensed under an open-source license for commercial use.

Unforeseen Consequences

Years ago, like many people, I used to upload my photos to Flickr with a Creative Commons license that required attribution and allowed non-commercial use. Yahoo released a database of 100 million of those Creative Commons-licensed images for academic research, to help the burgeoning field of AI. Researchers at the University of Washington took 3.5 million of the Flickr photos containing faces, representing over 670,000 people (including me), and released the MegaFace dataset, part of a research competition sponsored by Google and Intel.

I was happy to let people remix and reuse my photos for non-commercial use with attribution, but that’s not how they were used. Instead, academic researchers took the work of millions of people, stripped it of attribution against its license terms, and redistributed it to thousands of groups, including corporations, military agencies, and law enforcement.

In their analysis for Exposing.ai, Adam Harvey and Jules LaPlace summarized the impact of the project:

[The] MegaFace face recognition dataset exploited the good intentions of Flickr users and the Creative Commons license system to advance facial recognition technologies around the world by companies including Alibaba, Amazon, Google, CyberLink, IntelliVision, N-TechLab (FindFace.pro), Mitsubishi, Orion Star Technology, Philips, Samsung, SenseTime, Sogou, Tencent, and Vision Semantics to name only a few. According to the press release from the University of Washington, “more than 300 research groups [were] working with MegaFace” as of 2016, including multiple law enforcement agencies.

That dataset was used to build the facial recognition AI models that now power surveillance tech companies like Clearview AI, in use by law enforcement agencies around the world, as well as the U.S. Army. The Chinese government has used it to train their surveillance systems. As the New York Times reported last year:

MegaFace has been downloaded more than 6,000 times by companies and government agencies around the world, according to a New York Times public records request. They included the U.S. defense contractor Northrop Grumman; In-Q-Tel, the investment arm of the Central Intelligence Agency; ByteDance, the parent company of the Chinese social media app TikTok; and the Chinese surveillance company Megvii.

The University of Washington eventually decommissioned the dataset and no longer distributes it. I don’t think any of those researchers, or even the people at Yahoo who decided to release the photos in the first place, ever foresaw how it would later be used.

408 of about 4,753,520 face images from the MegaFace face recognition training and benchmarking dataset. Visualization by Adam Harvey of Exposing.ai.

They were motivated to push AI forward and didn’t consider the possible repercussions. They could have made inclusion into the dataset opt-in, but they didn’t, probably because it would’ve been complicated and the data wouldn’t have been nearly as useful. They could have enforced the license and restricted commercial use of the dataset, but they didn’t, probably because it would have been a lot of work and probably because it would have impacted their funding.

Asking for permission slows technological progress, but it’s hard to take back something you’ve unconditionally released into the world.


As I wrote about last month, I’m incredibly excited about these new AI art systems. The rate of progress is staggering, with three stunning announcements yesterday alone: aside from Meta’s Make-A-Video, there was also DreamFusion for text-to-3D synthesis and Phenaki, another text-to-video model capable of making long videos with prompts that change over time.

But I grapple with the ethics of how they were made and the lack of consent, attribution, or even an opt-out for their training data. Some are working on this, but I’m skeptical: once a model is trained on something, it’s nearly impossible for it to forget. (At least for now.)

Like with the artists, photographers, and other creators found in the 2.3 billion images that trained Stable Diffusion, I can’t help but wonder how the creators of those 3 million YouTube videos feel about Meta using their work to train their new model.

A mysterious voice is haunting American Airlines’ in-flight announcements and nobody knows how

Here’s a little mystery for you: there are multiple reports of a mysterious voice grunting, moaning, and groaning on American Airlines’ in-flight announcement systems, sometimes lasting the duration of the flight — and nobody knows who’s responsible or how they did it.

Actor/producer Emerson Collins was the first to post video, from his Denver flight on September 6:

Here’s an MP3 of the audio with just the groans, moans, and grunts, with some of the background noise filtered out.

This is the only video evidence so far, but Emerson is one of several people who have experienced this on multiple different American Airlines flights. This thread from JonNYC collected several different reports from airline employees and insiders, on both Airbus A321 and Boeing 737-800 planes.

Other people have reported similar experiences, always on American Airlines, going as far back as July. Every known incident has gone through the greater Los Angeles area (including Santa Ana) or Dallas-Fort Worth. Here are all the incidents I’ve seen so far, in chronological order:

  • July – American Airlines, JFK to LAX. Bradley P. Allen wrote, “My wife and I experienced this during an AA flight in July. To be clear, it was just sounds like the moans and groans of someone in extreme pain. The crew said that it had happened before, and had no explanation. Occurred briefly 3 or 4 times early in the flight, then stopped.” (Additional flight details via the LA Times.)
  • August 5 – American Airlines 117. JFK to LAX. Wendy Wanderman wrote, “It happened on my flight August 5 from JFK to LAX and it was an older A321 that I was on. It was Flight 117. There was flight crew that was on the same plane a couple days earlier and the same thing happened. It was funny and unsettling.”
  • September 6 – American Airlines. Santa Ana, CA to Dallas-Fort Worth. Emerson Collins’ flight. “These sounds started over the intercom before takeoff and continued throughout the flight. They couldn’t stop it, and after landing still had no idea what it was… I filmed about fifteen minutes, then again during service. It was calmer for a while mid flight.”
  • Mid-September – American Airlines, Airbus A320. Orlando, FL to Dallas-Fort Worth. Doug Boehner wrote, “This happened to me last week. It wasn’t the whole flight, but periodically weird phrases and sounds. Then a huge ‘oh yeah’ when we landed. We thought the pilot left his mic open.”
  • September 18 – American Airlines 1631, Santa Ana, CA to Dallas-Fort Worth. Boeing 737-800. An anonymous report passed on by JonNYC, “Currently on AA1631 and someone keeps hacking into the PA and making moaning and screaming sounds 😨 the flight attendants are standing by their phones because it isn’t them and the captain just came on and told us they don’t think the flight systems are compromised so we will finish the flight to DFW. Sounded like a male voice and wouldn’t last more than 5-10 seconds before stopping. And has [intermittently] happened on and off all flight long.” (And here’s a second person on the same flight.)

Interestingly, JonNYC followed up with the person who reported the incident on September 18 and asked if it sounded like the same voice in the video. “Very very similar. Same voice! But ours was less aggressive. Although their volume might have been turned up more making it sound more aggressive. 100% positive same voice.”

Official Response

View from the Wing’s Gary Leff asked American Airlines about the issue, and their official response is that it’s a mechanical issue with the PA amplifier. The LA Times followed up on Saturday, with slightly more information:

“Our maintenance team thoroughly inspected the aircraft and the PA system and determined the sounds were caused by a mechanical issue with the PA amplifier, which raises the volume of the PA system when the engines are running,” said Sarah Jantz, a spokesperson for American.

Jantz said the P.A. systems are hardwired with no external access and no Wi-Fi component. The airline’s maintenance team is reviewing the additional reports. Jantz did not respond to questions about how many reports it has received and whether the reports are from different aircraft.

This explanation feels incomplete to me. How can an amplifier malfunction broadcast what sounds like a human voice without external access? On multiple flights and aircraft? They seem to be saying the source is artificial, but has anyone heard artificial noise that sounds this human?

Why This Is So Bizarre

By nature, passenger announcement systems on planes are hardwired, closed systems, making them incredibly difficult to hack. Professional reverse engineer/hardware hacker/security analyst Andrew Tierney (aka Cybergibbons) dug up the Airbus A321 documents in this thread.

“And on the A321 documents we have, the passenger announcement system and interphone even have their own handsets. Can’t see how IFE or WiFi would bridge,” Tierney wrote. “Also struggling to see how anyone could pull a prank like this.”

This report found by aviation watchdog JonNYC, posted by a flight attendant on an internal American Airlines message board, points to some sort of as-yet-undiscovered remote exploit.

We also know that, at least on Emerson Collins’ flight, there was no in-seat entertainment, eliminating that as a possible exploit vector.

Theories

So, how could this happen? There are a handful of theories, but they’re very speculative.

Medical Intercom

The first to emerge was this now-debunked theory from “a former avionics guy” posting in r/aviation on Reddit:

The most likely culprit IMHO is the medical intercom. There are jacks mounted in the overhead bins at intervals down the full length of the airplane that have both receive, transmit and key controls. All somebody would need to do is plug a homemade dongle with a Bluetooth receiver into one of those, take a trip to the lav and start making noises into a paired mic.

The fact that the captain’s announcements are overriding (ducking) it but the flight attendants aren’t is also an indication it’s coming from that system.

If this was how it was done, there’s no reason the prankster would need to hide in the bathrooms: they could trigger a soundboard or prerecorded audio track from their seat.

However, this theory is likely a dead end. JonNYC reports that an anonymous insider confirmed they no longer exist on American Airlines flights. And even if they existed, the medical intercoms didn’t patch into the announcement system. They only allow flight crew to talk to medical staff on the ground.

Pre-Recorded Audio Message Bug

Another theory, also courtesy of JonNYC, is that there’s an issue with the pre-recorded audio messages (“PRAM”), which were replaced in the last 60 days, within the timeframe of all these incidents. Perhaps some test audio was added to the end of a message, maybe by an engineer who worked on it, and it’s accidentally playing that extra audio?

Artificial Noise

Finally, some firmly believe that it’s not a human voice at all, but artificial noise or audio feedback filtered through the announcement system.

Nick Anderegg, an engineer with a background in linguistics and phonology, says it’s the result of “random signal passed through a system that extracts human voices.”

Anderegg points to a sound heard at the 1:20 mark in Emerson’s video, a “sweep across every frequency,” as evidence that American Airlines’ explanation is accurate.

Personally, I struggle with this explanation. The utterances heard during Emerson’s three-hour flight are so wildly different, from groans and grunts to moans and shouts, that it’s difficult to imagine them as anything but human. It’s far from impossible, but I’d love to see anyone try to recreate these sounds with random noise or feedback.

Any Other Ideas?

Any other theories about how this might be possible? I’d love to hear them, and I’ll keep this post updated. My favorite theory so far:

Photoillustration of an American Airlines jet headed towards a scared cartoon cloud

Online Art Communities Begin Banning AI-Generated Images

As AI-generated art platforms like DALL-E 2, Midjourney, and Stable Diffusion explode in popularity, online communities devoted to sharing human-generated art are forced to make a decision: should AI art be allowed?

Collage of dozens of images made with Stable Diffusion, indexed by Lexica

On Sunday, popular furry art community Fur Affinity announced that AI-generated art was not allowed because it “lacked artistic merit.” (In July, one AI furry porn generator was uploading one image every 40 seconds before it was banned.) Their new guidelines are very clear:

Content created by artificial intelligence is not allowed on Fur Affinity.

AI and machine learning applications (DALL-E, Craiyon) sample other artists’ work to create content. That content generated can reference hundreds, even thousands of pieces of work from other artists to create derivative images.

Our goal is to support artists and their content. We don’t believe it’s in our community’s best interests to allow AI generated content on the site.

Last year, the 27-year-old art/animation portal Newgrounds banned images made with Artbreeder, a tool for “breeding” GAN-generated art. Late last month, Newgrounds rewrote their guidelines to explicitly disallow images generated by the new generation of AI art platforms:

AI-generated art is not allowed in the Art Portal. This includes using tools such as Midjourney, Dall-E, and Craiyon, in addition fractal generators and websites like ArtBreeder, where the user selects two images and they are combined into a new image via machine learning.

There are cases where some use of AI is ok, for example if you are primarily showcasing your character art but use an AI-generated background. In these cases, please note any elements where AI was used so that it is clear to users and moderators.

Tracing and coloring over AI-generated art is something best shared on your blog, as it is much like tracing over someone else’s art.

Bottom line: We want to keep the focus on art made by people and not have the Art Portal flooded with computer-generated art.

It’s not just long-running online communities: InkBlot is a budding art platform, funded on Kickstarter in 2021, that went into open beta just this week. They’ve already adopted a “no tolerance” policy against AI art and are updating their terms of service to exclude it.


Platforms that haven’t taken a stand are now facing public pressure to clarify their policies.

DeviantArt is one of the most popular online art communities, and increasingly, members are complaining that their feeds are getting flooded with AI-generated art. One of the most popular threads in their forums right now asks the staff to “combat AI art” by limiting daily uploads, segregating it under a special category, or banning it entirely.

ArtStation has also been quiet as AI-generated images grow in popularity there. “Trending on ArtStation” is one of the most popular prompts for AI art because of the particular aesthetic and quality of work found there, nudging the AI toward the look of the work scraped from the site and setting up a future ouroboros in which AI models are trained on AI-generated art posted there.


However you feel about the ethics of AI art, online art communities are facing a very real problem of scale: AI art can be created orders of magnitude faster than traditional human-made art. A powerful GPU can generate thousands of images an hour, even while you sleep.

Lexica, a search engine that solely indexed images from Stable Diffusion’s beta tests in Discord, holds over 10 million images. It would take a lifetime to explore them all, a corpus made by a relatively small group of beta testers in a few weeks.

Left unchecked, it’s not hard to imagine AI art crowding out illustrations that took days or weeks for someone to make.

To keep their communities active, community admins and moderators will have to decide what to do with AI art: allow it, segregate it, or ban it entirely.

Perfect Tides, a Coming-of-Age Point-and-Click Adventure, Kickstarts a Sequel

There’s no shortage of amazing games so far this year, but my personal favorite is an underdog: Perfect Tides, a ’90s-esque point-and-click adventure about growing up as a teen on a sleepy island resort town in the early 2000s, finding an escape from real-life feelings of loneliness and loss in discussion forums and late-night AIM chats.

Mara and her friend Lily on the beach… definitely not on drugs

The first game from Meredith Gran, creator of the decade-long comic series Octopus Pie, it approaches challenging subjects with the confidence of someone who created narrative comics every week for ten years. I can’t think of another comics artist who has dived into game design like this, but it pays off with uniquely charming pixel art and animation, colorful writing, and a story that genuinely moved me by the end. It navigates complex feelings about family, old friends, and new loves, while also being genuinely funny.

Let’s put it this way: I’ve been playing videogames for the last 35 years, but Perfect Tides is the first time I felt compelled to write a walkthrough (spoilers!) and actively participate in forums to help people finish it.

This is a long way of saying that you should play Perfect Tides on Steam or Itch, and then go back the Kickstarter for its sequel, Perfect Tides: Station to Station, which has only six days to go and still needs another $20,000 to cross the finish line. (Update: It hit the goal!)

you should probably go back this project

But you don’t have to take my word for it! Kotaku said the original game was “one of the year’s best,” the “kind of game you don’t even see coming, yet turns out to be incredible” and “perfectly captures the intensity and struggle of adolescence.” AV Club called it a “harrowing, funny, beautiful, horrifying, and ultimately reassuring work of art.” Polygon summed it up as “devastatingly honest.” My favorite review was from Buried Treasure’s John Walker, who wrote, “It is the most extraordinary exploration of what it is to be a teenager, told with such heart, such truth.”

Spoilers Ahoy

If you’ve already played Perfect Tides, I want to mention two key moments that are so wonderful, yet so easy to miss on a first playthrough, that they’re worth replaying the game for. THESE ARE SPOILERS!

First, if you didn’t manage to patch things up with Lily, you missed a long sequence with her in the final season of the game. (To get the full experience of that sequence, you’ll need to find a specific MP3 and put it into the game directory when prompted: a remarkable breaking-the-fourth-wall sidestep around copyright licensing that I’ve never seen in a game before.)

Second, there are two major endings. If it feels anticlimactic, you likely didn’t resolve your conflicts with Lily, Simon, and your family. There are 95 possible points, but you don’t need them all to get the best ending. Feel free to use my 100% completion guide for help getting there.

Perfect Tides isn’t perfect. Like any classic point-and-click adventure, there are some clunky bits here and there, and you’ll likely need the occasional hint or glance at a playthrough to finish. But it’s so worth it.

Mara talks to an online friend

Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator

One of the biggest frustrations of text-to-image generation AI models is that they feel like a black box. We know they were trained on images pulled from the web, but which ones? As an artist or photographer, an obvious question is whether your work was used to train the AI model, but this is surprisingly hard to answer.

Sometimes, the data isn’t available at all: OpenAI has said it’s trained DALL-E 2 on hundreds of millions of captioned images, but hasn’t released the proprietary data. By contrast, the team behind Stable Diffusion has been very transparent about how the model is trained. Since it was released publicly last week, Stable Diffusion has exploded in popularity, in large part because of its free and permissive licensing; it’s already been incorporated into the new Midjourney beta, NightCafe, and Stability AI’s own DreamStudio app, and it can run on your own computer.

But Stable Diffusion’s training datasets are impossible for most people to download, let alone search, with metadata for millions (or billions!) of images stored in obscure file formats in large multipart archives.

So, with the help of my friend Simon Willison, we grabbed the data for over 12 million images used to train Stable Diffusion, and used his Datasette project to make a data browser for you to explore and search it yourself. Note that this is only a small subset of the total training data: about 2% of the 600 million images used to train the most recent three checkpoints, and only 0.5% of the 2.3 billion images that it was first trained on.
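If you want to poke at the raw metadata yourself, the general approach is easy to reproduce: LAION distributes the metadata as Parquet files, which you can load into a local SQLite database with a few lines of Python. Here’s a minimal sketch, not our exact pipeline; the filename, table name, and column layout below are placeholders.

# Rough sketch: load LAION metadata from a Parquet file into SQLite.
# "aesthetics_6plus.parquet" is a placeholder filename; needs pandas + pyarrow.
import sqlite3
import pandas as pd

df = pd.read_parquet("aesthetics_6plus.parquet")   # image URLs, captions, scores
conn = sqlite3.connect("laion-aesthetic.db")
df.to_sql("images", conn, if_exists="replace", index=False)

# Sanity check: how many rows made it in?
print(conn.execute("SELECT COUNT(*) FROM images").fetchone()[0])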

Screenshot of the LAION-Aesthetic data browser, showing results from a search for Swedish artist Simon Stålenhag with thumbnail images

Go try it right now at laion-aesthetic.datasette.io! (Update: It’s now offline. See below for details.)

Read on to learn about how this dataset was collected, the websites it most frequently pulled images from, and the artists, famous faces, and fictional characters most frequently found in the data.

Data Source

Stable Diffusion was trained on three massive datasets collected by LAION, a nonprofit whose compute time was largely funded by Stable Diffusion’s owner, Stability AI.

All of LAION’s image datasets are built from Common Crawl, a nonprofit that scrapes billions of webpages monthly and releases the results as massive datasets. LAION collected all HTML image tags that had alt-text attributes, classified the resulting 5 billion image-text pairs by language, and then filtered them into separate datasets using their resolution, a predicted likelihood of having a watermark, and a predicted “aesthetic” score (i.e. subjective visual quality).
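As a toy illustration of that filtering step (the thresholds and field names below are invented for the example, not LAION’s actual values):

# Toy example of LAION-style metadata filtering.
# All thresholds and field names here are illustrative, not LAION's real ones.
def keep_image(meta):
    return (
        meta["lang"] == "en"                            # English-captioned subset
        and min(meta["width"], meta["height"]) >= 256   # drop very low-resolution images
        and meta["pwatermark"] < 0.8                    # predicted watermark probability
        and meta["aesthetic"] >= 5.0                    # predicted "aesthetic" score
    )

print(keep_image({"lang": "en", "width": 512, "height": 512,
                  "pwatermark": 0.1, "aesthetic": 6.2}))   # True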

Collage of some of the images with the highest “aesthetic” score, largely watercolor landscapes and portraits of women

Stable Diffusion’s initial training was on low-resolution 256×256 images from LAION-2B-EN, a set of 2.3 billion English-captioned images from LAION-5B’s full collection of 5.85 billion image-text pairs, as well as LAION-High-Resolution, another subset of LAION-5B with 170 million images greater than 1024×1024 resolution (downsampled to 512×512).

Its last three checkpoints were trained on LAION-Aesthetics v2 5+, a 600-million-image subset of LAION-2B-EN with a predicted aesthetics score of 5 or higher, with low-resolution and likely-watermarked images filtered out.

For our data explorer, we originally wanted to show the full dataset, but it’s a challenge to host a 600 million record database in an affordable, performant way. So we decided to use the smaller LAION-Aesthetics v2 6+, which includes 12 million image-text pairs with a predicted aesthetic score of 6 or higher, instead of the 600 million rated 5 or higher used in Stable Diffusion’s training.

This should be a representative sample of the images used to train Stable Diffusion’s last three checkpoints, but skewed towards more aesthetically attractive images. Note that LAION provides a useful frontend to search the CLIP embeddings computed from their 400M and 5 billion image datasets, but it doesn’t allow you to search the original captions.

Source Domains

We know the captioned images used for Stable Diffusion were scraped from the web, but from where? We indexed the 12 million images in our sample by domain to find out.
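The tally itself is just a count over the hostname of each image URL. Here’s a minimal sketch against a local SQLite copy, assuming an images table with a url column (both names are assumptions, not necessarily how our database is laid out):

# Count images per source domain.
# Assumes a local SQLite copy with an "images" table and a "url" column.
import sqlite3
from collections import Counter
from urllib.parse import urlparse

conn = sqlite3.connect("laion-aesthetic.db")
domains = Counter(
    urlparse(url).netloc                      # e.g. "i.pinimg.com"
    for (url,) in conn.execute("SELECT url FROM images")
)
for domain, count in domains.most_common(10):
    print(domain, count)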

Nearly half of the images, about 47%, were sourced from only 100 domains, with the largest number of images coming from Pinterest. Over a million images, or 8.5% of the total dataset, are scraped from Pinterest’s pinimg.com CDN.

User-generated content platforms were a huge source of image data. WordPress-hosted blogs on wp.com and wordpress.com represented 819k images together, or 6.8% of all images. Other photo, art, and blogging sites included 232k images from Smugmug, 146k from Blogspot, 121k from Flickr, 67k from DeviantArt, 74k from Wikimedia, 48k from 500px, and 28k from Tumblr.

Shopping sites were well-represented. The second-biggest domain was Fine Art America, which sells art prints and posters, with 698k images (5.8%) in the dataset. 244k images came from Shopify, 189k each from Wix and Squarespace, 90k from Redbubble, and just over 47k from Etsy.

Unsurprisingly, a large number came from stock image sites: 123RF was the biggest with 497k images, 171k came from Adobe Stock’s CDN at ftcdn.net, 117k from PhotoShelter, 35k from Dreamstime, 23k from iStockPhoto, 22k from Depositphotos, 22k from Unsplash, 15k from Getty Images, 10k from VectorStock, and 10k from Shutterstock, among many others.

It’s worth noting, however, that domains alone may not represent the actual sources of these images. For instance, there are only 6,292 images sourced from Artstation.com’s domain, but another 2,740 images with “artstation” in the caption text hosted by sites like Pinterest.

Artists

We wanted to understand how artists are represented in the dataset, so we used the list of over 1,800 artists in MisterRuffian’s Latent Artist & Modifier Encyclopedia to search the dataset and count the number of images that reference each artist’s name. You can browse and search those artist counts here, or try searching for any artist in the images table. (Searching with quoted strings is recommended.)
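If you’re working from a local copy rather than the hosted browser, those counts boil down to substring matches over the captions. Here’s a minimal sketch, assuming an images table with a text column holding each caption (again, the names are assumptions); for 12 million rows, a proper full-text index would be much faster than LIKE:

# Count how many captions mention a few artist names.
# Assumes an "images" table with a "text" caption column.
import sqlite3

conn = sqlite3.connect("laion-aesthetic.db")
for artist in ("Thomas Kinkade", "Greg Rutkowski", "James Gurney"):
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM images WHERE text LIKE ?",
        (f"%{artist}%",),
    ).fetchone()
    print(artist, count)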

Of the top 25 artists in the dataset, only three are still living: Phil Koch, Erin Hanson, and Steve Henderson. The most frequent artist in the dataset? The Painter of Light™ himself, Thomas Kinkade, with 9,268 images.

From a list of 1,800 popular artists, the top 10 found most frequently in the captioned images

Using the “type” field in the database, you can see the most frequently-found artists in each category: for example, looking only at comic book artists, Stan Lee’s name is found most often in the image captions. (As one commenter pointed out, Stan Lee was a comic book writer, not an artist, but people are using his name to generate images in the style of comic book art he was associated with.)

Some of the most-cited recommended artists used in AI image prompting aren’t as pervasive in the dataset as you’d expect. There are only 15 images that mention fantasy artist Greg Rutkowski, whose name is frequently used as a prompt modifier, and only 73 from James Gurney.

(It’s worth saying again that these images are just a subset of one of three datasets used to train the AI, so an artist’s work may have been used elsewhere in the data even if they’re not found in these 12M images.)

Famous People

Unlike DALL-E 2, Stable Diffusion doesn’t have any limitations on generating images of people named in the dataset. To get a sense of how well-represented well-known people are in the dataset, we took two lists of celebrities and other famous names and merged them into a single list of nearly 2,000 names. You can see the results of those celebrity counts here, or search for any name in the images table. (Obviously, some of the top searches like “Pink” and “Prince” include results that don’t refer to that person.)

Donald Trump is one of the most cited names in the image dataset, with nearly 11,000 photos referencing his name. Charlize Theron is a close runner-up with 9,576 images.

A full gender breakdown would take more time, but at a glance, it seems like many of the most popular names in the dataset are women.

Strangely, enormously popular internet personalities like David Dobrik, Addison Rae, Charli D’Amelio, Dixie D’Amelio, and MrBeast don’t appear in the captions from the dataset at all. My hunch was that the Common Crawl data was too old to include these more recent celebrities, but based on the URLs, there are tens of thousands of images from last year in the data. (If you can solve this mystery, get in touch or leave a comment!)

Fictional Characters

Next, we took a look at how popular fictional characters are represented in the dataset, since this is subject matter that’s enormously popular to generate with Stable Diffusion and Craiyon, but often impossible with DALL-E 2, as you can see in this Mickey Mouse example from my previous post.

For this set of searches, we used this list of 600 fictional characters from pop culture to search the image dataset. You can browse the results here, or search for any other character in the images table. (Again, be aware that one-word character names like “Link,” “Data,” and “Mario” are likely to have many more results unrelated to that character.)

Characters from the MCU like Captain Marvel (4,993 images), Black Panther (4,395), and Captain America (3,155) are some of the best represented in the dataset. Batman (2,950) and Superman (2,739) are neck and neck. Luke Skywalker (2,240) has more images than Darth Vader (1,717) and Han Solo (1,013). Mickey Mouse barely breaks the top 100 with 520 images.

NSFW Content

Finally, let’s take a brief look at the representation of adult material, another huge difference between Stable Diffusion and models like DALL-E 2. OpenAI rigorously removed sexual and violent content from its training data and blocked potentially NSFW keywords from prompts.

The Stable Diffusion team built a predictor for adult material and assigned every image an NSFW probability score, which you can see in the “punsafe” field in the images table, ranging from 0 to 1. (Warning: Obviously, sorting by that field will show the most NSFW images in the dataset.)
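To get a rough sense of the distribution without looking at the images themselves, you can bucket the punsafe scores directly. Another minimal sketch against a local copy, with the same assumed table layout as above:

# Count images at or above a few punsafe thresholds (0 = safe, 1 = unsafe).
import sqlite3

conn = sqlite3.connect("laion-aesthetic.db")
for threshold in (0.5, 0.9, 0.9999, 1.0):
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM images WHERE punsafe >= ?",
        (threshold,),
    ).fetchone()
    print(f"punsafe >= {threshold}: {count}")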

In their announcement of the full LAION-5B dataset, LAION team member Romain Beaumont estimated that about 2.9% of the English-language images were “unsafe,” but browsing this dataset, it’s not clear how their predictor defined that.

There’s definitely NSFW material in the image dataset, but surprisingly little of it. Only 222 images got a “1” unsafe probability score, indicating 100% confidence that it’s unsafe, about 0.002% of the total images — and those are definitely porn. But nudity seems to be unusual outside of that confidence level: even images with a 0.9999 punsafe score (99.99% confidence) rarely have nudity in them.

It’s plausible that filtering on aesthetic ratings is removing huge amounts of NSFW content from the image dataset, and the full dataset contains much more. Or maybe their definitions of what is “unsafe” are very broad.

More Info

Again, huge thanks to Simon Willison for working with me on this: he did all the heavy lifting of hosting the data. He wrote a detailed post about making the search engine if you want more technical detail. His Datasette project is open-source, extremely flexible, and worth checking out. If you’re interested in playing with this data yourself, you can use the scripts in his GitHub repo to download and import it into a SQLite database.

If you find anything interesting in the data, or have any questions, feel free to drop them in the comments.

Update

On December 20, 2023, LAION took down its LAION-5B and LAION-400M datasets after a new study published by the Stanford Internet Observatory found that it included links to child sexual abuse material. As reported by 404 Media, “The LAION-5B machine learning dataset used by Stable Diffusion and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.”

The subset of “aesthetic” images we analyzed was only 2% of the full 2.3 billion image dataset, and of those 12 million images, only 222 images were classified as NSFW. As a result, it’s unlikely any of those links go to CSAM imagery, but because it’s impossible to know with certainty, Simon took the precaution of permanently shuttering the LAION-Aesthetic browser.