Perfect Tides, a Coming-of-Age Point-and-Click Adventure, Kickstarts a Sequel

Posted August 31, 2022 (updated September 3, 2022) by Andy Baio

There’s no shortage of amazing games so far this year, but my personal favorite is an underdog: Perfect Tides, a ’90s-esque point-and-click adventure about growing up as a teen in a sleepy island resort town in the early 2000s, finding an escape from real-life feelings of loneliness and loss in discussion forums and late-night AIM chats.

Mara and her friend Lily on the beach… definitely not on drugs

The first game from Meredith Gran, creator of the decade-long comic series Octopus Pie, it approaches challenging subjects with the confidence of someone who created narrative comics every week for ten years. I can’t think of another comics artist who has dived into game design like this, but it pays off with uniquely charming pixel art and animation, colorful writing, and a story that genuinely moved me by the end. It navigates complex feelings about family, old friends, and new loves, while also being genuinely funny.

Let’s put it this way: I’ve been playing videogames for the last 35 years, but Perfect Tides is the first time I felt compelled to write a walkthrough (spoilers!) and actively participate in forums to help people finish it.

This is a long way of saying that you should play Perfect Tides on Steam or Itch, and then go back the Kickstarter for its sequel, Perfect Tides: Station to Station, which has only six days to go and still needs another $20,000 to cross the finish line. (Update: It hit the goal!)

you should probably go back this project

But you don’t have to take my word for it! Kotaku said the original game was “one of the year’s best,” the “kind of game you don’t even see coming, yet turns out to be incredible” and “perfectly captures the intensity and struggle of adolescence.” AV Club called it a “harrowing, funny, beautiful, horrifying, and ultimately reassuring work of art.” Polygon summed it up as “devastatingly honest.” My favorite review was from Buried Treasure’s John Walker, who wrote, “It is the most extraordinary exploration of what it is to be a teenager, told with such heart, such truth.”

Spoilers Ahoy

If you’ve already played Perfect Tides, I want to mention two key moments that are so wonderful, and yet so easy to miss in your first playthrough, they’re worth replaying it for. THESE ARE SPOILERS!

First, if you didn’t manage to patch things up with Lily, you missed a long sequence with her in the final season of the game. (To get the full experience of that sequence, you’ll need to find a specific MP3 and put it into the game directory when prompted: a remarkable breaking-the-fourth-wall sidestep around copyright licensing that I’ve never seen in a game before.)

Second, there are two major endings. If it feels anticlimactic, you likely didn’t resolve your conflicts with Lily, Simon, and your family. There are 95 possible points, but you don’t need them all to get the best ending. Feel free to use my 100% completion guide for help getting there.

Perfect Tides isn’t perfect. Like any classic point-and-click adventure, there are some clunky bits here and there, and you’ll likely need the occasional hint or glance at a playthrough to finish. But it’s so worth it.

Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator

Posted August 30, 2022 (updated December 20, 2023) by Andy Baio

One of the biggest frustrations of text-to-image generation AI models is that they feel like a black box. We know they were trained on images pulled from the web, but which ones? As an artist or photographer, an obvious question is whether your work was used to train the AI model, but this is surprisingly hard to answer.

Sometimes, the data isn’t available at all: OpenAI has said it’s trained DALL-E 2 on hundreds of millions of captioned images, but hasn’t released the proprietary data. By contrast, the team behind Stable Diffusion have been very transparent about how their model is trained. Since it was released publicly last week, Stable Diffusion has exploded in popularity, in large part because of its free and permissive licensing, already incorporated into the new Midjourney beta, NightCafe, and Stability AI’s own DreamStudio app, as well as for use on your own computer.

But Stable Diffusion’s training datasets are impossible for most people to download, let alone search, with metadata for millions (or billions!) of images stored in obscure file formats in large multipart archives.

So, with the help of my friend Simon Willison, we grabbed the data for over 12 million images used to train Stable Diffusion, and used his Datasette project to make a data browser for you to explore and search it yourself. Note that this is only a small subset of the total training data: about 2% of the 600 million images used to train the most recent three checkpoints, and only 0.5% of the 2.3 billion images that it was first trained on.

Screenshot of the LAION-Aesthetic data browser, showing results from a search for Swedish artist Simon Stålenhag with thumbnail images — Screenshot of the LAION-Aesthetic data browser in Datasette

Go try it right now at laion-aesthetic.datasette.io! (Update: It’s now offline. See below for details.)

Read on to learn about how this dataset was collected, the websites it most frequently pulled images from, and the artists, famous faces, and fictional characters most frequently found in the data.

Data Source

Stable Diffusion was trained off three massive datasets collected by LAION, a nonprofit whose compute time was largely funded by Stable Diffusion’s owner, Stability AI.

All of LAION’s image datasets are built off of Common Crawl, a nonprofit that scrapes billions of webpages monthly and releases them as massive datasets. LAION collected all HTML image tags that had alt-text attributes, classified the resulting 5 billion image-pairs based on their language, and then filtered the results into separate datasets using their resolution, a predicted likelihood of having a watermark, and their predicted “aesthetic” score (i.e. subjective visual quality).
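As a rough illustration of that first collection step, here is a minimal sketch that pulls image/alt-text pairs out of a page of HTML using only Python's standard library. This is not LAION's actual pipeline: the class name, the function name, and the sample HTML are all hypothetical, and the real process also handled language classification and deduplication at a vastly larger scale.

```python
from html.parser import HTMLParser


class ImageAltCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that carry non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        # An attribute with no value comes through as None, so normalize it.
        alt = (attr_map.get("alt") or "").strip()
        if src and alt:
            self.pairs.append((src, alt))


def collect_image_text_pairs(html):
    """Return every (image URL, alt text) pair found in an HTML string."""
    collector = ImageAltCollector()
    collector.feed(html)
    return collector.pairs


# Hypothetical sample page: only the first image has usable alt text.
sample_html = """
<p><img src="https://example.com/cat.jpg" alt="A cat on a sofa"></p>
<p><img src="https://example.com/spacer.gif"></p>
<p><img src="https://example.com/dog.jpg" alt=""></p>
"""
pairs = collect_image_text_pairs(sample_html)
```

Images with missing or empty alt text are skipped, which is the reason only captioned images end up in this kind of dataset.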

Collage of some of the images with the highest “aesthetic” score, largely watercolor landscapes and portraits of women

Stable Diffusion’s initial training was on low-resolution 256×256 images from LAION-2B-EN, a set of 2.3 billion English-captioned images from LAION-5B‘s full collection of 5.85 billion image-text pairs, as well as LAION-High-Resolution, another subset of LAION-5B with 170 million images greater than 1024×1024 resolution (downsampled to 512×512).

Its last three checkpoints were on LAION-Aesthetics v2 5+, a 600 million image subset of LAION-2B-EN with a predicted aesthetics score of 5 or higher, with low-resolution and likely watermarked images filtered out.

For our data explorer, we originally wanted to show the full dataset, but it’s a challenge to host a 600 million record database in an affordable, performant way. So we decided to use the smaller LAION-Aesthetics v2 6+, which includes 12 million image-text pairs with a predicted aesthetic score of 6 or higher, instead of the 600 million rated 5 or higher used in Stable Diffusion’s training.
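To make the relationship between the two subsets concrete, here is a small sketch of threshold filtering on a predicted aesthetic score. The record type, field names, and sample scores are invented for illustration, and the real subsets also applied resolution and watermark filters that are left out here.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    url: str
    caption: str
    aesthetic: float  # predicted aesthetic score (hypothetical field name)


def filter_by_aesthetic(records, minimum):
    """Keep only records whose predicted score is at or above the minimum."""
    return [r for r in records if r.aesthetic >= minimum]


# Invented sample records, not real dataset rows.
records = [
    ImageRecord("https://example.com/a.jpg", "Watercolor landscape", 6.7),
    ImageRecord("https://example.com/b.jpg", "Product photo", 5.2),
    ImageRecord("https://example.com/c.jpg", "Blurry screenshot", 3.9),
]

five_plus = filter_by_aesthetic(records, 5.0)  # analogous to the "5+" subset
six_plus = filter_by_aesthetic(records, 6.0)   # analogous to the "6+" subset
```

Because the only difference is a higher cutoff, everything in the 6+ set is also in the 5+ set. That is why the smaller set can stand in as a sample of the larger one, with a bias toward higher-scoring images.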

This should be a representative sample of images used to train Stable Diffusion’s last three checkpoints, but skewing towards more aesthetically-attractive images. Note that LAION provides a useful frontend to search the CLIP embeddings computed from their 400M and 5 billion image datasets, but it doesn’t allow you to search the original captions.

Source Domains

We know the captioned images used for Stable Diffusion were scraped from the web, but from where? We indexed the 12 million images in our sample by domain to find out.

Nearly half of the images, about 47%, were sourced from only 100 domains, with the largest number of images coming from Pinterest. Over a million images, or 8.5% of the total dataset, are scraped from Pinterest’s pinimg.com CDN.

User-generated content platforms were a huge source for the image data. WordPress-hosted blogs on wp.com and wordpress.com represented 819k images together, or 6.8% of all images. Other photo, art, and blogging sites included 232k images from Smugmug, 146k from Blogspot, 121k from Flickr, 74k from Wikimedia, 67k from DeviantArt, 48k from 500px, and 28k from Tumblr.

Shopping sites were well-represented. The second-biggest domain was Fine Art America, which sells art prints and posters, with 698k images (5.8%) in the dataset. 244k images came from Shopify, 189k each from Wix and Squarespace, 90k from Redbubble, and just over 47k from Etsy.

Unsurprisingly, a large number came from stock image sites. 123RF was the biggest with 497k, 171k images came from Adobe Stock’s CDN at ftcdn.net, 117k from PhotoShelter, 35k images from Dreamstime, 23k from iStockPhoto, 22k from Depositphotos, 22k from Unsplash, 15k from Getty Images, 10k from VectorStock, and 10k from Shutterstock, among many others.

It’s worth noting, however, that domains alone may not represent the actual sources of these images. For instance, there are only 6,292 images sourced from Artstation.com’s domain, but another 2,740 images with “artstation” in the caption text hosted by sites like Pinterest.
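For anyone who wants to try this kind of tally on their own copy of the data, here is a minimal sketch of counting images by domain, plus the caption check described above. It is not the code we actually ran: the sample rows are made up, and `naive_domain` is a hypothetical helper that simply keeps the last two labels of the hostname, which would mishandle suffixes like `.co.uk`.

```python
from collections import Counter
from urllib.parse import urlsplit


def naive_domain(url):
    """Reduce a URL to its last two hostname labels (a rough simplification)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


# Made-up (url, caption) rows standing in for the real table.
rows = [
    ("https://i.pinimg.com/736x/a.jpg", "Castle concept art, trending on ArtStation"),
    ("https://i.pinimg.com/originals/b.jpg", "Watercolor landscape"),
    ("https://cdna.artstation.com/p/c.jpg", "Robot sketch"),
    ("https://images.fineartamerica.com/d.jpg", "Sunset print"),
]

# How many images each domain contributed.
domain_counts = Counter(naive_domain(url) for url, _ in rows)

# Images that mention a site in the caption but are hosted somewhere else.
mentioned_elsewhere = sum(
    1
    for url, caption in rows
    if "artstation" in caption.lower() and naive_domain(url) != "artstation.com"
)
```

In this toy data, `domain_counts.most_common(1)` puts `pinimg.com` first with two images, and one image mentions ArtStation in its caption while being hosted on another domain.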

Artists

We wanted to understand how artists were represented in the dataset, so we used the list of over 1,800 artists in MisterRuffian’s Latent Artist & Modifier Encyclopedia to search the dataset and count the number of images that reference each artist’s name. You can browse and search those artist counts here, or try searching for any artist in the images table. (Searching with quoted strings is recommended.)
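If you have the data in SQLite, a simple version of that count might look like the sketch below. It assumes a table named `images` with a `text` column holding the caption, which may not match the real schema, and it uses a plain substring match via `LIKE` in place of the full-text search the data browser offered. The sample captions are invented.

```python
import sqlite3

# In-memory database with invented captions; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (url TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [
        ("https://example.com/1.jpg", "Cottage by Thomas Kinkade"),
        ("https://example.com/2.jpg", "thomas kinkade - Evening Glow"),
        ("https://example.com/3.jpg", "Crashing waves by Erin Hanson"),
        ("https://example.com/4.jpg", "Pink flowers in a vase"),
    ],
)


def count_mentions(conn, name):
    """Count captions containing the name (LIKE ignores ASCII case in SQLite)."""
    # Note: a name containing % or _ would be treated as a wildcard here.
    pattern = f"%{name}%"
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM images WHERE text LIKE ?", (pattern,)
    ).fetchone()
    return count


artist_counts = {
    name: count_mentions(conn, name)
    for name in ["Thomas Kinkade", "Erin Hanson", "Greg Rutkowski"]
}
```

Substring matching is blunt: a one-word name will match any caption containing that word, so short or common names overcount.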

Of the top 25 artists in the dataset, only three are still living: Phil Koch, Erin Hanson, and Steve Henderson. The most frequent artist in the dataset? The Painter of Light™ himself, Thomas Kinkade, with 9,268 images.

From a list of 1,800 popular artists, the top 10 found most frequently in the captioned images

Using the “type” field in the database, you can see the most frequently-found artists in each category: for example, looking only at comic book artists, Stan Lee’s name is found most often in the image captions. (As one commenter pointed out, Stan Lee was a comic book writer, not an artist, but people are using his name to generate images in the style of comic book art he was associated with.)

Some of the most-cited recommended artists used in AI image prompting aren’t as pervasive in the dataset as you’d expect. There are only 15 images that mention fantasy artist Greg Rutkowski, whose name is frequently used as a prompt modifier, and only 73 from James Gurney.

(It’s worth saying again that these images are just a subset of one of three datasets used to train the AI, so an artist’s work may have been used elsewhere in the data even if they’re not found in these 12M images.)

Famous People

Unlike DALL-E 2, Stable Diffusion doesn’t have any limitations on generating images of people named in the dataset. To get a sense of how well-represented well-known people are in the dataset, we took two lists of celebrities and other famous names and merged them into a list of nearly 2,000 names. You can see the results of those celebrity counts here, or search for any name in the images table. (Obviously, some of the top searches like “Pink” and “Prince” include results that don’t refer to that person.)

Donald Trump is one of the most cited names in the image dataset, with nearly 11,000 photos referencing his name. Charlize Theron is a close runner-up with 9,576 images.

Collage of generated portraits of Donald Trump and Charlize Theron from Stable Diffusion

A full gender breakdown would take more time, but at a glance, it seems like many of the most popular names in the dataset are women.

Strangely, enormously popular internet personalities like David Dobrik, Addison Rae, Charli D’Amelio, Dixie D’Amelio, and MrBeast don’t appear in the captions from the dataset at all. My hunch was that the Common Crawl data was too old to include these more recent celebrities, but based on the URLs, there are tens of thousands of images from last year in the data. (If you can solve this mystery, get in touch or leave a comment!)

Fictional Characters

Finally, we took a look at how popular fictional characters are represented in the dataset, since this is subject matter that’s enormously popular using Stable Diffusion and Craiyon, but often impossible with DALL-E 2, as you can see in this Mickey Mouse example from my previous post.

“realistic 3d rendering of mickey mouse working on a vintage computer doing his taxes” on DALL·E 2 (left) vs. Stable Diffusion (right)

For this set of searches, we used this list of 600 fictional characters from pop culture to search the image dataset. You can browse the results here, or search for any other character in the images table. (Again, be aware that one-word character names like “Link,” “Data,” and “Mario” are likely to have many more results unrelated to that character.)

Characters from the MCU like Captain Marvel (4,993 images), Black Panther (4,395), and Captain America (3,155) are some of the best represented in the dataset. Batman (2,950) and Superman (2,739) are neck and neck. Luke Skywalker (2,240) has more images than Darth Vader (1,717) and Han Solo (1,013). Mickey Mouse barely breaks the top 100 with 520 images.

NSFW Content

Finally, let’s take a brief look at the representation of adult material, another huge difference between Stable Diffusion and any other model. OpenAI rigorously removed sexual/violent content from its training data and blocked potentially NSFW keywords from prompts.

The Stable Diffusion team built a predictor for adult material and assigned every image an NSFW probability score, which you can see in the “punsafe” field in the images table, ranging from 0 to 1. (Warning: Obviously, sorting by that field will show the most NSFW images in the dataset.)

In their announcements of the full LAION-5B dataset, LAION team member Romain Beaumont estimated that about 2.9% of the English-language images were “unsafe,” but in browsing this dataset, it’s not clear how their predictors defined that.

There’s definitely NSFW material in the image dataset, but surprisingly little of it. Only 222 images got a “1” unsafe probability score, indicating 100% confidence that it’s unsafe, about 0.002% of the total images — and those are definitely porn. But nudity seems to be unusual outside of that confidence level: even images with a 0.9999 punsafe score (99.99% confidence) rarely have nudity in them.
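The percentage above is simple arithmetic, sketched here alongside a small helper for computing the share of scores at or above a threshold. The helper name and the sample scores are hypothetical; only the 222 and 12 million figures come from the data described in this post.

```python
def share_at_or_above(scores, threshold):
    """Return the percentage of scores that are >= threshold."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if s >= threshold)
    return flagged / len(scores) * 100


# Invented punsafe-style scores between 0 and 1: one of four is exactly 1.0.
sample_scores = [0.0001, 0.02, 0.9999, 1.0]
sample_share = share_at_or_above(sample_scores, 1.0)

# The figures quoted above: 222 images out of roughly 12 million.
reported_share = 222 / 12_000_000 * 100
```

Here `reported_share` works out to about 0.00185, which rounds to the 0.002% quoted above.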

It’s plausible that filtering on aesthetic ratings is removing huge amounts of NSFW content from the image dataset, and the full dataset contains much more. Or maybe their definitions of what is “unsafe” are very broad.

More Info

Again, huge thanks to Simon Willison for working with me on this: he did all the heavy lifting of hosting the data. He wrote a detailed post about making the search engine if you want more technical detail. His Datasette project is open-source, extremely flexible, and worth checking out. If you’re interested in playing with this data yourself, you can use the scripts in his GitHub repo to download and import it into a SQLite database.

If you find anything interesting in the data, or have any questions, feel free to drop them in the comments.

Update

On December 20, 2023, LAION took down its LAION-5B and LAION-400M datasets after a new study published by the Stanford Internet Observatory found that it included links to child sexual abuse material. As reported by 404 Media, “The LAION-5B machine learning dataset used by Stable Diffusion and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.”

The subset of “aesthetic” images we analyzed was only about 0.5% of the full 2.3 billion image dataset, and of those 12 million images, only 222 images were classified as NSFW. As a result, it’s unlikely any of those links go to CSAM imagery, but because it’s impossible to know with certainty, Simon took the precaution of permanently shuttering the LAION-Aesthetic browser.

Opening the Pandora’s Box of AI Art

Posted August 26, 2022 (updated January 12, 2023) by Andy Baio

Last month, I finally got access to OpenAI’s DALL·E 2 and immediately started exploring the text-to-image AI’s potential for creative shitposting, generating horror after horror: the Eames Lounge Toilet, the Combination Pizza Hut and Frank Lloyd Wright’s Fallingwater, toddler barflies, Albert Einstein inventing jorts, and the can’t-unsee “close up photo of brushing teeth with toothbrush covered with nacho cheese.”

DALL-E 2 rendered illustration of man with glasses working on a computer — “plasticine nerd working on a 1980s computer”

Detailed closeup realistic rendering of a human finger with a tiny gummy worm crawling on it with warm colors — “plasticine nerd working on a 1980s computer”

DALL·E 2 diligently hallucinated each image out of noise from the compressed latent space, multi-dimensional patterns discovered in hundreds of millions of captioned images scraped from the internet.

DALL-E 2 generated rendering of a macro photo of two slugs in the grass, one wearing a headdress of honeycomb and the other with stacked cottonballs on their head — “two slugs in wedding attire getting married, stunning editorial photo for bridal magazine shot at golden hour”

Rendering of a macro photo of two slugs drapes with golden cloths adorned with floral decorative elements — “two slugs in wedding attire getting married, stunning editorial photo for bridal magazine shot at golden hour”

The prompt that finally melted my brain was the one above, with images of slugs getting married at golden hour. I originally specified a “tuxedo and wedding dress” with predictable results, but changing it to “wedding attire” gave the AI the flexibility to depict variations of what slugs might marry in, like headdresses made of cotton balls and honeycomb.

I’ve never felt so conflicted using an emerging technology as DALL·E 2, which feels like borderline magic in what it’s capable of conjuring, but raises so many ethical questions, it’s hard to keep track of them all.

There are the many known issues that OpenAI’s acknowledged and worked to mitigate, like racial or gender biases in its image training set, or the lengths they’ve gone to avoid generating sexual/violent content or recognizable celebrities and trademarked characters.

But it opens profound questions about the ethics of laundering human creativity:

Is it ethical to train an AI on a huge corpus of copyrighted creative work, without permission or attribution?
Is it ethical to allow people to generate new work in the styles of the photographers, illustrators, and designers without compensating them?
Is it ethical to charge money for that service, built on the work of others?

There are fundamental questions about whether it’s even legal: these are largely untested waters in copyright law and it seems destined to end up in court. Training deep learning models on copyrighted material may be fair use, but only a judge can decide that. (The fact that OpenAI’s removing some results from the image training set, like celebrity faces and Disney/Marvel characters, suggests they’re well aware of angering the biggest litigants.)

A 3D rendered grey mouse wearing a t-shirt and working on a computer/typewriter hybrid — “realistic 3d rendering of mickey mouse working on a vintage computer doing his taxes” on DALL·E 2 (left) vs. Stable Diffusion (right)

a 3D rendering of Mickey Mouse sitting at a retro-like computer monitor and keyboard — “realistic 3d rendering of mickey mouse working on a vintage computer doing his taxes” on DALL·E 2 (left) vs. Stable Diffusion (right)

As these models improve, they seem likely to reduce demand for some paid creative services, from stock photography to commissioned illustrations. I empathize with the concerns of artists whose work was silently used to train commercial products in their style, without their consent and with no way to opt out.

The world was just starting to grapple with the implications of this technology when, on Monday, a company called Stability AI released its Stable Diffusion text-to-image AI publicly.

Stable Diffusion is free, open-source, runs on your own computer, and ships without any of the guardrails and content filters of its predecessors. It comes with a Safety Classifier enabled by default that tries to determine if a generated image is NSFW, but it’s easily disabled.

Realistic rendering of Barack Obama kissing a somber Donald Trump's head — “Obama comforting Trump”

Series of photorealistic black-and-white rendered studio portraits of Scarlett Johansson — “Obama comforting Trump”

Unlike existing AI platforms like DALL·E 2 and Midjourney, Stable Diffusion can generate recognizable celebrities, nudity, trademarked characters, or any combination of those. (Try searching Lexica, the newly-launched Stable Diffusion search engine, for example output.)

Releasing an uncensored dream machine into the wild had some predictable results. Two days after its release, Reddit banned three subreddits devoted to NSFW imagery made with Stable Diffusion, presumably because of the rapid influx of AI-generated fake nudes of Emma Watson, Selena Gomez, and many others.

Screenshot of message explaining the "Stable Diffusion NSFW" subreddit was banned for violating Reddit's rules against non-consensual intimate media

The permissive license on Stable Diffusion allows commercial services to implement its AI model, such as NightCafe, which encourages paying customers to generate art in the styles of living artists like Pendleton Ward, Greg Rutkowski, Amanda Sage, Rebecca Sugar, and Simon Stålenhag, who has spoken out against the practice.

Screenshot from NightCafe with a list of artist names recommended as modifiers for art prompts — List of artist modifiers in NightCafe

On top of that, Stable Diffusion’s terms state that every image generated with their DreamStudio is effectively public domain, under the CC0 1.0 Public Domain license. They make no claim over the copyright of images generated with the self-hosted Stable Diffusion model. (OpenAI’s terms say that images created with DALL·E 2 are their property, with customers granted a license to use them commercially.)

A common argument I’ve seen is that training AI models is like an artist learning to paint and finding inspiration by looking at other artwork, which feels completely absurd to me. AI models are memorizing the features found in hundreds of millions of images, and producing images on demand at a scale unimaginable for any human—thousands every minute.

The results can be surprising and funny and beautiful, but only because of the vast trove of human creativity it was trained on. Stable Diffusion was trained on LAION-Aesthetic, a 120-million image subset of a 5 billion image crawl of image-text pairs from the web, winnowed down to the most aesthetically attractive images. (OpenAI has been more cagey about its sources.)

There’s no question it takes incredible engineering skill to develop systems to analyze that corpus and generate new images from it, but if any of these systems required permission from artists to use their images, they likely wouldn’t exist.

Stability AI founder Emad Mostaque believes the good of new technology will outweigh the harm. “Humanity is horrible and they use technology in horrible ways, and good ways as well,” Mostaque said in an interview two weeks ago. “I think the benefits far outweigh any negativity and the reality is that people need to get used to these models, because they’re coming one way or another.” He thinks that OpenAI’s attempts to minimize bias and mitigate harm are “paternalistic,” and a sign of distrust of their userbase.

Today we all made the World a more creative, happier and communicative place.

More to come in the next few days but I for one can’t wait to see what you all create.

Let’s activate humanity’s potential.
— Emad (@EMostaque) August 22, 2022

In that interview, Mostaque says that Stability AI and LAION were largely self-funded from his career as a hedge fund manager, and with additional resources, they’ve created a 4,000 A100 cluster with the support of Amazon that “ranks above JUWELS Booster as potentially the tenth fastest supercomputer.”

On Monday, Mostaque wrote that they plan to use those compute resources to expand to other AI-generated media: audio next month, and then 3D and video. I’d expect Stability AI to approach these new models in the same way, with little concern over their potential for misuse by bad actors, and with even less attention spent addressing the concerns of the artists and creators whose work makes them possible.

Like I said, I’m conflicted. I love playing with new technology, and I’m excited about the creative potential of these new tools. I want to feel good about the tools I use.

I don’t trust OpenAI for a bunch of reasons, but at least they seemed to try to do the right thing with their various efforts to reduce bias and potential harm, even if it’s sometimes clumsy.

Stable Diffusion’s approach feels irresponsible by comparison, another example of techno-utopianism unmoored from the reality of the internet’s last 20 years: how an unwavering commitment to ideals of free speech and anti-censorship can be deployed as a convenient excuse not to prevent abuse.

For now, generative AI platforms are some of the most resource-intensive projects in the world, leading to a vanishingly small number of participants with access to vast compute resources. It would be nice if those few companies would act responsibly by, at the very least, providing an opt-out for those who don’t want their work in future training data, finding new ways to help artists who do choose to participate, and following the lead of OpenAI in trying to minimize the potential for harm.

I don’t pretend to know where these things will go: the risks may be overblown and we may be at the start of a massive democratization in the creation of art, or these platforms may make the already-precarious lives of artists harder, while opening up new avenues for deepfakes, misinformation, and online harassment and exploitation. I’d really like to see more of the former, but it won’t happen on its own.

Waxy.org Turns 20

Posted April 14, 2022 (updated June 24, 2024) by Andy Baio

Hard to believe, but I started blogging 20 years ago today with this short post.

In my first ten years of writing, I published 415 posts and over 13,000 links. And in the last ten years, I published 136 posts and a little over 5,000 links, a pretty big drop from the ten years before.

There are some pretty obvious reasons why my posting slowed since 2012:

XOXO started that year, which became a big creative outlet for me, as well as a big time sink.
My long-form writing shifted elsewhere, with my column in WIRED and as a member of The Message publication on Medium, while short-form writing continued to land on Twitter.
I became more focused on quality than quantity, with a higher bar for what made it here.
I was less motivated to invest time in writing, in part because fewer people were reading.

I still enjoy writing though, and have no intention of stopping any time soon.

Ten years ago, I wrote a roundup of my favorite posts from my first decade of blogging, and I thought I’d do the same thing for 2012-2021. If you missed them the first time around, I hope you check them out this time. Looking back on the last ten years, I’m proud of so many of these pieces.

2012

Introducing Playfic. Announcing the launch of Playfic, a tool for writing and sharing Inform 7 interactive fiction games in the browser. Nearly 3,000 games have been published so far, and I rounded up some highlights in 2013. (These days, I’d recommend using Borogove.)

The Perpetual, Invisible Window Into Your Gmail Inbox. I wrote about Unroll.me and similar apps that were quietly requesting access to all your email, an issue that exploded five years later when it was revealed they were selling user info to Uber, among others.

YouTube’s Content ID Disputes Are Judged by the Accuser. Raising awareness of YouTube’s end-run around the DMCA, which continues to be an issue today.

A Patent Lie: How Yahoo Weaponized My Work. This article blew up pretty big. In it, I talk about how tech corporations encourage developers to patent their work, ostensibly for defensive purposes, only for those patents to be used in litigation to stop innovation. It popularized the term “weaponized patents” in the process.

Instagram’s Buyout: How Does It Measure Up? Crunching the numbers on Instagram’s billion-dollar sale to Facebook against other notable acquisitions to see how it measured up. Instagram made $26 billion in ad revenue last year, more than Facebook itself, so a pretty smart deal.

Criminal Creativity: Untangling Cover Song Licensing on YouTube. Trying to unravel the surprisingly complicated question of whether a cover song uploaded to YouTube is infringement or not.

Introducing XOXO. Launched on Kickstarter, sold every ticket in 50 hours.

The Unified Theory of XOXO. Once the dust settled from the first XOXO, I wrote about what we were trying to do and the decisions we made — all of which are still part of the festival today.

2013

Aaron. Remembering Aaron Swartz.

The New Prohibition. Occasionally, my posts end up turning into conference talks, like in this Creative Mornings presentation.

The Death of Upcoming.org. I found out Yahoo was shutting down Upcoming like everyone else, with 11 days’ notice. With Archive Team’s help, we were able to collectively archive the vast majority of the site, allowing me to later restore nearly every event to its original URL.

Remembering XOXO 2013. Where we started really figuring things out.

Screens on Screen. A huge dump of fake computer screens in movies, and the projects that popped up around it.

GoldieBlox and the Three MCs. Copyright and fair use analysis of a repurposed parody of the Beastie Boys’ “Girls” for a toy commercial.

2014

Ellen DeGeneres’ “Walter Mitty” Screener Leaks Online. I was the first to report on this screener linked to a celebrity, which got coverage in Variety, Hollywood Reporter, Deadline, and many more.

‘JIF’ Is the Format. ‘GIF’ Is the Culture. Steve Wilhite may have designed the GIF format, but the looping animated GIF was a product of the web, invented eight years later.

72 Hours of #Gamergate. Analyzing over 316,000 tweets that mentioned #Gamergate to spot trends and visualize the network, including clear evidence that most supporters were using newly-created accounts.

Diary of a Corporate Sellout. A personal post about the risks that come from selling your startup when it’s also an online community. “When you sell the house, you’re not just selling a house. You’re selling everyone inside.”

How to Flawlessly Predict Anything on the Internet. I still love this post, explaining how a classic confidence scam could be adapted to social media with convincing results.

Playing With My Son. One of my all-time favorites, the story of playing videogame history with my son in (roughly) chronological order. I repurposed this one for a talk at Gel 2015, with my son in the front row.

2015

Pirating the 2015 Oscars: HD Edition. An interesting shift in screener leaks: pirates didn’t want them anymore because DVDs were increasingly considered poor-quality. “Pirates are now watching films at higher quality than the industry insiders voting on them.”

Never Trust A Corporation To Do A Library’s Job. My love letter to the Internet Archive, and Google’s failure to live up to their original mission statement to organize the world’s information.

If Drake Was A Piano. My experiments with converting MP3s to MIDI and back.

2016

Remembering XOXO 2016. 2016 was a busy year, between opening and closing the XOXO Outpost (our massive workspace for indie artists), working on the Upcoming reboot, and holding the fifth year of the festival. I didn’t get a lot of writing done.

Redesigning Waxy. I did squeeze in a redesign though, and some thoughts on blogging in 2016.

Creativity in a Post-Trump America. Just not a great year.

Go to Bed.

2017

The Long Cold Winter. Announcing the closure of the Outpost and the relaunch of Upcoming.

This Must Be The /r/Place. One of my favorite projects ever, led by future Wordle creator Josh Wardle.

Closing Communities: FFFFOUND! vs MLKSHK. Two very different approaches to shutting down an online community.

Pogo’s Politics. This post about Australian remix artist Pogo still gets traffic any time his name comes up, and people become aware of his repulsive views on women. “It’s hard to truly enjoy art made by someone you can’t respect.”

The Flagpole Sitta Lip Dub Turns 10. Reminiscing about a viral trend in the mid-2000s, and the video that helped popularize it.

You Think You Know Me. Announcing my wife Ami’s first card game, which I helped edit and design, now published under the moniker Pink Tiger Games. Her fourth game, Lost for Words, is coming out later this year, this one co-designed with our son, Eliot. It’s turned into a real family business!

2018

A Tribute to YouTube Annotations. Six weeks before YouTube retired its annotations feature, I collected as many notable examples as I could find. Sadly, they’re all no longer interactive.

Demi Adejuyigbe at XOXO 2018. My only post about XOXO 2018, which was more than double the size in a new venue and absolutely exhausting, but still really memorable. Lizzo played the closing party and then sang karaoke with everyone! I regret not writing more about it while it was fresh in my mind.

Why You Should Never, Ever Use Quora. Quora has the most regressive archiving policy of any online community, and it’s likely to be an epic loss of collected knowledge when they eventually close down.

2019

Dad. I don’t talk about my personal life often, but I sometimes make an exception for close friends and family I’ve lost.

Suck.com, Gone for Good (For Good). It’s been returning nothing but a PHP error (with database credentials!) for over a year.

Fast and Free Music Separation with Deezer’s Machine Learning Library. A series of AI audio experiments, an area I’ve been following for some time.

Turning Photos into 2.5D Parallax Animations with Machine Learning. A good excuse for me to learn how Google Colab notebooks work.

Unraveling the Mystery of “Visit Eroda,” The Tourism Campaign For An Island That Doesn’t Exist. A delightful ARG-like campaign that I followed in real-time as it developed, and the fascinating cultural divide between Harry Styles fans and ARG fans who didn’t want to believe.

How Artists on Twitter Tricked Spammy T-Shirt Stores Into Admitting Their Automated Art Theft. I want this post on a t-shirt.

2020

Paste Parties: The Ephemeral, Chaotic Joy of Random Clipboards. How I celebrate my birthday online every year: asking everyone to tweet me their unedited clipboards.

With questionable copyright claim, Jay-Z orders deepfake audio parodies off YouTube. The legal implications of AI-generated music are complex and fascinating.

OpenAI’s Jukebox Opens the Pandora’s Box of AI-Generated Music. Two days after my Jay-Z post, OpenAI released a neural network that could generate music in the style of various artists, with 7,100 song samples.

alt.binaries.images.underwater.non-violent.moderated: a deep dive. Solving a Usenet newsgroup curiosity, over 20 years later.

The House on Blue Lick Road. 2020’s best game was a 3D real estate listing of a sprawling hoarder house. I had to know more, so I picked up the phone and called the owner.

2021

Announcing Skittish. I spent all of last year working on Skittish, a virtual event space where you navigate the world as a little animal and talk to people near you with your microphone. It evolved quickly, hosted its first public events in June, and launched in November. I’m still working on it. You should check it out.

Colin’s Bear Animation, Revisited. Digging into the genealogy of a TikTok meme that bizarrely recreated the dance from Colin’s Bear Animation video, but with no other reference to the original.

Pirating the Oscars: Pandemic Edition. The pandemic really messed with my Oscar screener charts.

And that’s pretty much up to today. Thanks for sticking around and thanks for reading. See you in ten years?

In the Shadow of the Star Wars Kid

Posted March 31, 2022 (updated April 6, 2022) by Andy Baio

Last August, I entered a loft in downtown Portland, walked through a door, and met someone I’ve wanted to talk to for the last 20 years: Ghyslain Raza, the unwilling subject of the “Star Wars Kid” meme, the biggest viral video of the pre-YouTube era.

Since the video and its remixes exploded online in 2003, Ghyslain has refused all interview requests, with one exception: for the 10th anniversary of the video’s release in 2013, he spoke to a French-Canadian journalist for L’actualité magazine, in an interview that was translated into English for partner magazine Maclean’s.

But over the last couple years, he’s quietly worked with a group of documentary filmmakers to tell his story for the first time, in his own words. The full-length film was released today in French and English, as you’d expect from Quebec-based filmmakers. In English, it’s being released as Star Wars Kid: The Rise of the Digital Shadows, but I’m partial to the French title, Dans l’ombre du Star Wars Kid, which translates to “In the Shadow of the Star Wars Kid.” It feels much more fitting to the story they told.

It’s now available for streaming free from the National Film Board of Canada’s site, and I highly recommend watching it. I was lucky enough to get an advance screener and it’s a powerful film expertly told. Update: After a short window, the documentary can now only be viewed in Canada. No word on when it’ll be available elsewhere.

Making the Documentary

In February 2021, the documentary’s director, Mathieu Fournier, reached out to see if I’d speak to them about my role in the video’s initial spread and the fundraiser we held for him, my ultimately-futile attempt to shift the narrative to a positive light.

I’ve declined every interview request about this subject since 2003, but was surprised to hear that Ghyslain himself was deeply involved in the production, so I immediately agreed to participate.

The production team came to Portland for the filming, where I was interviewed for a couple hours by the filmmakers. Then, Ghyslain and I sat down for a long one-on-one conversation on camera about everything that happened 20 years ago, the impact it had on his life, and how he looks back on it now.

I’ve never talked about it publicly, but I regret ever posting it. From the start, it was obvious it was never meant to be seen, and mirroring it on my site without consent was wrong in a way that I couldn’t see when I was in my 20s, one year into blogging. I removed the videos once it was clear how it was affecting him, but I never should have posted them in the first place.

Meeting Ghyslain gave me the opportunity to tell him all of that in person, as well as in my interviews, some of which made it into the finished film.

As a side note, it was fascinating to get answers to questions I’ve wondered about for 20 years. Yes, Ghyslain actually received the iPod we sent him from the fundraiser, and used the gift cards we sent him to buy an iMac G4, both of which he’s kept to this day. He managed to avoid most of the remixes and media coverage, except for Arrested Development, which he watched live as it aired.

But more than anything, it was great to finally talk to him in person and see that he’s doing well. By all accounts, he handled everything that happened back then with a profound emotional maturity, despite how painful it was, and emerged on the other side with a uniquely interesting perspective that’s worth listening to.

Afterwards

After the documentary taping, we all met up for drinks on the roof deck at Revolution Hall, where we hold XOXO every year, and then went out for dinner and more drinks until late at night.

This time, Ghyslain and I were able to talk privately off camera, about our lives and families, about the Commodore 64 and typography, finding natural common ground. When he was younger, he was really into computers, but for obvious reasons, Ghyslain spent much of his life offline after 2003.

Like so many others, I saw my geeky teenage self when first watching the Star Wars Kid video, and sitting across from this 34-year-old man, I saw a parallel-world version of myself in my 30s. I first fell in love with the internet at age 15, the age Ghyslain was when he made the video.

That night, I couldn’t help but wonder how his life would have changed if it never happened. I was surprised to see that in the final film, there’s a moment where Ghyslain talks about our meeting, and wonders exactly the same thing. I hope you take the time to watch it.

Thanks to Ghyslain for his generosity and empathy, and thanks to the filmmakers for making this meeting possible: something I’ve quietly hoped would happen for 20 years.