Today, research laboratory OpenAI announced Jukebox, a sophisticated neural network trained on 1.2 million songs with lyrics and metadata, capable of generating original music in the style of various artists and genres, complete with rudimentary singing and vocal mannerisms.
The Jukebox AI can generate new music in a genre or artist’s style, guided with lyrics and an optional audio prompt, or completely unguided.
Note that Jukebox doesn’t generate lyrics: it can only sing lyrics when they’re provided as input. Without lyrics for guidance, Jukebox generates nonsensical vocal utterances in the style of the original singer. (The lyrics in the Curated Samples section of the Jukebox announcement were generated with an unrelated language model, GPT-2, and used as playful sample input text.)
The resulting work is a clear leap forward in musical quality, though it comes with some limitations.
“While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music.
For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat.”
Just digging around the sample library, I found so many intriguing examples. It’s the uncanny valley of music: machine-hallucinated melodies and nonsensical DeepDream-esque vocals, but often capturing the style and mannerisms of the artist it’s trying to mimic.
In this example, the Jukebox AI is fed the lyrics from Eminem’s “Lose Yourself” and told to generate an entirely new song in the style of Kanye West.
With no lyrics for guidance, the AI tries to generate an entirely new David Bowie song. Have fun making out the lyrics!
Again, with no lyrics to guide it, the AI tries to generate an entirely new Prince song. I asked Anil Dash about it, and he said it sounded like it was trained heavily on Prince’s 2000s-era work.
A.I.-generated Al Green is pretty listenable. If the audio fidelity were better, I’d put this on at a dinner party. The machine-generated vocal utterances (you can’t really call them lyrics) are nonsense, but it hardly matters.
In one of the stranger examples, the OpenAI researchers fed the lyrics of Avril Lavigne’s “Dumb Blonde” to the model—and told it to make a Talking Heads song, complete with David Byrne’s vocal mannerisms.
For the Continuations collection, researchers prompted the AI with the real lyrics and first 12 seconds of the original song, and then just… let it loose. Listen to this version of David Bowie’s “Space Oddity” that rapidly goes sideways once the leash is off.
I wonder if this is what Let It Be-era Beatles sounds like to people who hate the Beatles and/or don’t speak English.
Find any great ones in the collection? Post a comment with your favorites.
Unfortunately, making your own songs won’t be as easy. While the code is available, OpenAI says it takes three hours to render 20 seconds of audio on an NVIDIA Tesla V100, a $10,000 GPU. You can experiment with it on Google Colab for short, low-quality samples, but rendering times and memory limits may make it challenging.
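To put that figure in perspective, here is a rough back-of-the-envelope sketch. It assumes render time scales linearly with audio length, which is my simplification rather than anything OpenAI states, and `estimated_render_hours` is a hypothetical helper name used only for illustration.

```python
# Figures quoted above: roughly 3 hours of V100 time per 20 seconds of audio.
HOURS_PER_CHUNK = 3.0
SECONDS_PER_CHUNK = 20.0


def estimated_render_hours(audio_seconds: float) -> float:
    """Estimate render time, ASSUMING it scales linearly with audio length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / SECONDS_PER_CHUNK * HOURS_PER_CHUNK


if __name__ == "__main__":
    examples = [("20-second clip", 20), ("1-minute sample", 60), ("3-minute song", 180)]
    for label, seconds in examples:
        print(f"{label}: ~{estimated_render_hours(seconds):.0f} hours")
```

Under that linear assumption, a three-minute song would tie up a single V100 for about 27 hours.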
Just two days ago, I wrote about how Jay-Z ordered two deepfaked audio parodies off YouTube, the first known example of someone claiming copyright over an AI voice impersonation and the first time YouTube removed a video for it.
One of the OpenAI researchers on the project addressed the legality question directly, stating that they believe training the AI on copyrighted material is fair use, but have sought clarification from the U.S. Patent and Trademark Office.
But what about using the AI to generate new music? If I make a new album of Britney Spears songs, in her style and in her voice, who owns the copyright for that work?
I’d refer to the discussion of copyright and fair use from my earlier post, which applies here across the board. In short, it depends on how it’s used.
New music generated from a corpus of copyrighted music by a single artist may be considered a derivative work, in which case, only the original elements would be protected by copyright—and what constitutes “original” in this context? Machine-generated melodies and lyrics? The vocal performance? We’re in untested legal waters.
While there’s no federal law for personality rights, many states have recognized the right to control your likeness for commercial use, either by common law or statutes. In one notable example from 1988, Bette Midler was able to win her case against Ford Motor for their use of a sound-alike singer in advertising.
But typically, personality rights statutes would only apply to commercial uses, and not the wide array of non-commercial use for creative remixing.
Even if it’s found to be copyright infringement, the use of AI-generated music for parody, criticism, and commentary should be protected under fair use, but only a court can decide that on a case-by-case basis.
The Future Is Here
In Robin Sloan’s first novella, Annabel Scheme, a quantum computer populates a massive file server with music that never existed in this dimension.
Until this year, Annabel Scheme’s file server was the stuff of science fiction.
With the release of OpenAI’s Jukebox, the future is here and the world of music just got much, much weirder.
On Friday, I linked to several videos by Vocal Synthesis, a new YouTube channel dedicated to audio deepfakes — AI-generated speech that mimics human voices, synthesized from text by training a state-of-the-art neural network on a large corpus of audio.
“Over the past few months, the creator of the channel has trained dozens of speech synthesis models based on the speech patterns of various celebrities or other prominent figures, and has used these models to generate more than one hundred videos for this channel. These videos typically feature a synthetic celebrity voice narrating some short text or a speech. Often, the particular text was selected in order to provide a funny or entertaining contrast with the celebrity’s real-life persona.
“For example, some of my favorites are George W. Bush performing a spoken-word version of “In Da Club” by 50 Cent, or Franklin Roosevelt’s powerful rendition of the Navy Seals Copypasta.
“The channel was created by an individual hobbyist with a huge amount of free time on his hands, as well as an interest in machine learning and artificial intelligence technologies. He would like to emphasize that all of the videos on this channel were intended as entertainment, and there was no malicious purpose for any of them.
“Every video, including this one, is clearly labeled as speech synthesis in both the title and description. Which brings us to the reason why we’re delivering this message.
“Over the past two days, several videos were posted to the channel featuring a synthetic Jay-Z rapping various texts, including the Navy Seals Copypasta, the Book of Genesis, the song “We Didn’t Start the Fire” by Billy Joel, and the “To Be Or Not To Be” soliloquy from Hamlet.
“Unfortunately, for the first time since the channel began, YouTube took down two of these videos yesterday as a result of a copyright strike. The strike was requested by Roc Nation LLC, with the stated reason being that it, quote, “unlawfully uses an AI to impersonate our client’s voice.”
“Obviously, Donald and I are both disappointed that Jay-Z and Roc Nation have decided to bully a small YouTuber in this way. It’s also disappointing that YouTube would choose once again to stifle creativity by reflexively siding with powerful companies over small content creators. Specifically, it’s a little ironic that YouTube would accept “AI impersonation” as a reason for a copyright strike, when Google itself has successfully argued in the case of “Authors Guild v. Google” that machine learning models trained on copyrighted material should be protected under fair use.”
No Intent to Deceive
At its core, the controversy over deepfakes is about deception and disinformation. Earlier this year, Facebook and Twitter banned deepfakes that could mislead or cause harm, largely motivated by their potential impact on the 2020 elections.
Though it’s worth noting that the use of deepfakes for fake news is largely theoretical so far, as Samantha Cole covered for VICE, with most created for porn. (And, no, Joe Biden sticking his tongue out is not a deepfake.)
In this case, there’s no deception involved. As the creator wrote in his statement, every Vocal Synthesis video is clearly labeled as speech synthesis in the title and description, and falls outside of YouTube’s guidelines for manipulated media.
Copyright and Fair Use
With these takedowns, Roc Nation is making two claims:
These videos are an infringing use of Jay-Z’s copyright.
The videos “unlawfully uses an AI to impersonate our client’s voice.”
But is either of these true? With a technology this new, we’re in untested legal waters.
The Vocal Synthesis audio clips were created by training a model with a large corpus of audio samples and text transcriptions. In this case, the creator fed Jay-Z songs and lyrics into Tacotron 2, a neural network architecture developed by Google.
It seems reasonable to assume that a model and audio generated from copyrighted audio recordings would be considered derivative works.
But is it copyright infringement? Like virtually everything in the world of copyright, it depends—on how it was used, and for what purpose.
It’s easy to imagine a court finding that many uses of this technology would infringe copyright or, in many states, publicity rights. For example, if a record producer made Jay-Z guest on a new single without his knowledge or permission, or if a startup made him endorse their new product in a commercial, he would have clear legal recourse.
But, as the Vocal Synthesis creator pointed out, there’s a strong case to be made this derivative work should be protected as a “fair use.” Fair use can get very complicated, with different courts reaching different outcomes for very similar cases. But there are four factors judges use when weighing a fair use defense in federal court:
The purpose and character of the use.
The nature of the copyrighted work.
The amount and substantiality of the portion taken.
The effect of the use upon the potential market.
There’s a strong case for transformation with the Vocal Synthesis videos. None of the original work is used in any recognizable form. Rather than sampling it in a traditional way, the model uses an undisclosed set of vocal samples, stripped of their instrumentals and context, to generate an amalgam of the speaker.
And in most cases, it’s clearly designed as parody with an intent to entertain, not deceive. Making politicians rap, philosophers sing pop songs, or rappers recite Shakespeare pokes fun at those public personas in specific ways.
Vocal Synthesis is an anonymous and non-commercial project, with no advertising on the channel and no clear financial benefit to the creator, and the impact on the market value of Jay-Z’s discography is non-existent.
There are questions about the amount and substantiality of the borrowed work. But even if the model was trained on everything Jay-Z ever produced, it wouldn’t necessarily rule out a fair use defense for parody.
Ultimately, there are two clear truths I’ve learned about fair use from my own experiences: only a court can determine fair use, and while it might be a successful defense, fair use won’t protect you from getting sued and the costs of litigating are high.
Interviewing the Creator
As far as I know, this is the most prominent example of a celebrity claiming copyright over their own deepfakes, the first example of a musician issuing a takedown of synthesized vocals, and according to the creator, the first time YouTube’s removed a video for impersonating a voice with AI. (Previously, Conde Nast took down a Kim Kardashian deepfake by claiming copyright over the source video, and Jordan Peterson ordered a voice simulator offline.)
I reached out to the anonymous creator of Vocal Synthesis to learn more about how he makes these videos, his reaction to the takedown order, and his concern over the future of speech synthesis. (Unfortunately, Roc Nation didn’t respond to a request for comment.)
How do you feel about the takedown order? Were you surprised to receive it?
I was pretty surprised to receive the takedown order. As far as I’m aware, this was the first time YouTube has removed a video for impersonating a voice using AI. I’ve been posting these kind of videos for months and have not had any other videos removed for this reason. There are also several other channels making speech synthesis videos similar to mine, and I’m not aware of any of them having videos removed for this reason.
I’m not a lawyer and have not studied intellectual property law, but logically I don’t really understand why mimicking a celebrity’s voice using an AI model should be treated differently than someone naturally doing an (extremely accurate) impression of that celebrity’s voice. Especially since all of my videos are clearly labeled as speech synthesis in both the title and description, so there was no attempt to deceive anyone into thinking that these were real recordings of Jay-Z.
Can you talk a little about the effort that goes into generating a new model? For example, how long does it typically take to gather and train a new model until it sounds good enough to publish?
Constructing the training set for a new voice is the most time-consuming (and by far the most tedious) part of the process. I’ve written some code to help streamline it, though, so it now usually takes me just a few hours of work (it depends on the quality of the audio and the transcript), and then there’s an additional 12 hours (approximately) needed to actually train the model.
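As a rough illustration of what the tedious part involves, here is a minimal sketch of pairing audio clips with their transcripts. This is a guess at the general shape of the task, not the creator’s actual code. The `build_filelist` helper, the one-`.txt`-per-`.wav` folder layout, and the `clips` folder name are all hypothetical, and the pipe-delimited `path|transcript` output is an assumption modeled on the file lists that some open-source Tacotron 2 training scripts read.

```python
from pathlib import Path


def build_filelist(clip_dir: Path) -> list[str]:
    """Pair each .wav clip with a same-named .txt transcript.

    Returns lines of the form "path/to/clip.wav|transcript text".
    Clips with a missing or empty transcript are skipped.
    """
    lines = []
    for wav_path in sorted(clip_dir.glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if not txt_path.exists():
            continue  # no transcript yet, so the clip cannot be used
        # Collapse whitespace so each transcript fits on a single line.
        text = " ".join(txt_path.read_text(encoding="utf-8").split())
        if text:
            lines.append(f"{wav_path.as_posix()}|{text}")
    return lines


if __name__ == "__main__":
    # "clips" is a hypothetical folder name used only for illustration.
    entries = build_filelist(Path("clips"))
    print(f"Found {len(entries)} usable clip/transcript pairs")
```

Even with a helper like this, someone still has to cut the audio into clips and check every transcript by ear, which is presumably where those hours go.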
Are you using Tacotron 2 for synthesis?
Yeah, I’m using fine-tuned versions of Tacotron 2.
I saw you’ve struggled getting enough dialogue to fully develop some models, like with Mr. Rogers. Have there been other voices you’ve wanted to synthesize, but it’s just too challenging to find a corpus to work from?
Yeah, several. Recently I tried to make one for Theodore Roosevelt, but there’s only about 30 minutes of audio that exists for him (and it’s pretty poor quality), so the model didn’t really come out well.
The Crocodile Hunter (Steve Irwin) is another one I really want to do, and I can find enough audio, but I haven’t been able to find any accurate transcripts or subtitles yet (it’s very tedious for me to transcribe the audio myself).
How do you decide the voices and dialogue to pair together?
I try to consistently have all my voices read the Navy Seals Copypasta and the first few lines of the Book of Genesis, since it’s easier to hear the nuances of each voice when I can compare them to other voices reading the same text. Other than that, there’s no real method to it. If I have an idea for voice/text combination that I think would be funny or interesting enough to be worth the effort of making the video, then I’ll do it.
What do these videos mean to you? Is it more of a technical demonstration or a form of creative expression?
I wouldn’t really consider my videos to be a technical demonstration, since I’m definitely not the first to make realistic speech synthesis impersonations of well-known voices, and also the models I’m using aren’t state-of-the-art anymore.
Mainly, I’m just making these videos for entertainment. Sometimes I just have an idea for a video that I really want to exist, and I know that if I don’t make it myself, no one else will.
On the more serious side, the other reason I made the channel was because I wanted to show that synthetic media doesn’t have to be exclusively made for malicious/evil purposes, and I think there’s currently massive amounts of untapped potential in terms of fun/entertaining uses of the technology. I think the scariness of deepfakes and synthetic media is being overblown by the media, and I’m not at all convinced that the net impact will be negative, so I hoped that my channel could be a counterexample to that narrative.
Are you worried about the legal future for creative uses of this technology?
Sure. I expect that this technology will improve even more over the next few years, both in terms of accuracy and ease of use/accessibility. Right now it seems to be legally uncharted waters in some ways, but I think these issues will need to be settled fairly soon. Hopefully the technology won’t be stifled by overly restrictive legal interpretations.
It seems inevitable that, at some point, an artist’s voice is going to be used for other uses against their will: guesting on a track without permission, promoting products they aren’t paid for, or maybe just saying things they don’t believe. What would you say to artists or other public figures who are worried that this technology will damage their rights and image?
There are always trade-offs whenever a new technology is developed. There are no technologies that can be used exclusively for good; in the hands of bad people, anything can be used maliciously. I believe that there are a lot of potential positive uses of this technology, especially as it gets more advanced. It’s possible I’m wrong, but for now at least I’m not convinced that the potential negative uses will outweigh that.
Update: I just heard from Vocal Synthesis’s creator that the copyright strike was removed, and both videos are back on his channel. I initially suspected that Roc Nation dropped the copyright claim, but Nick Statt at The Verge reported that Google reviewed the DMCA takedowns.
“After reviewing the DMCA takedown requests for the videos in question, we determined that they were incomplete,” a Google spokesperson tells The Verge. “Pending additional information from the claimant, we have temporarily reinstated the videos.”
If Roc Nation provides the missing information to complete the DMCA requests, the videos will go offline again. Or, given the press coverage, they may choose to let it go. We’ll see!
Yesterday was my birthday, and like I’ve done for the last four years, I posted a single tweet that instantly destroyed my mentions for over 24 hours.
That tweet kicked off a paste party with over 2,000 replies, a potpourri of pure chaos and joy.
Random strings from emails and chat, passwords and 2FA tokens to unknown apps, screenshots and photos, obscure Unicode characters, dollar amounts from spreadsheets, bits of text in languages from Python to Esperanto, and so many links to articles, songs, videos, tweets, and obscure web pages.
It’s a momentary snapshot of digital ephemera, to be used and immediately discarded, much of it never meant to be seen by anyone and stripped of all context.
I first saw this idea in a private file-sharing/discussion community, and tried it on Twitter back in 2012, giving away copies of games and movies to people who replied with the contents of their clipboard. (Those attempts netted 14 and 24 replies, respectively, but Twitter won’t show threaded replies for older tweets.)
But the idea goes back much further. Discussion forums and message boards have played variations of the “Ctrl+V Game” (or “Ctrl+V Threads”) since at least the early 2000s. Some of them ran for years, like this 12-year-long thread from Ants Marching with 4,500 replies.
The earliest examples I found are this Usenet thread from May 2001 (thanks, Ben!) and this thread from October 2001, but pre-2001 digital archives are hard to search these days. I wouldn’t be surprised if this idea went back to forums, Usenet, and BBSes in the ’80s or ’90s. (Add a comment if you know more!)
Without context, everything seems more mysterious. You wonder what it meant, or why someone had it in their clipboard.
It’s a great way to discover interesting links to music, video, articles, and web pages, because if it was in someone’s clipboard, it probably means they found it interesting enough to send to someone.
Our clipboards show temporary glimpses of work in progress, whether it’s art, design, or code.
And so many good videos.
It’s also a snapshot of a moment in time: we’re at the height of a global pandemic, and our clipboards reflect it in the content we’re copying.
This tiny peek into everyone’s lives — their work, interests, and concerns, or even just the mundane momentary ephemera that’s forgotten two seconds later — is the perfect birthday gift.
Three years ago, my wife Ami designed and developed her first game, a charming conversational card game called You Think You Know Me, which went on to sell over 9,000 copies around the world and is now close to selling out its second print run.
I loved helping out with the package and card design for You Think You Know Me, a return to my pre-web career in desktop publishing and print production, as well as making the official homepage to support it. (The cards are all CSS!)
The followup to her first game is Flatter Me, a new game where you compete with friends to give compliments, with rules similar to the classic card game of War. It takes literally seconds to learn, explained in full in the project video below.
Each of the 250 cards has a unique compliment on it, and you can give them away as little tokens of affection.
Once again, I helped out with the packaging and card designs, and if it hits its goal, you can expect to see a site at flatterme.cards once it’s officially on sale.
I know I’m biased, but Ami’s games have a gentle sweetness that really resonates with me. They’re all designed to bring people together, whether it’s by learning more about people you love or simply by telling them how much they mean to you.
Her games have rules and win conditions like any other card game, but they’re so quick and easy to understand that they become a convenient framework to enrich the connections between friends, family, and partners.
Flatter Me is now funding on Kickstarter, currently at 95% funded (!) with three days to go, and I’d love it if you checked it out or helped spread the word. Thanks!
If you’ve ever looked at any newsworthy amateur video posted to Twitter, you’ll see an inevitable chorus of news organizations and broadcast journalists in the replies, usually asking two questions:
Did you shoot this video?
Can we use it on all our platforms, affiliates, etc with credit?
I’ve returned regularly since Corey launched it and, as expected, it’s a powerful way of tracking a particular type of breaking news: visual stories with footage captured by normal people at the right place and right time.
Much of it is of interest only to local news channels: traffic accidents, subway mishaps, a wild animal on the loose, the occasional building fire.
But frequently, Bbbreaking News shows the impact of gun violence and climate change: a near-constant stream of active shooter scenarios, interspersed with massive brush fires, catastrophic flooding, and extreme weather events.
It’s a fascinating way to see the stories that broadcast media is currently tracking, and to view their sources before they can even report on them, captured by the people stuck in the middle.
I recommend checking it out. Thanks to Corey for running with the idea and saving me the effort of building it myself!