Cleanly isolating vocals from drums, bass, piano, and other musical accompaniment is the dream of every mashup artist, karaoke fan, and producer. Commercial solutions exist, but can be expensive and unreliable. Techniques like phase cancellation have very mixed results.
The engineering team behind streaming music service Deezer just open-sourced Spleeter, their audio separation library built on Python and TensorFlow that uses machine learning to quickly and freely separate music into stems. (Read more in today’s announcement.)
You can train it yourself if you have the resources, but the three models they released already far surpass any available free tool that I know of, and rival commercial plugins and services. The library ships with three pre-trained models:
- Two stems – Vocals and Other Accompaniment
- Four stems – Vocals, Drums, Bass, Other
- Five stems – Vocals, Drums, Bass, Piano, Other
It took a couple minutes to install the library, which includes installing Conda, and processing audio was much faster than expected.
On my five-year-old MacBook Pro using the CPU only, Spleeter processed audio at a rate of about 5.5x faster than real-time for the simplest two-stem separation, or about one minute of processing time for every 5.5 minutes of audio. Five-stem separation took around three minutes for 5.5 minutes of audio.
When running on a GPU, the Deezer team report speeds 100x faster than real-time for four stems, converting 3.5 hours of music in less than 90 seconds on a single GeForce GTX 1080.
Sample Results
But how are the results? I tried a handful of tracks across multiple genres, and all performed incredibly well. Vocals sometimes get a robotic autotuned feel, but the amount of bleed is shockingly low relative to other solutions.
I ran several songs through the two-stem filter, which is the fastest and most useful. The 30-second samples are the separations from the simplest two-stem model, with links to the original studio tracks where available.
🎶 Lizzo – “Truth Hurts”
Compare the above to the isolated vocals generated by PhonicMind, a commercial service that uses machine learning to separate audio, starting at $3.99 per song. The piano is audible throughout PhonicMind’s track.
🎶 Led Zeppelin – “Whole Lotta Love”
The original isolated vocals from the master tapes for comparison. Spleeter gets a bit confused with the background vocals, with the secondary slide guitar bleeding into the vocal track.
🎶 Lil Nas X w/Billy Ray Cyrus – “Old Town Road (Remix)”
Part of the beat makes it into Lil Nas X’s vocal track. No studio stems are available, but a fan used the Diplo remix to create this vocals-only track for comparison.
🎶 Marvin Gaye – “I Heard It Through the Grapevine”
Some of the background vocals get included in both tracks here, which is probably great for karaoke, but may not be ideal for remixing. Compare this to 1:10 in the studio vocals.
🎶 Billie Eilish – “Bad Guy”
I thought this one would be a disaster—the vocals are heavily processed and lower in the mix with a dynamic bass dominating the song—but it worked surprisingly well, though some of the snaps bleed through.
🎶 Van Halen – “Runnin’ With The Devil”
Spleeter had a difficult time with this one, but still not bad. You can compare the results generated by Spleeter to the famously viral isolated vocals by David Lee Roth, dry with no vocal effects applied.
Open-Unmix
The release of Spleeter comes shortly after the release of Open-Unmix, another open-source separation library for Python that similarly uses deep neural networks with TensorFlow for source separation.
In my testing, Open-Unmix separated audio at about 35% of the speed of Spleeter, didn’t support MP3 files, and generated noticeably worse results. Compare the output from Open-Unmix below for Lizzo’s isolated vocals, with drums clearly audible once they kick in at the 0:18 mark.
The quality issues can likely be attributed to the model released with Open-Unmix, which was trained on a relatively small set of 150 songs available in the MUSDB18 dataset. The team behind Open Unmix is also working on “UMX PRO,” a more extensive model trained on a larger dataset, but it’s not publicly available for testing.
What Now?
Years ago, I made a goofy experiment called Waxymash, taking four random isolated music tracks off YouTube, and colliding them into the world’s worst mashup. But I was mostly limited to a small number of well-known songs that had their stems leak online, or the few that could be separated cleanly with channel manipulation.
With processing speeds at 100 times faster than real-time playback on a single GPU, it’s now possible to turn all recorded music into a mashup or karaoke without access to the source audio. It may not be legal, but it’s definitely possible.
What would you build with it? I’d love to hear your ideas.
Thanks to Paige for the initial tip!
Updates
This thing is dangerously fun.
November 11. You can now play with Spleeter entirely in the browser with Moises.ai, a free service by Geraldo Ramos. After uploading an MP3, it will email you a link to download the stems.
Also, the Deezer team made Spleeter available as a Jupyter notebook within Google Colab. In my testing, larger audio files won’t play directly within Colab, and will need to be downloaded first to listen to.