Fast and Free Music Separation with Deezer’s Machine Learning Library

Cleanly isolating vocals from drums, bass, piano, and other musical accompaniment is the dream of every mashup artist, karaoke fan, and producer. Commercial solutions exist, but can be expensive and unreliable. Techniques like phase cancellation have very mixed results.

The engineering team behind streaming music service Deezer just open-sourced Spleeter, their audio separation library built on Python and TensorFlow that uses machine learning to quickly and freely separate music into stems. (Read more in today’s announcement.)

You can train it yourself if you have the resources, but the models Deezer released already far surpass any free tool I know of, and rival commercial plugins and services. The library ships with three pre-trained models:

  • Two stems – Vocals and Other Accompaniment
  • Four stems – Vocals, Drums, Bass, Other
  • Five stems – Vocals, Drums, Bass, Piano, Other

Installing the library, including Conda, took only a couple of minutes, and processing audio was much faster than I expected.
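For anyone curious what that looks like in practice, here's a minimal sketch of a two-stem separation using Spleeter's Python API as documented in the initial release; the file and directory names are placeholders.

```python
# Minimal sketch of a two-stem separation with Spleeter's Python API,
# based on the 1.x README. Install first with `pip install spleeter`
# (or via Conda). File and directory names are placeholders.
from spleeter.separator import Separator

# Load the pre-trained two-stem model (vocals / accompaniment).
# Swap in 'spleeter:4stems' or 'spleeter:5stems' for the other models.
separator = Separator('spleeter:2stems')

# Writes vocals.wav and accompaniment.wav into a subfolder of output/.
separator.separate_to_file('song.mp3', 'output/')
```

The command-line equivalent is roughly `spleeter separate -i song.mp3 -p spleeter:2stems -o output`.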

On my five-year-old MacBook Pro using the CPU only, Spleeter processed audio at a rate of about 5.5x faster than real-time for the simplest two-stem separation, or about one minute of processing time for every 5.5 minutes of audio. Five-stem separation took around three minutes for 5.5 minutes of audio.

When running on a GPU, the Deezer team reports speeds of 100x faster than real-time for four-stem separation, converting 3.5 hours of music in less than 90 seconds on a single GeForce GTX 1080.

Sample Results

But how are the results? I tried a handful of tracks across multiple genres, and all performed incredibly well. Vocals sometimes get a robotic autotuned feel, but the amount of bleed is shockingly low relative to other solutions.

I ran several songs through the two-stem separation, which is the fastest and most useful. The 30-second samples below are its output, with links to the original studio tracks where available.

🎶 Lizzo – “Truth Hurts”

Lizzo (Vocals Only)
Lizzo (Music Only)

Compare the above to the isolated vocals generated by PhonicMind, a commercial service that uses machine learning to separate audio, starting at $3.99 per song. The piano is audible throughout PhonicMind’s track.

🎶 Led Zeppelin – “Whole Lotta Love”

Led Zeppelin (Vocals Only)
Led Zeppelin (Music Only)

For comparison, here are the original isolated vocals from the master tapes. Spleeter gets a bit confused by the background vocals, and the secondary slide guitar bleeds into the vocal track.

🎶 Lil Nas X w/Billy Ray Cyrus – “Old Town Road (Remix)”

Lil Nas X (Vocals Only)
Lil Nas X (Music Only)

Part of the beat makes it into Lil Nas X’s vocal track. No studio stems are available, but a fan used the Diplo remix to create this vocals-only track for comparison.

🎶 Marvin Gaye – “I Heard It Through the Grapevine”

Marvin Gaye (Vocals Only)
Marvin Gaye (Music Only)

Some of the background vocals get included in both tracks here, which is probably great for karaoke, but may not be ideal for remixing. Compare this to 1:10 in the studio vocals.

🎶 Billie Eilish – “Bad Guy”

Billie Eilish (Vocals Only)
Billie Eilish (Music Only)

I thought this one would be a disaster—the vocals are heavily processed and lower in the mix with a dynamic bass dominating the song—but it worked surprisingly well, though some of the snaps bleed through.

🎶 Van Halen – “Runnin’ With The Devil”

Van Halen (Vocals Only)
Van Halen (Music Only)

Spleeter had a difficult time with this one, but the results still aren't bad. Compare them to David Lee Roth's famously viral isolated vocals, dry with no vocal effects applied.

Open-Unmix

Spleeter's release comes shortly after Open-Unmix, another open-source separation library for Python that similarly uses deep neural networks (built on PyTorch rather than TensorFlow) for source separation.

In my testing, Open-Unmix separated audio at about 35% of the speed of Spleeter, didn’t support MP3 files, and generated noticeably worse results. Compare the output from Open-Unmix below for Lizzo’s isolated vocals, with drums clearly audible once they kick in at the 0:18 mark.

The quality issues can likely be attributed to the model released with Open-Unmix, which was trained on a relatively small set of 150 songs from the MUSDB18 dataset. The team behind Open-Unmix is also working on “UMX PRO,” a more extensive model trained on a larger dataset, but it’s not publicly available for testing.

What Now?

Years ago, I made a goofy experiment called Waxymash, taking four random isolated music tracks off YouTube and colliding them into the world’s worst mashup. But I was mostly limited to the small number of well-known songs whose stems had leaked online, or the few that could be separated cleanly with channel manipulation.

With processing speeds 100 times faster than real-time playback on a single GPU, it’s now possible to turn virtually any recording into a mashup or karaoke track without access to the original stems. It may not be legal, but it’s definitely possible.

What would you build with it? I’d love to hear your ideas.

Thanks to Paige for the initial tip!

Updates

This thing is dangerously fun.

November 11. You can now play with Spleeter entirely in the browser with Moises.ai, a free service by Geraldo Ramos. After you upload an MP3, it emails you a link to download the stems.

Also, the Deezer team made Spleeter available as a Jupyter notebook in Google Colab. In my testing, larger audio files won’t play directly within Colab and need to be downloaded before you can listen to them.

Comments

    Thank you for this great post! Would this model/network lend itself to be used with live audio? If so, how would you approach it?

    @Onno: I tested it out on Cheap Trick’s “I Want You to Want Me” from Live at Budokan, and virtually none of the crowd noise carried over to the vocal-only stem. The crowd noise is all clearly audible on the music-only stem.

    Thank you for your quick response. I now see that I did not formulate my question very well. What I meant by ‘live’ is ‘real-time’, as in feeding the model a continuous data stream from a microphone, so that you could listen to ‘voice-only’ or ‘instrument-only’ while attending a live concert. Would that be possible?

    I’m hearing some weird aliasing — is that a byproduct of the process or are these mp3s really stepped on? Still, pretty friggin amazing.

    Any artifacts are a result of the separation process. That said, the model can be improved over time as it’s trained on additional sources. I expect these results to get noticeably better as time goes on.

    Amazing. I look forward to the day when isolating vocals can be done in something the size of a hearing aid. Imagine how useful it would be for the hearing-impaired!

    @Otto: The model is designed to remove instrumentation, not audience noise. That said, it should be possible for someone to train a new model on audience-recorded concerts, and compare it with soundboard recordings from the same live shows, to effectively remove crowd sounds from live bootlegs.

    Hi Andy. More than six months since this comment. Has anyone trained the AI to remove audience noise?

    From my testing so far, this only works with MP3 and WAV; no luck with FLAC or M4A. Also, the maximum output duration is 10 minutes, so any song over 10 minutes gets cut off. Any way around that?

    Strange, FLAC worked fine for me. You can change the maximum duration with the “--max_duration” flag, which takes a value in seconds. (By default, it’s 600, so it only generates the first ten minutes.)

    Hi Geraldo, could you change the max duration flag, or provide an option to do so when uploading?

    Post-production audio avenues for this kind of separation are king for us dubbing mixers with no time to sit and do this manually – typical “day” TV mix times have been slashed worldwide – this algorithmic help is a godsend – stems are fine – the money’s in the isolation of badly recorded dialogue – on these demo examples, this sounds better than iZotope’s and Acon’s stem-splitting algorithms. ISST was interesting, and I’ve used it to get better results than ERA, iZotope, Acon, and Waves – your shit sounds good! /chapeau

    Hi Andy, thanks for this article. We are the developers of a unique neural network with over 45 million parameters, trained on 200TB of data, recently rolled out at https://www.lalal.ai. We would be happy if you could test some audio splitting and tell us your opinion 😉 Thanks once again!

    Andy, can you please tell me which content you tried to process to get that result?

    I tried a few songs including Billie Eilish’s “Bad Guy,” Smash Mouth’s “All Star,” Journey’s “Don’t Stop Believing,” and 10cc’s “The Things We Do for Love.” To my ear, Lalal’s output sounded really similar to Spleeter’s 16kHz model, but slightly muddier and with more artifacts. Are you using Spleeter?

    Thanks for testing. We will increase the file size limit for trials, so you can test it on WAV and FLAC for better quality.

    Andy, we don’t use Spleeter code; Lalal.ai is our own trained neural network. We did a series of tests showing that our splitter does a better job. Would you like to have a look at our research? I can send it to you by email, just ping me 🙂

    I just ran Rush’s “Subdivisions” through both Spleeter and Moises.ai. It’s a night-and-day difference: Spleeter was absolutely superior on the vocal track. Moises wouldn’t let me download the rest of the separated tracks, so I can only assume the others are just as bad. I would give Spleeter 7/10 and Moises.ai 5/10.
