Dirty, Fast, and Free Audio Transcription with YouTube

Five years ago, I wrote about how I transcribe audio with Amazon’s Mechanical Turk, splitting interviews into small segments and distributing the work among dozens of anonymous people. It ended up as one of my most popular posts ever, continuing to draw traffic and comments every day.

Lately, I’ve been toying with a free, fast way to generate machine transcriptions: repurposing YouTube’s automatic captions feature.

How It Works

Every time you upload a video, YouTube tries to generate a caption file. If there’s audible speech, you can grab a subtitle file within a few minutes of uploading the video.

But how’s the quality? Pretty mediocre! It’s about as good as you’d expect from a free machine-generated transcript. The caption files have no punctuation between sentences, speakers aren’t broken out separately, and errors are very common.

But if you’re transcribing interviews, it’s often easier to edit a flawed transcript than to start from scratch. And YouTube provides a solid interface for editing your transcript against the audio and getting the results in plaintext.
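If you end up downloading a timed caption file instead of copying the plaintext, stripping the timecodes takes only a few lines. Here is a minimal Python sketch, assuming the .sbv layout (a `start,end` timestamp line, one or more lines of caption text, then a blank line). The function name `sbv_to_text` and the sample captions are made up for illustration.

```python
import re

# Assumed SBV cue layout: "H:MM:SS.mmm,H:MM:SS.mmm" on its own line,
# followed by caption text, followed by a blank line.
TIMESTAMP_LINE = re.compile(r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}$")


def sbv_to_text(sbv: str) -> str:
    """Drop timestamp and blank lines; join the caption text with spaces."""
    pieces = []
    for line in sbv.splitlines():
        line = line.strip()
        if not line or TIMESTAMP_LINE.match(line):
            continue
        pieces.append(line)
    return " ".join(pieces)


# Hypothetical sample in the assumed format.
sample = """0:00:00.000,0:00:03.500
good morning less than an hour

0:00:03.500,0:00:07.000
aircraft from here will join others
"""

print(sbv_to_text(sample))
```

The result is one long unpunctuated run of words, which is all the automatic captions give you anyway.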

I used TunesToTube, a free service for uploading MP3s to YouTube, to upload the first 15 minutes of our New Disruptors interview, with permission from Glenn Fleishman.

It took about 30 seconds for TunesToTube to generate the 15-minute-long video, three seconds to upload it, and about a minute for the video to be viewable on my account.

It takes a bit more time for YouTube to generate the audio transcriptions. Testing in the middle of a weekday, it took about six minutes to transcribe a two-minute video, and around 30 minutes for the 15-minute video. Fortunately, there’s nothing you need to do while it processes. Just upload and wait.

I ran a number of familiar film monologues through YouTube’s transcription engine, and the results vary from solid to laughably bad. I’ve posted the videos below with the automatic transcription and their actual text.

As you’d expect, it works best with clear enunciation and spoken word. Soft words over background music, like in the Breakfast Club clip, fall apart pretty quickly. But some, like Independence Day, aren’t terrible.


good morning less than an hour aircraft from here will join others from around the world and you will be launching largest aerial battle in the history of mankind man China word should have new meaning for almost a day we can be consumed by amp petty differences anymore we will be united in our common interest perhaps space today’s 4th July and you will once again be fighting for our freedom not from tyranny and oppression and persecution from annihilation we’re fighting for our right to live to exist and should we win the day the fourth of July will no longer be known as an American holiday but as the day when the world declared in one voice we will not go quietly into the night we will not Nash fine we’re going to survive ok today we celebrate I independence

Good morning. In less than an hour, aircraft from here will join others from around the world and you will be launching the largest aerial battle in the history of mankind. Mankind. That word should have new meaning for all of us today. We can’t be consumed by our petty differences anymore. We will be united in our common interest, perhaps it’s fate that today’s the fourth of July and you will once again be fighting for our freedom. Not from tyranny, oppression or persecution, but from annihilation. We’re fighting for our right to live, to exist. And should we win the day, the fourth of July will no longer be known as an American holiday but as the day when the world declared, in one voice, we will not go quietly into the night! We will not vanish without a fight! We’re going to live on, we’re going to survive! Today, we celebrate our Independence Day!

The Princess Bride:

he didn’t say it he didn’t do it wouldn’t you agree your highness technicality that will shortly be reminded but first things first to death to the pay quite familiar with that phrase unexplained and I’ll use small search will be sure to understand you warthogs faced soon maybe you the first time in my life a man is dead insult me it won’t be the last to the pain is the first thing you lose will be your feet below the ankles in your hands at the race next who knows and then my tongue I suppose I killed you to quickly the last time a mistake I don’t mean to duplicate tonight I wasn’t finished the next thing you lose will be left I followed by a right and then my ears I understand let’s get on with it your is you keep not tell you why so that every shriek every child seeing your hideous this will be used to church every baby that weeps your approach every woman who cries out did god what is that thing will echo in your perfect is that is what the pain means it means I’d leave you in anguish wallowing freakish misery for I think you’re bluffing possible paying I might be bluffing conceivable you miserable comment is mass moneyline because I lacked the strength to stand and perhaps I have the strength after troop your so the sea up

“He didn’t say it, he didn’t do it. Wouldn’t you agree, your highness?” “A technicality that will shortly be remedied. But first things first. To the death!” “No! To the pain.” “I don’t think I’m quite familiar with that phrase.” “I’ll explain and I’ll use small words so that you’ll be sure to understand, you warthog-faced buffoon.” “That may be the first time in my life a man has dared insult me.” “It won’t be the last. To the pain means the first thing you lose will be your feet below the ankles, then your hands at the wrist, next your nose.” “And then my tongue, I suppose. I killed you too quickly the last time, a mistake I don’t mean to duplicate tonight.” “I wasn’t finished! The next thing you lose will be your left eye followed by your right.” “And then my ears, I understand! Let’s get on with it.” Wrong! Your ears you keep, and I’ll tell you why. So that every shriek of every child at seeing your hideousness will be yours to cherish. Every babe that weeps at your approach, every woman who cries out ‘dear god! what is that thing’ will echo in your perfect ears. That is what ‘to the pain’ means. It means I leave you in anguish, wallowing in freakish misery forever.” “I think you’re bluffing.” “It’s possible, pig. I might be bluffing. It’s conceivable, you miserable vomitous mass, I’m only lying here because I lack the strength to stand. Then again, perhaps I have the strength after all. Drop your sword. Have a seat.”

Glengarry Glen Ross:

let me have your attention for a moment you talk about what talking about bitching about that sell you shot some sort of a bitch don’t wanna buy land somebody know what we sell some broad trying to screw so forth let’s talk about something important here one I’m gone anyway let’s talk about something important put that coffee down coffee’s for closers on is a gunfight with you I’m not fucking I’m here from downtown I’m here from mention Murray and I’m here on a mission of mercy James Levine you call yourself a salesman’s edge analyst ship you certainly don’t POW does the good news is you’re fired the bad news is you’ve got all you’ve got just one week to regain your job starting with tonight’s starting with tonight’s sit I’ll I got your attention now good was wearing a little something to this month sales contest as you all know first prize Cadillac anybody wanna see second prize second prizes third prizes you fight picture laughing now God leads mention Murray paid good money get their names to sell them you can’t close the leisure giving you can’t call shit you shit the bricks POW and beat it cuz you are going out the leads are weak leads week fuckin reads all week you’ll I’ve been in this business fifteen years what’s your name you that’s my name you know why mister as you drove a Hyundai to get here tonight I drove an eighty thousand dollar BMW thats my name

Blake: You’re talking about what. You’re talking about… Bitching about that sale you shot, some sonofabitch who don’t wanna buy land, some broad you’re trying to screw, so forth. Let’s talk about something important. Are they all here?

Williamson: All but one.

Blake: I’m going anyway. Let’s talk about something important. Put. That coffee. Down. Coffee’s for closers only. You think I’m fucking with you? I am not fucking with you. I’m here from downtown. I’m here from Mitch and Murray. And I’m here on a mission of mercy. Your name’s Levine? You call yourself a salesman you son of a bitch?

Dave Moss: I don’t gotta sit here and listen to this shit.

Blake: You certainly don’t pal, ’cause the good news is – you’re fired. The bad news is – you’ve got, all of you’ve got just one week to regain your jobs starting with tonight. Starting with tonight’s sit. Oh? Have I got your attention now? Good. ‘Cause we’re adding a little something to this month’s sales contest. As you all know first prize is a Cadillac El Dorado. Anybody wanna see second prize? Second prize is a set of steak knives. Third prize is you’re fired. Get the picture? You laughing now? You got leads. Mitch and Murray paid good money, get their names to sell them. You can’t close the leads you’re given, you can’t close shit. You are shit. Hit the bricks pal, and beat it ’cause you are going OUT.

Shelley Levene: The leads are weak.

Blake: The leads are weak? Fucking leads are weak. You’re weak. I’ve been in this business 15 years…

Dave Moss: What’s your name?

Blake: Fuck you. That’s my name. You know why, mister? You drove a Hyundai to get here tonight. I drove an eighty-thousand dollar BMW. That’s my name.

The Breakfast Club:

give mister burning we accept the fact that we had to sacrifice a whole Saturday attention for every Wed but we think you created to make it rain and you think we are she is as you want here in simplest terms mostly definition applied we found out is that each one this is a brain an act great and a basket case a princess and the crime then she questioned into yours practically done the

Dear Mr. Vernon, We accept the fact that we had to sacrifice a whole Saturday in detention for whatever it was we did wrong, but we think you’re crazy to make us write an essay telling you who we think we are. You see us as you want to see us. In the simplest terms and the most convenient definitions. But what we found out is that each one of us is a brain, and an athlete, and a basketcase, a princess, and a criminal. Does that answer your question? Sincerely yours, the Breakfast Club.
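If you want a rough number for how far an automatic transcript drifts from the actual text, one common yardstick is word error rate: the word-level edit distance between the two, divided by the length of the reference. This is a minimal sketch, not something I ran on the clips above. The tokenizer is a simplifying assumption (lowercase, punctuation dropped), and the function names are made up for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text has no words")
    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the captions
            insertion = curr[j - 1] + 1  # extra word in the captions
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One line from the Independence Day clip: actual text vs. automatic captions.
wer = word_error_rate(
    "We will not vanish without a fight!",
    "we will not Nash fine",
)
print(f"{wer:.2f}")  # prints 0.57
```

Four of the seven reference words are wrong or missing in that one line, which is about as bad as it feels when you read it.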


Obviously, this is no replacement for human transcription, but it’s potentially a good starting point for your own transcription efforts, or as raw material to feed Mechanical Turk. Paying someone to edit a flawed transcript may be easier than starting from nothing. Let me know if you end up playing around with this.

40 thoughts on “Dirty, Fast, and Free Audio Transcription with YouTube”

  1. Google Voice creates audio transcripts of voice mail–and emails the text to you. I wonder if it uses the same engine as YouTube–or if it’s in any way better optimized for voice?

    Downside of Google Voice as a transcription method certainly is the requirement of real time “input”–e.g., playing an audio file in real time over the phone into a voice mail. And there may be a length limit… but might be another method worth trying.

  2. I just tried this, and it didn’t work. I tried videos 13 and 20 minutes long. My guess is that it’s because the main person speaking had an accent? Otherwise, no idea.

  3. I had the very same idea earlier today. Then I found your post. For lectures, which I have thousands of hours of (example https://www.youtube.com/watch?v=b11AXknrsEI), it is pretty close. I am also interested in getting the time index of the phrases. I found something that gets the XML out of it. I’ll see what happens.

  4. What about audio files – no video? I have a bunch of recordings of my dad (now deceased) telling stories about his life. Does YouTube need video?

  5. Well, I answered my own question. I tried it, and the file was too big (it’s a 30-minute recording).

  6. When I tried this, it didn’t automatically caption the speech, it just gave me the option to transcribe it myself. What gives? Maybe because it was classified as a music file? So I changed that, but still it didn’t automatically caption.

    If only there were a way to obtain YouTube’s machine transcription software and bypass the upload process altogether!

  7. The uSubtitle service is not free but it does provide an automated speaker independent, speech to text service for media files along with the tools to correct the text. We have a free tool for transcribing media at uTranscribe.tv but it does not offer Speech to Text.

  8. OK – it worked for me, but I can’t seem to download the file in a readable format. Any tips?????

  9. Lerissa: use Windows Live Movie Maker to turn your audio into a video. Simply add any one picture in Movie Maker and then add your audio file. After that, make sure that you click ‘fit to music’ so that the picture is stretched across your audio file. Save the video to your disk and it’s ready to upload to YouTube.

  11. To all those whose captions didn’t work:

    After uploading your video, it is quite likely that YouTube will not give you the “English (Automatic)” language option. That could be due to many reasons. But don’t worry, I have a couple of tricks that work for me. Trick 1: On YouTube, in the Video Manager area, go to Edit and click ‘Enhancement’. It can take up to an hour for YouTube to enhance your video, depending on its size. After a couple of hours you should get the “English (Automatic)” language option. Trick 2: Leave your video overnight, and the next day you should get the automatic language option.

  12. It was useful in that it lets me type and does not play until I stop typing, but it did not transcribe it for me. I need an automatic transcription.

    But thanks anyway.

  14. Wow, I’m surprised how accurate the Independence Day Bill Pullman (or is it Paxton?) speech transcript came out. Whenever I’ve uploaded a video with myself or one of my friends talking, even if the background is pretty near silent, the YouTube transcript comes out more like The Princess Bride one. Maybe it’s our weird voices. Can definitely see how this is a good way to get a machine transcription for free if you have a video with great sound clarity and little or no background noise (and perhaps a voice as clear as Lone Starr’s up there). 🙂

  15. My video has the captions, but how can I download them into one long text format for editing? I’m only able to see them as they appear on the screen so far… How did you do it?

  16. The machine transcription capability is far from perfect. Transcriberly.com has a machine that does transcription at about 65%-95% accuracy depending on the quality and number of speakers. As mentioned above, it isn’t so much about being perfect, but rather giving you a head start so you don’t have to type every word.

  17. Works great to get me started with a script that I can work from.

    I also took it one step further in terms of the captioning:

    I downloaded the time-coded file, cleaned it up and replaced it. Then I took the file, pasted the contents into Google Translate and generated a Spanish version. A little more editing (at the minimum you need to fix how it mangles the time-coding) and voilà… Spanish subtitles!

  18. Hey! I am really trying to generate captions for some lecture videos, but it’s not happening! The audio quality is pretty clear!

  19. I’ve seen services that allow you to rip a YouTube video, but I haven’t seen a tool that can simply grab the transcript from a video instead.

    Andy or anyone else here, do you know of any kind of tool that you can point a YouTube url (even if it’s not your own) at and have it generate the transcript for you to download?

    It looks like the tools described here only allow you to work with transcriptions from videos that are your own, right?

  20. Trish, sorry it’s late in the day for this, but put the text of the video into a plain text document, not Word. Upload the text document to YouTube and it will automatically sync the text to the speech, something that it will not do with a Word document.

    If anybody would like assistance with video transcription, which is a cost-effective way to instantly improve your SEO ranking, click this link: http://bit.ly/1mPCHsa

    If you would like me to transcribe your audio follow this link: http://bit.ly/258fUKg

    I touch type (I do not need to look at the text; I just hear the audio and type) at a speed of between 70 and 80 words per minute and maintain a regular accuracy rate of 98%. Please visit http://www.virtuadmin.uk where you can find lots of social media resources, articles, hints & tips and the results of all the typing tests I’ve undertaken on a regular basis.

  21. Given how painful transcribing audio is, people repeatedly ask us why there is still no software that can automatically take an audio file and spit out its transcribed text with good accuracy. Now, it’s not entirely true that there is no such software – there are many, but they don’t help in transcribing real-world audio, which typically involves handling multiple voices and all kinds of background noises.

    Everybody has got their own style of speaking

    Training a machine to recognize human voice has proven to be very difficult due to the variations in how people speak a particular language. Despite being the most widely spoken language, English itself sounds considerably different in various parts of the world.

    Even if everyone spoke a language the exact same way, there is still the added difficulty of training the system for different voices – from young to old to male to female to hoarse to soft to – you get the drift. Even the same person tends to speak differently in different situations, for example, during a moment of excitement or a bout of cold. Let’s not also forget that some people speak faster, while others speak slower, with lots of ums, ers and uhs, which aren’t even part of any spoken language! Arriving at a speech model that can handle all these variations (like humans do) is really tricky.

    Ambiguity caused by homophones and word boundaries

    Interpreting speech requires a good understanding of the overall context. Humans are gifted with the ability to interpret fuzzy data and automatically deduce the missing parts based on the context. Machines are really bad at disambiguating the meaning of words and phrases as they lack the ability to comprehend the bigger picture.

    A homophone is a word that is pronounced the same way as another word but differs in meaning. For example, take the following two phrases:

    the sail of a boat

    the sale of a boat

    The words “sail” and “sale” sound the same, and there is no way to distinguish between the two without first understanding the overall context. Such a “context” might not even be available till later on in the speech!

    Human speech tends to be continuous with no natural pauses between words. This poses a difficult challenge: where should a waveform be split to form meaningful words? Given a sequence of sounds, realigning the sounds to form different word boundaries can produce vastly different sentences:

    It’s not easy to wreck a nice beach.

    It’s not easy to recognize speech.

    It’s not easy to wreck an ice beach.

    Once again, an accurate transcription requires an understanding of what the speaker is trying to say in the context of the full speech.

    If the words are spoken slowly, with a clear pause after every word, machines stand a better chance. This is another reason why today’s technology is better at handling dictation and transcription of short sentences or commands than conversational audio.

    Time and resource intensive

    Speech recognition is an incredibly complex and resource intensive process. One requires a lot of tagged audio samples to train the system to recognize the plethora of variations in human speech. The fact that there are, at the very least, a quarter of a million distinct English words does not help. In addition, storing and processing such large amounts of high quality audio samples requires significant engineering resources.

    So, are we stuck?

    Based upon what I’ve read, and my experience, it takes a minimum of 4-5 hours to transcribe 1 hour of digitally recorded interviews. Now the real question is, what’s your time worth?

    The transcription process takes precision and accuracy, commitment and dedication, focus and patience.

    My suggestion is to use one of the transcription services, like http://www.GoTranscript.com

    They are used by Harvard and Stanford. The fact that such top universities trust them with their transcripts speaks to their quality.

  22. Thanks for this! I uploaded my videos to YouTube and waited a couple of hours and the automatic captions were generated.

    Now to figure out how to get them out in a usable (for my purpose) format. The download button includes timecodes which I don’t really need, but I can open the .SBV file in a text editor quite nicely. (.SRT and .VTT have extra stuff in them that I don’t need.)

    KeepSubs no longer seems to exist. Oh well, a project for next week! At least now I don’t have to type out every word myself!

Comments are closed.