Transcribing Speech to Text with Python and Google Cloud Speech API

Sample Results
1. Sign Up for a Free Tier Account
2. Generate an API Key
3. Convert Audio File to Wav format
4. Break up audio file into smaller parts
5. Install required Python modules
6. Running the Code
- The slow version
- Faster version
Conclusion

This tutorial walks through using the Google Cloud Speech API to transcribe a large audio file.

All code and sample files can be found in the speech-to-text GitHub repo.

Transcribe large audio files using Python & our Cloud Speech API. @akras14 shows how https://t.co/dY56lmE0TD

— Google Cloud (@googlecloud) January 11, 2018

Sample Results

This approach works, but I found that results vary greatly with the quality of the input audio.

Transcribing a Reading by My Wife

I asked my wife to read something out loud for about 1.5 minutes, as if she were dictating to Siri. She is a native English speaker, and we recorded using the built-in microphone of an iPhone 6s.

Which resulted in the following transcript:

00:00:00 this Dynamic Workshop aims to provide up to date information on pharmacological approaches, issues, and treatment in the geriatric population to assist in preventing medication-related problems, appropriately and effectively managing medications and compliance. The concept of polypharmacy parentheses taking multiple types of drugs parentheses will also be discussed, as the
00:00:30 is a common issue that can impact adverse side effects in the geriatric population. Participants will leave with a knowledge and considerations of common drug interaction and how to minimize the effects that limit function. Summit professional education is approved provider of continuing education. This course is offered for 6
00:01:00 . this course contains a Content classified under the both the domain of occupational therapy and professional issues.

I think the Google Cloud Speech API did an amazing job, getting over 95% of the content right, especially considering that this was not a professional recording and that you can hear my kid saying something in the background 🙂

Transcribing a Radio Broadcast with a Few Different Voices

A reader sent me the following audio file, recorded from the Toucher & Rich morning show on 95.5 Sports Hub radio (broadcast on January 26th, 2018). This, too, turned out better than I expected.

00:00:00 announced that there was going to be a new XXX FL it was going to start in two years and here’s what he had to say that you accept kickoff in 2020 quite frankly we’re going to give the game of football back to fans I’m sure everyone has a lot of questions for me but I also have a lot of questions for you in fact we’re going to ask a lot of questions and listen to players coaches
00:00:30 call experts technology executive members of the media and anyone else who understands and loves the game of football but most importantly we’re going to be listening to someone ask that the will the question of what would you do if you can reimagine the game of professional football would you frenchtons eliminate halftime would you have if you were commercial breaks but the game of foot
00:01:00 I’ll be faster when the rules be simpler can you ask Chef elevated fan Centric with all the things you like to see in the last of the things you don’t and no doubt a lot of Innovations along the way we will put you at a shorter faster-paced family-friendly and easier to understand game don’t get me wrong it’s still football but it’s professional football reimagined Sims 4 launching a 20
00:01:30 hey we have two years which is plenty of time to really get it right so aside from family friendly which I just think means that you have to stand for the national anthem I have no idea because the other one was very sex. That’s why is it either it was the cheerleaders with the super tight outfits and stuff cheerleaders were dressed and I stripped it sounds like a very good idea sounds like he has he has no plan no he does he’s taking everything he does have
00:02:00 and it said all the teams are going to be owned by the same entity he knows that they’re starting with a team and that they’re going to be shorter games with maybe no halftime with inferior Talent no not necessarily interior Town there’s already a saturation of football as is that is the biggest thing that people been complaining about the game what is he thinking you know what he said you ate yesterday you said we’re going to make it short and then we want your ideas no gimmicks all the things that God was just playing around
00:02:30 this does feel like a guy who’s had enormous prefer

Transcribing a Speech by Winston Churchill

I wanted to challenge the script further, so I decided to run it on a famous speech by Winston Churchill, titled The Threat of Nazi Germany.

Here is the audio file:

Which resulted in the following transcript:

00:00:00 many people think that the best way to escape War if the dwelling and then print them DVD for the younger generation they plump the grizzly photographs Before Their Eyes they feel that they dilate of generals and admirals they do not fit the crime I didn’t think they’d father
00:00:30 human strife how old is teaching in preventing us from attacking or invading any other country with the do so how would it help if we were attacked or invaded on stove that is a question we have to ask what did they does contempt of the Lord Beaverbrook
00:01:00 I’ll listen to the impassioned the field by George would they agree to meet that famous South African general identity I have bone responsibilities for the safety of this country in grievance time
00:01:30 we could convince and persuade them to go back play my play it seems to me you are rich we are what we are hungry it would be in Victoria’s we have been defeated you have valuable, we have not you have your name you have had the phone
00:02:00 set up pencil future about all I see are they would say you are weak and we are strong after all my friend your nephew all the way by that railing for nation of nearly 70 million the most educated industrial scientific discipline people in the world loving cup from childhood
00:02:30 all Epic Gloria Texas iron and death in battle at the noblest face for men yeah I need the nation we could have been done in order to augment its Collective Strength yeah definition of a group of preaching a gospel of intolerance and unrestrained by the wall by Parliament
00:03:00 public opinion in that country all packages speeches or morbid Wahlberg off of getting off the press I’m down you cable of Columbus they have a meeting dial shalt not kill it is the plenty of photos and or both now
00:03:30 play Ariana me with the upload speed I’m ready to that end lamentable weapon Javier against which all Navy is no defense and before which women and children so weak and frail capacity of the warriors on the front-line trenches all live equal adding partial patio
00:04:00 play with you but with the new weapon, new method of compelling the submission of racing bike terrorizing and torturing population and worst of all the more
00:04:30 the ball in cricket the structure of its social and economic life some more of those who may make it there praying love you too fat Grim despicable fact and invasive affect ionic again what are we to do

The result is an order of magnitude worse than my wife’s recording. Most likely it is caused by poor audio quality. In addition, Churchill used a lot of words that are no longer commonly used.

If you are still reading, let’s get started.

1. Sign Up for a Free Tier Account

Google Cloud offers a Free Tier plan, which will be used in this tutorial. An account is required to get an API key.

2. Generate an API Key

Follow these steps to generate an API key:

Sign in to Google Cloud Console
Click “APIs & Services”
Click “Credentials”
Click “Create Credentials”
Select “Service Account Key”
Under “Service Account” select “New service account”
Name the service account (whatever you’d like)
Select Role: “Project” -> “Owner”
Leave “JSON” option selected
Click “Create”
Save generated API key file
Rename file to api-key.json

If you plan to test this code, make sure to move the key file into your cloned copy of the speech-to-text repo.
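
Before moving on, it can help to confirm that the downloaded file really is a service account key. The following is a minimal sketch, not part of the original repo: the function name check_service_account_key is hypothetical, and the list of fields is an assumption based on what a Google service account JSON key normally contains.

```python
import json

# Fields that a Google service account key file normally contains (assumed list).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(path="api-key.json"):
    """Return a list of problems found in the key file (empty list means it looks fine)."""
    with open(path) as f:
        key = json.load(f)

    # Report every expected field that is absent from the JSON document
    problems = ["missing field: " + name for name in REQUIRED_FIELDS if name not in key]

    # A service account key identifies itself through its "type" field
    if key.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    return problems
```

Calling check_service_account_key("api-key.json") from the repo folder should return an empty list for a valid key. This only checks the shape of the file; it does not verify that the key is accepted by Google Cloud.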

3. Convert Audio File to Wav format

I ran into issues when trying to convert my audio file with command-line tools. Instead, I used Audacity (an open source audio editing tool) to convert my file to WAV format. Audacity is great and I highly recommend it.

The steps to convert:

Open file in Audacity
Click “File” menu
Click “Save other”
Click “Export as Wav”
Export it with the default settings
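
To confirm that the export produced a readable WAV file, a quick check with Python's built-in wave module might look like the sketch below. The helper name describe_wav is hypothetical, and the sketch assumes the file is uncompressed PCM, which is the only kind the wave module reads.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the number of frames divided by frames per second
            "duration_seconds": frames / float(rate),
        }
```

For example, describe_wav("source/genevieve.wav") would report the length of the recording, which is useful for estimating how many chunks the next step will produce.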

4. Break up audio file into smaller parts

Google Cloud Speech API only accepts files no longer than 60 seconds. To be on the safe side, I broke my files into 30-second chunks. To do that, I used an open source command-line tool called ffmpeg. It can be downloaded from its site. On Mac, I installed it with Homebrew via brew install ffmpeg.

Here is the command I used to break up my file:

# Clean out old parts if needed via rm -rf parts/*
ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav

Here, source/genevieve.wav is the name of the input file, and parts/out%09d.wav is the format for the output files. %09d means that the file number is zero-padded to nine digits (e.g. out000000001.wav), so that files sort alphabetically in the same order as they sort numerically. This way the ls command returns the files in the right order.
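
The effect of the zero padding can be illustrated with a short Python sketch. The helper chunk_names is hypothetical; it assumes ffmpeg numbers segments starting from 0 (its default) and that the audio splits into exact 30-second pieces.

```python
import math


def chunk_names(duration_seconds, segment_seconds=30, pattern="out%09d.wav"):
    """List the file names the segment step would be expected to produce."""
    count = math.ceil(duration_seconds / segment_seconds)
    # Segment numbering is assumed to start at 0
    return [pattern % i for i in range(count)]


padded = chunk_names(400)                         # 14 chunks, zero-padded names
unpadded = chunk_names(400, pattern="out%d.wav")  # same chunks without padding

# Padded names keep their numeric order when sorted as strings
print(sorted(padded) == padded)
# Unpadded names do not: "out10.wav" sorts before "out2.wav"
print(sorted(unpadded) == unpadded)
```

This matters because both scripts below rely on sorted(os.listdir('parts/')) to process the chunks in order.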

5. Install required Python modules

I added a requirements.txt file to the example repo listing all the needed libraries. They can all be installed via:

pip3 install -r requirements.txt

The real hero on this list is SpeechRecognition. It does most of the heavy lifting.

The rest of the libraries come with the official google-api-python-client package.

I also used the tqdm module to show progress in the slower version of the script.
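
After installing, one way to check that the modules are visible to your Python interpreter is sketched below. The helper missing_modules and the TUTORIAL_MODULES list are hypothetical; the list uses import names rather than pip package names, and it is an assumption about what the scripts need rather than a copy of requirements.txt.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names (not pip package names) assumed to be needed by the scripts.
# googleapiclient is the import name of google-api-python-client.
TUTORIAL_MODULES = ["speech_recognition", "tqdm", "googleapiclient"]

print(missing_modules(TUTORIAL_MODULES))
```

An empty list means everything was found. Any name printed here points to a package that still needs to be installed.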

6. Running the Code

Finally, we can run the Python script to get the transcript, for example with python3 fast.py.

The slow version

Here is the GitHub link.

This script:

Loads API key from step 2 in memory
Gets a list of files (chunks)
For every file, calls speech to text API endpoint
Adds results to a list
Combines all results and adds a timestamp (every 30 seconds)
Saves results to transcript.txt

import os
import speech_recognition as sr
from tqdm import tqdm

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

all_text = []

for f in tqdm(files):
    name = "parts/" + f
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text)

transcript = ""
for i, t in enumerate(all_text):
    total_seconds = i * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)

The code works, but it does take a while on longer source files.
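
The timestamp arithmetic in the script above can be isolated into a small helper, which makes it easier to see how a chunk index maps to a position in the recording. This is a sketch that mirrors the script's logic; the names format_timestamp and build_transcript are hypothetical and do not appear in the repo.

```python
def format_timestamp(chunk_index, chunk_seconds=30):
    """Convert a chunk index into an hh:mm:ss offset from the start of the audio."""
    total_seconds = chunk_index * chunk_seconds
    m, s = divmod(total_seconds, 60)  # split into whole minutes and leftover seconds
    h, m = divmod(m, 60)              # split minutes into whole hours and leftover minutes
    return "{:0>2d}:{:0>2d}:{:0>2d}".format(h, m, s)


def build_transcript(texts, chunk_seconds=30):
    """Prefix each chunk's text with its timestamp, one chunk per line."""
    return "".join(
        "{} {}\n".format(format_timestamp(i, chunk_seconds), text)
        for i, text in enumerate(texts)
    )


print(format_timestamp(3))    # 00:01:30
print(format_timestamp(121))  # 01:00:30
```

If you change -segment_time in the ffmpeg command, chunk_seconds has to change with it, otherwise the timestamps drift away from the audio.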

Faster version

To speed things up, I added threading to my slow version. I describe the method in detail in my Simple Python Threading Example post.

Here is the GitHub link.

The main difference is that I moved the processing into a function and added logic at the end to sort the processed results into the right order.

import os
import speech_recognition as sr
from multiprocessing.dummy import Pool  # Thread-based Pool with the multiprocessing API

pool = Pool(8)  # Number of concurrent threads

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

def transcribe(data):
    idx, file = data
    name = "parts/" + file
    print(name + " started")
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    print(name + " done")
    return {
        "idx": idx,
        "text": text
    }

all_text = pool.map(transcribe, enumerate(files))
pool.close()
pool.join()

transcript = ""
for t in sorted(all_text, key=lambda x: x['idx']):
    total_seconds = t['idx'] * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'])

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)
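
The threading pattern can be tried out without calling the API at all. The sketch below swaps the real transcription for a hypothetical stand-in, fake_transcribe, that sleeps so that earlier chunks finish later. It shows that Pool.map hands results back in input order regardless of which thread finishes first, so the sort by idx in the script acts as a safeguard rather than a requirement.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool


def fake_transcribe(data):
    """Stand-in for the real API call: earlier chunks deliberately take longer."""
    idx, name = data
    time.sleep(0.05 * (3 - idx))
    return {"idx": idx, "text": "text for " + name}


def run(files, workers=3):
    """Process all files concurrently and return results in input order."""
    pool = Pool(workers)
    try:
        return pool.map(fake_transcribe, enumerate(files))
    finally:
        # Always release the worker threads, even if a task raised an error
        pool.close()
        pool.join()


results = run(["a.wav", "b.wav", "c.wav"])
print([r["idx"] for r in results])  # [0, 1, 2]
```

The number of workers is a trade-off: more threads finish sooner, but the API may start rejecting requests if too many arrive at once.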

Conclusion

Results may vary, but there is utility even in poor transcriptions. For example, I had an hour-and-a-half audio recording of a hand-over meeting with a former co-worker. I remembered that he had mentioned something at some point, but I was dreading listening through the whole 1.5-hour file to find it. I ran the recording through this script and was able to quickly find the keywords I needed, and the timestamps pointed me to the right part of the audio file.
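
A keyword search like the one described above can be sketched in a few lines. The helper find_keyword is hypothetical; it assumes the transcript.txt layout produced by the scripts, where each line starts with a timestamp followed by a space.

```python
def find_keyword(transcript_text, keyword):
    """Return (timestamp, text) pairs for transcript lines containing the keyword."""
    matches = []
    needle = keyword.lower()
    for line in transcript_text.splitlines():
        # Everything before the first space is the hh:mm:ss timestamp
        timestamp, _, text = line.partition(" ")
        if needle in text.lower():
            matches.append((timestamp, text))
    return matches


sample = (
    "00:00:00 hello world\n"
    "00:00:30 the Deploy script lives here\n"
    "00:01:00 goodbye\n"
)
print(find_keyword(sample, "deploy"))
```

Because each timestamp covers a 30-second chunk, a match narrows the search to roughly half a minute of audio around that point.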

For native English speakers like my wife, the Google Cloud Speech API can easily replace a professional transcription service at a fraction of the cost.
