Stronger success building a funky header image this time...
Whenever I have interviews with people, I always end up saying "I'm always tinkering with tech things in my spare time" - I thought it would be cool, on occasion, to share some details of the things I tinker with.
This project seemed like a good one to start with because, bluntly, I'm hip now. Scripting in Python? OpenAI APIs? Pheasant rearing? Get me to Shoreditch already.
Like all good stories, this story starts back in the 80s, when my extended family decided (in a very forward thinking way, I think) to record some of their conversations onto C90 cassette tapes. Some of the conversations are almost interviews, while others are just capturing normal day-to-day chatter.
My dad's recently taken possession of these tapes and has been dutifully spending his time digitising them, so they can be preserved for posterity...and listened to. I mean, who has a Walkman these days?
Having done this, he's keen on having the tapes transcribed, so that one could scan through the content and see if there was anything interesting. That's where I come in. We found a few paid-for services to do this, but could I bodge something together for free that would do the job well?
Keep reading for the dramatic conclusion.
Initial impressions
So, I got me an example MP3 file to play around with. 45 minutes long, around 45MB in size. Let's have a listen.
First thought: This might be tricky.
Firstly, there's quite a lot of noise. A combination of tape hiss / white noise, and background noise from the household ("is that a washing machine in the background?").
Secondly, this isn't a studio recording, by a long shot. You can almost imagine the old cassette deck plonked in the middle of the table and people chattering away around it. Some voices are relatively close and clear, others are distant and indistinct.
Thirdly, my sample track focuses on my great-grandfather, who was an old man at the time. He's strongly accented, a little mumbly, uses a somewhat dated vernacular and is a bit scattered in the way he talks sometimes.
So, honestly, I went into this thing with "moderated expectations". But we'll see what we can do.
Noise reduction
As above, one of the purposes of this endeavour is that people might actually listen to these tapes...and with the high level of hiss on the track, it wasn't a joy to listen to...So I used Audacity to reduce that base noise.
Here's the noisy audio...Note the constant...noise.
Ah, the sweet sound of silence.
This worked really well, and delivered a near instant uplift in listening joy. Sorted.
Testing - will this ever work?
Being all about the "fail fast" mindset common amongst us hipsters, I decided to tackle uncertainty head on. Would any technical solution be able to make head or tail of the distant, accented muttering? Only one way to find out.
A few googles in, I came across OpenAI's Whisper API and decided to give this a whirl. Good documentation made it pretty easy to get going - and I already had an account anyway from tinkering with other things previously. There were a few steps to follow to make sure my API key was accessible, but all very easy stuff.
One limitation is that file size has to be < 25MB. So I used Audacity to chop my sample down to a size that would work.
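For anyone wanting to try the same test, a minimal sketch of the call might look like this. It is not lifted from my script: it assumes the `openai` Python package (version 1 or later), an `OPENAI_API_KEY` environment variable and the `whisper-1` model, and the helper names are made up for illustration. The exact byte count behind "25MB" is also an assumption.

```python
import os

# Assumption: "25MB" is treated as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def is_small_enough(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is under the upload size limit."""
    return os.path.getsize(path) < limit


def transcribe_file(client, path, model="whisper-1"):
    """Send one audio file for transcription and return the text."""
    if not is_small_enough(path):
        raise ValueError(f"{path} is too big to upload; chop it down first")
    with open(path, "rb") as audio_file:
        response = client.audio.transcriptions.create(model=model, file=audio_file)
    return response.text


def make_client():
    """Build an OpenAI client; it reads OPENAI_API_KEY from the environment."""
    from openai import OpenAI  # imported here so the helpers above load without it

    return OpenAI()


# Example use (hypothetical file name):
#   print(transcribe_file(make_client(), "sample.mp3"))
```

Passing the client in as an argument keeps the size check and the upload easy to try out separately from the real API.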
And gave it a whirl... the results, immediately, were pretty good.
He was out in these great big meadows, and he was rearing all these hundreds and hundreds of young baby pheasants. Yeah. And then, we had incubators in the house, he used to put all these eggs in layers and layers till they hatched
There are obviously some errors, but you generally get what's being said.
Then from that, if they'd get broody ends, all encroached on rows and rows and rows, and go and put these chickens under these pheasants.
(I'm pretty sure the pheasants were put under broody hens)
There was also another quirk. The transcript just comes back as a continuous string of text. So you get questions and answers all muddled together:
Oh, yeah, I used to. How old were you when you were doing that? Oh, about 12 and a half, 13. Gosh. And the other brother used to do that as well, I suppose. Yeah, there was a... Helping your father, I suppose. Yeah. Gosh.
These foibles notwithstanding, it felt like the output was plenty good enough for the purposes of scanning through and understanding roughly what was being talked about.
So, I decided that yes, this approach would work. And set about making it a bit more robust.
Chunking audio
Next up, I decided to split the audio into 20-minute chunks programmatically, so that we could repeat this process on a whole bunch of files and not need to prattle around in Audacity so much.
This, again, looked simple. Everyone says you just need to use this pydub library, couple of lines of code, and you're golden.
Except, of course, it's the simple things that always take the longest... I think it's trying to work with Python on Windows (or maybe VSCode specifically), but it took proper effort to get this thing to work. Many Reddit threads later, several reboots and a lot of different permutations, I found the line that fixed it:
This location was already in PATH, I'd copied the executable local to the script, and I'd already given the path to that file...but only explicitly pointing at the executable itself would make the thing work. But, once it worked, it was easy. Couple of lines of code, golden!
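The exact line isn't reproduced here, so treat the following as a sketch of the general idea rather than the fix itself. One common way to point pydub explicitly at an executable is to set `AudioSegment.converter`; the path you pass in, the helper names and the output file naming are all hypothetical. It assumes pydub and ffmpeg are installed.

```python
import os

TWENTY_MINUTES_MS = 20 * 60 * 1000  # pydub measures audio in milliseconds


def chunk_bounds(total_ms, chunk_ms=TWENTY_MINUTES_MS):
    """Return (start, end) millisecond pairs covering the whole recording."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def split_mp3(mp3_path, out_dir, ffmpeg_path=None, chunk_ms=TWENTY_MINUTES_MS):
    """Split one MP3 into chunk files and return the paths written."""
    from pydub import AudioSegment  # needs pydub installed, plus ffmpeg

    if ffmpeg_path:
        # Assumption: point pydub straight at the executable instead of
        # relying on PATH lookups.
        AudioSegment.converter = ffmpeg_path

    audio = AudioSegment.from_mp3(mp3_path)
    stem = os.path.splitext(os.path.basename(mp3_path))[0]
    paths = []
    for i, (start, end) in enumerate(chunk_bounds(len(audio), chunk_ms), start=1):
        out_path = os.path.join(out_dir, f"{stem}_part{i}.mp3")
        audio[start:end].export(out_path, format="mp3")  # slicing is by millisecond
        paths.append(out_path)
    return paths
```

With these defaults a 45-minute recording comes out as three chunks: 20 minutes, 20 minutes and a final 5.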
And with that, we had the start of our little pipeline put together:
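As a rough sketch of what such a pipeline step could look like (not the original script): take the chunk files, run each through a transcription function, and save one text file per chunk. The function name and the `.txt` naming are assumptions, and `transcribe` stands for any callable that takes a file path and returns text.

```python
from pathlib import Path


def transcribe_chunks(chunk_paths, transcribe, out_dir):
    """Transcribe each chunk and save the text under a matching file name.

    `transcribe` is any callable that takes a file path and returns text.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for chunk_path in chunk_paths:
        text = transcribe(chunk_path)
        txt_path = out_dir / (Path(chunk_path).stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Keeping the transcription step as a plain callable means the file handling can be checked without spending any API credit.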
Summarisation
We had gotten to a place where every 20 minutes of audio now had a transcription output file associated with it...a great big blob of text. It worked, but I wondered if it really met the brief of enabling someone to "scan through the content and see if there's anything interesting."
I thought it would be better to present a summary of what was discussed in each section...and thankfully, OpenAI have an API for that too:
import openai

summarization_prompt = f"Summarize the following text: '{input_text}'."
# Request the summarization using ChatGPT
summarization_response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": summarization_prompt}]
)
But would this work? With all of the conversation, random words etc. Would it be able to meaningfully summarise the discussion? Turns out...yeah. Quite well.
The text is a conversation containing memories of family members and their personalities, childhood experiences, and past jobs. It includes details about family members, house locations, experiences with unemployment, and encounters with government officials. The conversation also touches on topics such as hunting, shooting animals, and medical treatments for tuberculosis.
I found this, probably, more impressive than the transcription itself. It was able to ignore all the foibles in the text that bother me, a human, and just summarise what was said. It also understood a lot of context and would summarise things in language I'm pretty sure my great-grandfather didn't use in the original text:
The speaker reflects on the impact of British imperialism...
So we'll add summarisation to our little script:
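One way this could be wrapped up as a reusable step is sketched below. It reuses the prompt and model shown earlier; the function name, and the choice to pass in a client object (for example `OpenAI()` from the `openai` package, version 1 or later), are assumptions for illustration.

```python
def summarise_text(client, input_text, model="gpt-3.5-turbo"):
    """Ask the chat model for a summary and return just the summary text."""
    prompt = f"Summarize the following text: '{input_text}'."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply text lives on the first choice; guard against an empty reply.
    content = response.choices[0].message.content
    return (content or "").strip()
```

The useful part is the last two lines: the API hands back a whole response object, and the summary itself has to be dug out of `choices[0].message.content`.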
Lists of people
It occurred to me that if these tapes are preserved for posterity, the people listening to them are likely to have a vested interest in trying to find information that relates to specific people. I'm quite interested in things that relate to my grandparents, less so in the wider family. So I wondered if there's an opportunity to just have ChatGPT list out the people that are discussed in each chunk. Yep:
name_prompt = f"List of all the names mentioned within the following text: '{input_text}'."
This worked...but not perfectly. Like all good computers, it did exactly what it was told to do. It output all the names it found, including "9. New Zealand, 10. Australia" and, most harrowingly of all... "1. Fortnite".
Not quite what I was after, so a minor tweak.
name_prompt = f"List of all the names of people mentioned within the following text: '{input_text}'."
That worked better. It can stay:
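If the list needs to end up as actual data rather than a block of text, a small helper along these lines could tidy it. This is a sketch only: it assumes the reply comes back with one name per line, optionally numbered or bulleted, and the function names and sample names are invented for illustration.

```python
import re


def build_name_prompt(input_text):
    """Build the tweaked prompt that asks for people specifically."""
    return ("List of all the names of people mentioned within the "
            f"following text: '{input_text}'.")


def parse_name_list(reply):
    """Turn a reply like '1. Alice' / '2. Bob' (one per line) into a list."""
    names = []
    for line in reply.splitlines():
        # Strip a leading "1." / "2)" / "-" / "*" marker if there is one.
        item = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if item:
            names.append(item)
    return names
```

Any chatty preamble line in the reply would be kept as if it were a name, so the output still wants a quick look over by a human.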
The result was just as I hoped. A combined summary that covers all of the content, split into three chunks. Just about the right level of detail. If something piques your interest you can search the combined transcript to find out more.
I thought about combining the lists of people mentioned into a single list per audio file, to prevent duplication. But, actually, knowing which third of the recording a person is mentioned in is still quite useful...so let's not stress about that.
Final Loop
So, all that's needed now is to do this at scale, iterating through all the MP3 files in a given folder and doing the same steps:
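A minimal sketch of that outer loop, assuming all the per-file work (chunk, transcribe, summarise, list people) lives in a single function handed in as `process_file`. The names here are hypothetical, and skipping a failed file rather than stopping is a design choice of this sketch.

```python
from pathlib import Path


def process_folder(folder, process_file):
    """Run `process_file` on every .mp3 in `folder`, in file-name order.

    Returns a dict mapping each file name to whatever `process_file` returned.
    """
    results = {}
    for mp3_path in sorted(Path(folder).glob("*.mp3")):
        try:
            results[mp3_path.name] = process_file(mp3_path)
        except Exception as error:
            # Keep going if one tape fails, and say which one it was.
            print(f"Skipping {mp3_path.name}: {error}")
    return results
```

Carrying on past a failure matters more than it sounds when each file takes minutes to process and costs money to re-run.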