Techie Tinkering: Audio meddling, transcription, summarisation.

Stronger success building a funky header image this time...

Whenever I have interviews with people, I always end up saying "I'm always tinkering with tech things in my spare time" - I thought it would be cool, on occasion, to share some details of the things I tinker with.

This project seemed like a good one to start with because, bluntly, I'm hip now. Scripting in Python? OpenAI APIs? Pheasant rearing? Get me to Shoreditch already.

Like all good stories, this story starts back in the 80s, when my extended family decided (in a very forward thinking way, I think) to record some of their conversations onto C90 cassette tapes. Some of the conversations are almost interviews, while others are just capturing normal day-to-day chatter.

My dad's recently taken possession of these tapes and has been dutifully spending his time digitising them, so they can be preserved for posterity...and listened to. I mean, who has a walkman these days?

Having done this, he's keen on having the tapes transcribed, so that one could scan through the content and see if there was anything interesting. That's where I come in. We found a few paid-for services to do this, but could I bodge something together for free that would do the job well?

Keep reading for the dramatic conclusion.

Initial impressions

So, I got me an example MP3 file to play around with. 45 minutes long, around 45MB in size. Let's have a listen.

First thought: This might be tricky.

Firstly, there's quite a lot of noise. A combination of tape hiss / white noise, and background noise from the household ("is that a washing machine in the background?").

Secondly, this isn't a studio recording, by a long shot. You can almost imagine the old cassette deck plonked in the middle of the table and people chattering away around it. Some voices are relatively close and clear, others are distant and indistinct.

Thirdly, my sample track focuses on my great-grandfather, who was an old man at the time. He's strongly accented, a little mumbly, uses a somewhat dated vernacular and is a bit scattered in the way he talks sometimes.

So, honestly, I went into this thing with "moderated expectations". But we'll see what we can do.

Noise reduction

As above, one of the purposes of this endeavour is that people might actually listen to these tapes...and with the high level of hiss on the track, it wasn't a joy to listen to...So I used Audacity to reduce that base noise.


Here's the noisy audio...Note the constant...noise.

Ah, the sweet sound of silence.

This worked really well, and delivered a near instant uplift in listening joy. Sorted.

Testing - will this ever work?

Being all about the "fail fast" mindset common amongst us hipsters, I decided to tackle uncertainty head on. Would any technical solution be able to make head or tail of the distant, accented muttering? Only one way to find out.

A few googles in, I came across OpenAI's Whisper API and decided to give this a whirl. Good documentation made it pretty easy to get going - and I already had an account anyway from tinkering with other things previously. There were a few steps to follow to make sure my API key was accessible, but all very easy stuff.

One limitation is that file size has to be < 25MB. So I used Audacity to chop my sample down to a size that would work.
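For the curious, the core of the call is only a few lines with the current openai Python library. Here's a sketch - the size check and the error handling are my own additions for illustration, not lifted from my actual script:

```python
import os

MAX_BYTES = 25 * 1024 * 1024  # Whisper's upload limit


def needs_chunking(size_bytes: int) -> bool:
    """True if an audio file is over Whisper's 25MB upload limit."""
    return size_bytes > MAX_BYTES


def transcribe(path: str) -> str:
    """Send one audio file to Whisper, return the transcript as plain text."""
    import openai  # lazy import: only needed when actually calling the API
    if needs_chunking(os.path.getsize(path)):
        raise ValueError(f"{path} is over 25MB - chop it up first")
    with open(path, "rb") as audio_file:
        response = openai.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return response.text
```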

And gave it a whirl... the results, immediately, were pretty good.
He was out in these great big meadows, and he was rearing all these hundreds and hundreds of young baby pheasants. Yeah. And then, we had incubators in the house, he used to put all these eggs in layers and layers till they hatched

There are obviously some errors, but you generally get what's being said.
Then from that, if they'd get broody ends, all encroached on rows and rows and rows, and go and put these chickens under these pheasants.

(I'm pretty sure the pheasants were put under broody hens)

There was also another quirk. The transcript just comes back as a continuous string of text. So you get questions and answers all muddled together:
Oh, yeah, I used to. How old were you when you were doing that? Oh, about 12 and a half, 13. Gosh. And the other brother used to do that as well, I suppose. Yeah, there was a... Helping your father, I suppose. Yeah. Gosh.

These foibles notwithstanding, it felt like the output was plenty good enough for the purposes of scanning through and understanding roughly what was being talked about.

So, I decided that yes, this approach would work. And set about making it a bit more robust.

Chunking audio

Next up, I decided to split the audio into 20-minute chunks programmatically, so that we could repeat this process on a whole bunch of files and not need to prattle around in Audacity so much.

This, again, looked simple. Everyone says you just need to use this pydub library, couple of lines of code, and you're golden.

Except, of course, it's the simple things that always take the longest... I think the culprit was working with Python on Windows (or maybe VSCode specifically), but it took proper effort to get this thing to work. Many reddit threads later, several reboots and a lot of different permutations, I found the line that fixed it:

pydub.AudioSegment.ffmpeg = r"C:\<path>\ffmpeg.exe"

This location was already in PATH, I'd copied this executable local to the script, I'd already given the path to that file...but only explicitly pointing at the executable itself would make the thing work. But, once it worked, it was easy. Couple of lines of code, golden!
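And for reference, here's roughly what those couple of lines look like. This is a sketch with illustrative filenames - I've split the boundary arithmetic into its own function so you can see what the pydub slicing is actually doing:

```python
CHUNK_MS = 20 * 60 * 1000  # 20-minute chunks, in milliseconds


def chunk_bounds(total_ms: int, chunk_ms: int = CHUNK_MS):
    """Return (start, end) millisecond pairs covering the whole recording."""
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def split_mp3(path: str) -> None:
    """Slice one MP3 into chunk files alongside the original."""
    from pydub import AudioSegment  # lazy import; needs ffmpeg on the machine
    AudioSegment.ffmpeg = r"C:\<path>\ffmpeg.exe"  # the magic line from above
    audio = AudioSegment.from_mp3(path)
    for i, (start, end) in enumerate(chunk_bounds(len(audio))):
        audio[start:end].export(f"{path}_chunk{i}.mp3", format="mp3")
```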

And with that, we had the start of our little pipeline put together:

#The orange file's connected to the....orange file.#

Summarisation

We had gotten to a place where every 20 minutes of audio now had a transcription output file associated with it...a great big blob of text. It worked, but I wondered if it really met the brief of enabling someone to "scan through the content and see if there's anything interesting".

I thought it would be better to present a summary of what was discussed in each section...and thankfully, OpenAI have an API for that too:

summarization_prompt = f"Summarize the following text: '{input_text}'."

# Request the summarization using ChatGPT
summarization_response = openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": summarization_prompt}]
)

But would this work? With all of the conversation, random words etc. Would it be able to meaningfully summarise the discussion? Turns out...yeah. Quite well.

The text is a conversation containing memories of family members and their personalities, childhood experiences, and past jobs. It includes details about family members, house locations, experiences with unemployment, and encounters with government officials. The conversation also touches on topics such as hunting, shooting animals, and medical treatments for tuberculosis.

I found this, probably, more impressive than the transcription itself. It was able to ignore all the foibles in the text that bother me, a human, and just summarise what was said. It also understood a lot of context and would summarise things in language I'm pretty sure my great-grandfather didn't use in the original text:

The speaker reflects on the impact of British imperialism...

So we'll add summarisation to our little script:
I've not bothered to find out how many API calls I get for free. As such, I've elected to keep outputting files at each step. Then, if I want to re-run the summary for a given transcription I can do that without having to re-transcribe.
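That "output files at each step" idea boils down to one little check before each API call. A sketch (the function name is my own invention):

```python
import os


def should_generate(output_path: str, regenerate: bool = False) -> bool:
    """Only spend an API call if the output file doesn't already exist."""
    return regenerate or not os.path.exists(output_path)


# e.g. only summarise chunks whose summary file is missing:
# if should_generate("tape1_chunk0_summary.txt"):
#     ...call the summarisation API and write the file...
```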

Lists of people  

It occurred to me that if these tapes are preserved for posterity, the people listening to them are likely to have vested interest in trying to find information that relates to specific people. I'm quite interested in things that relate to my grandparents, less so in the wider family. So I wondered if there's an opportunity to just have ChatGPT list out the people that are discussed in each chunk. Yep:
name_prompt = f"List of all the names mentioned within the following text: '{input_text}'."

This worked...but not perfectly. Like all good computers, it did exactly what it was told to do. It output all the names it found, including "9. New Zealand, 10. Australia" and, most harrowingly of all... "1. Fortnite".

Not quite what I was after, so a minor tweak:

name_prompt = f"List of all the names of people mentioned within the following text: '{input_text}'."

That worked better. It can stay:
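One practical note: the reply comes back as a numbered list ("1. Alice", "2. Bob"...), so a bit of parsing turns it into an actual Python list. A sketch that assumes that format, which ChatGPT doesn't actually guarantee:

```python
import re


def parse_name_list(response_text: str) -> list[str]:
    """Turn a '1. Alice' / '2. Bob' style reply into ['Alice', 'Bob']."""
    names = []
    for line in response_text.splitlines():
        match = re.match(r"\s*\d+\.\s*(.+)", line)
        if match:
            names.append(match.group(1).strip())
    return names
```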

We need to work on putting in some more loops so these diagrams take up less vertical real estate in the post.

De-chunking

Now that everything was working well, I figured we should start hardening the solution to make it ready to use on the remainder of the tapes.

The first thing I did, which is so obvious in retrospect, was to change the maximum chunk size from 20 minutes to 15 minutes.

Each side of a C90 is 45 minutes long...so the 20 minute limit yielded two 20 minute blocks and one five minute block. I figured that having three 15 minute blocks had a couple of advantages:

  • It would smooth progress through the work. The transcription process, in particular, takes a few seconds to run through and it's a little black-box while it's happening. So having things starting and completing at relatively regular intervals is a bit more reassuring.
  • I thought I'd stitch the chunk summaries back together again to provide a summary of the whole audio file. Hopefully three equally sized chunks lead to three summaries that cover around a third of the material each, and thus a better overall summary.

I then created new files to combine all the transcripts, and summaries, together:

Seriously, this stuff's pretty easy. Think this diagram is longer than the code now.
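The combining step really is that short. Something like this (a sketch - the part markers are my own flourish):

```python
def combine_chunks(chunk_texts: list[str], label: str) -> str:
    """Stitch per-chunk outputs back into one document, keeping chunk markers."""
    sections = [f"--- {label} part {i + 1} ---\n{text.strip()}"
                for i, text in enumerate(chunk_texts)]
    return "\n\n".join(sections)
```

Keeping the part markers means you can still tell which third of the tape a summary came from.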

The result was just as I hoped. A combined summary that covers all of the content, split into three chunks. Just about the right level of detail. If something piques your interest you can search the combined transcript to find out more.

I thought about combining the lists of people mentioned into a single list per audio file, to prevent duplication. But, actually, knowing which third of the recording a person is mentioned in is still quite useful...so let's not stress about that.

Final Loop

So, all that's needed now is to do this at scale, iterating through all the MP3 files in a given folder and doing the same steps:



I'm not a fan of black-box programs. But I'm quite fond of a grey box.
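In code, the outer loop is just pathlib over a folder. A sketch - process_file stands in for the chunk/transcribe/summarise/names pipeline above, it's not a real function from my script:

```python
from pathlib import Path


def find_mp3s(folder: str) -> list[Path]:
    """All MP3 files in a folder, in a stable order (extension case-insensitive)."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() == ".mp3")


def run_pipeline(folder: str) -> None:
    for mp3 in find_mp3s(folder):
        print(f"Processing {mp3.name}...")
        # process_file(mp3)  # chunk -> transcribe -> summarise -> list names
```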

And the output is just as I expected:

Can you keep a secret? For testing purposes I just copied the same file and gave it different names. Hence the original file sizes being identical. Isn't it fun that the transcription size varies, though? I thought output would be constant!

I don't know, should I just stop producing the chunk files at all? I'm torn. I'm still interested in being able to mess around regenerating things on a per-chunk basis. And let's face it, it's pretty quick to delete all files with "chunk" in them if they annoy me. So let's just move on.

A glitch!

I thought everything was going swimmingly...then, on one file, an error.

My script was unable to write character \u0105 to file.

What's 0105? Oh... it's "ą".

I hadn't anticipated needing to write extended characters to my file...but changing the encoding is easy enough...so no biggy...
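For anyone hitting the same error: the fix is one extra argument when opening the file, because Python's default text encoding on Windows is typically a legacy codepage (cp1252) that can't represent "ą". A sketch:

```python
def write_transcript(path: str, text: str) -> None:
    """Write transcript text as UTF-8 so characters like 'ą' don't crash it."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```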

BUT - What would my great-grandfather have said that included "ą"? Someone's name? A place? I had to know...

I'm trying to lose a stone now, don't I? Are you? I'm trying to lose a stone. I will lose a stone. Now you're in the związook

"Now you're in the związook" - now that's a turn of phrase I'd not come across before!

Quick, to google!

Turns out "związook" isn't a word. Literally not a single google result. (Hey! I wonder if this blog post will come up if I search for it later?... I feel another post coming on!)

So, this leads to one conclusion: Whisper heard something it didn't understand. And it thought to itself "if I had to guess what that person said, I'd guess they said 'związook' and it would be spelt with a 'ą'".

Further confirmation, if more was needed, that these AI models can do some weird things sometimes. Please don't trust ChatGPT to write your dissertation for you.

Final impressions

This whole thing has actually worked really well. From having very low confidence when I first heard a recording, I've actually ended up with a fairly good solution. We have detailed transcripts which, whilst not perfect, are probably "good enough" - and would certainly be a good head start if you wanted to transcribe by hand later.

And the summaries work really well - I've already started reading things in the summaries and going off to learn more about that particular topic in the transcript or audio...so the system works.

The whole thing was strung together pretty quickly. Good fun.

I'm sure there are ways I could lean on OpenAI to reduce the number of calls, or enhance the output further. But I'm happy that my solution is good enough for now. 

Goodbye!


As always, massive thanks to anyone that's bothered reading this far. You are the real ones, as we hipsters say!

PS

Alas, it turns out the free tier does, indeed, have a ceiling. Once I'd run out, it became apparent that your trial account is pre-loaded with $5 of credit and once you burn it down you have to add more. So I had to add another $5 to complete the job.

Gave me a chance to look at the pretty graphs they show you of your usage, though...and to understand the respective costs. Turns out - transcription = expensive. Summarisation = cheap.


Now my points of data make a beautiful line...