The V3 Chain and The Few Decisions I Refuse to Wing
Not long ago, I wrote a piece describing my older process—one that was ingenious, thought-through, bullet-proof, and, it turns out, flawed. It’s not that the old process was bad. It’s just that it turned out to be less efficient than I’d thought, and it also turned out I was looking at things backwards. In the past, I recorded my audio, did some processing, edited it, then continued the processing. I had bought iZotope RX11, which performs near-miracles in making poor audio usable.
And the problem with that? It meant I’d accept poor audio and then try to fix it.
But having lived through and discarded a lengthy editing process, I now realise that repairing poor audio doesn’t give as good a result as polishing good audio. Who’d have thought?
How did I get here?
I was tidying up my audio, removing mouth clicks (those smacking sounds your mouth can make when it opens and closes), and after a while I noticed something odd. Some of my consonants were disappearing as well. I dialled back the strength of the tool and found a setting that removed some clicks and no consonants. Happy, I moved on.
Then I noticed something else odd. I was removing breath noise, and some of my consonants were disappearing. Again. I dialled back the settings, tweaking them this way and that, but no matter what I did, I couldn’t save the consonants whilst removing the breath noise. In fact, the tool was identifying and cleaning my consonants before even touching the breath.
That’s what made me stop, pause, and reflect.
That’s when I realised I had this all wrong.
That’s when I decided to make this series—to push myself to understand properly what I’m doing.
Where am I?
I’ve spent several hours learning, testing, and learning some more. I like learning. After that time, I’m now in a place where I believe, from a position of understanding, these three things:
1. Audio should be recorded as close to the desired output as possible.
2. Editing should be done on that high-quality audio.
3. Processing should be kept to a minimum—ideally, with only loudness adjusted to meet specifications.
That means, ideally, no RX11 De-click, Voice De-noise, Mouth De-click, De-ess, or even EQ (though that’s probably going a bit too far). In practice, I’m not quite there, but I’m surprisingly close: only a gentle bit of denoising and Mouth De-click early on and a subtle EQ adjustment before Loudness Control.
More detail on the chain
It all now starts with me trying to get the best audio I can out of the microphone—which really means getting as close to the final product as I can. I’ll go into that properly in the next couple of segments, because this turns out to be the real lever. For now, all that’s relevant is that I record using Audio Hijack, which is simply a rock-solid app that saves an uncompressed 32-bit, 44.1 kHz audio file.
Oh, and I do have one confession.
Going against best practice, as if that’s a solid thing, I put my trust in my microphone—the Shure MV7+—and I enable its internal De-noise functionality. Nothing else, but it does mean the audio I record is technically called “dry” rather than “raw”, because it has been processed. However, I have really tested this, and I cannot find any problems with its de-noising: no artefacts in the audio—and it reduces background noise. Purists would say I should turn it off and use RX11 instead, so that I hold onto a true raw file, but in blind tests, the MV7+ seemed to do a better job than RX11. I’m keeping it on.
At this point, I run a gentle Mouth De-click using RX11, though I keep the original file as well in case I notice something I want to recover. It hasn’t happened yet. I actually do this as a plug-in to Audio Hijack as it records, so it doesn’t cost me any time.
I then import the file into Logic Pro, with each chapter as a separate track. Logic has a great function that removes silences, so I choose quite conservative settings to get rid of the long pauses I sometimes have when a train goes past, or when I stop to read ahead. The settings look for places where the volume stays below -40 dB for at least 1.8 seconds, and then trim that near-silence down to just 1.8 seconds. It doesn’t do much, but it saves me some time.
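Logic’s strip-silence feature does all of this for me, but the rule itself is simple enough to sketch. Here’s a minimal, hypothetical Python version of the same idea—find any stretch quieter than the threshold that lasts longer than the limit, and shorten it to exactly that limit. The function name and NumPy approach are my own illustration, not anything Logic exposes:

```python
import numpy as np

def trim_long_silences(samples, sample_rate=44100,
                       threshold_db=-40.0, max_pause_s=1.8):
    """Shorten any stretch quieter than threshold_db that lasts
    longer than max_pause_s down to exactly max_pause_s."""
    threshold = 10 ** (threshold_db / 20.0)   # dBFS -> linear amplitude
    max_pause = int(max_pause_s * sample_rate)

    quiet = np.abs(samples) < threshold
    # Pad with False so every quiet run has a clear start and end,
    # then find the indices where the mask flips.
    padded = np.concatenate(([False], quiet, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))
    starts, ends = edges[0::2], edges[1::2]

    pieces, cursor = [], 0
    for s, e in zip(starts, ends):
        if e - s > max_pause:                 # only long pauses get trimmed
            pieces.append(samples[cursor:s + max_pause])
            cursor = e                        # skip the remainder of the pause
    pieces.append(samples[cursor:])
    return np.concatenate(pieces)
```

Note the conservative part: pauses shorter than 1.8 seconds pass through untouched, which is why the tool removes the train-going-past gaps without flattening natural pacing.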
Now we’re onto the editing stage, which is lengthy. I listen through on the computer with my good headphones, and do just a few things, ideally:
1. Remove fluffs—where I said a line incorrectly, paused, and repeated it.
2. Shorten gaps between sentences or paragraphs, which could be due to me pausing to breathe or read ahead. Many of these will now be that full 1.8 seconds long, which is always too long.
3. Adjust gaps between words and sentences. Especially as it is me speaking, I know the cadence of the text, and so I know when the next word is meant to land. If it noticeably misfires, I go in and make tiny edits to move the word forwards or backwards. It’s almost always forwards.
4. Reduce the volume of objectionable breathing or knocks. I do this in Logic by reducing the gain around the noise, aiming to make it less objectionable—but not aiming to remove it. Humans do breathe, after all.
5. Bounce the audio, which means I export it, giving me a rendered version of the edited track.
6. Adjust the loudness using RX11 to bring it to audiobook specifications.
7. Listen through on my iPhone, making note of anything that needs adjusting so I can go back into Logic to fix it.
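On the loudness step above: audiobook retailers publish hard numbers, and the commonly quoted ACX targets are an overall RMS between -23 and -18 dBFS with peaks no higher than -3 dBFS. RX11’s Loudness Control does the actual adjustment for me; this hypothetical sketch only measures, as a sanity check on those two figures (the function name and the dict layout are mine):

```python
import numpy as np

def audiobook_loudness_check(samples):
    """Measure peak and overall RMS level against the commonly quoted
    ACX audiobook targets: RMS between -23 and -18 dBFS, peaks at or
    below -3 dBFS. Measurement only; no adjustment is applied."""
    peak_db = 20 * np.log10(np.max(np.abs(samples)))
    rms_db = 20 * np.log10(np.sqrt(np.mean(samples ** 2)))
    return {
        "peak_db": peak_db,
        "rms_db": rms_db,
        "peak_ok": peak_db <= -3.0,
        "rms_ok": -23.0 <= rms_db <= -18.0,
    }
```

A file that fails the RMS window gets its gain nudged; a file that fails the peak ceiling needs limiting, which is exactly the kind of repair the whole chain is designed to avoid needing.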
Once all that is done, it’s just a question of deciding if the EQ is correct—or if I want to reduce the bass a little, or suppress a particular frequency (500 Hz can be a bit boomy for my voice, and 5,000 Hz a bit harsh).
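The actual cut happens in a parametric EQ in Logic or RX11, but as a rough illustration of what suppressing one frequency means, here is a hypothetical notch filter using SciPy. The function and its parameters are my own sketch, not my plug-in chain; the Q value controls how surgical the cut is:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def tame_frequency(samples, sample_rate=44100, freq=500.0, q=4.0):
    """Apply a notch filter centred on freq. A low Q gives a wide,
    gentle dip; a high Q targets one narrow band. filtfilt runs the
    filter forwards and backwards, so timing isn't smeared."""
    b, a = iirnotch(freq, q, fs=sample_rate)
    return filtfilt(b, a, samples)
```

In practice you’d run it once per problem frequency—say `tame_frequency(audio, 44100, 500.0)` and again at 5000.0—while leaving everything else alone, which is the “subtle EQ adjustment” spirit of the whole chain.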
Where this leaves me
This audio chain leaves me with a minimal number of tools. I know, it sounded quite long, but I have tried to make it minimal. I’ve stripped out de-essing and breath control, and switched to a light touch with other passes so that I can just run them and not worry about losing things. They’re simply not aggressive enough to cause problems, which also means they’re not aggressive enough to do much fixing.
But here’s the point: what that chain really does is remove excuses—it shifts my time from repairing damage to enhancing something that’s already good.
The goal is simple: capture close to the target, and do as little as possible afterwards. And what that means is that if I’m not going to repair tone later, then tone has to be right at capture. And if tone has to be right at capture, then the biggest part of the “processing” isn’t a plugin at all—it’s microphone technique.
And that is the subject for the next couple of segments: how I made mic position boring, repeatable, and reliable enough that I could stop guessing.
If you want to know more, come and ask me over in the Slack community at podfeet.com/slack, where I and all the other lovely NosillaCastaways enjoy friendly, positive online conversations. Feel free to message me, Eddie Tonkoi, if you have any thoughts, questions, or techniques you’re using. It would be nice to share ideas.
You can also find our work at jerntonkoi.com, where you’ll find Jern’s character-driven queer love stories, the audiobooks I produce for them, and bonus material for our subscribers.
I’ll be back soon to talk through some more of my workflow, but for now, happy recording, and happy reading.
