Intro: Rationale and Regex
My wife, Jern Tonkoi, is a discovery writer. She knows what she wants to write, but the characters and the locations don’t always play along, and the result is writing with real emotional depth and creative freedom.
I have the task of editing and producing the novel, and so I am bound by mundane conventions such as spelling, grammar, and rules. The things that are not so much fun.
We are in a lucky position, therefore, where I can take on tasks such as spell-checking and grammar-checking using my technical chops, leaving Jern to focus, unhindered, on her story-telling. We’re playing to our strengths.
In this short series of articles I’d like to describe my process in the hope that you might find some use for it, or at least be interested in how I have tackled this problem.
The Problem to be Solved
It’s always good to cover the problem to be solved early on, especially when you might be thinking that spell-checking and grammar-checking were solved a long, long time ago. Whilst this is true to a large extent, there were a few things I felt were lacking from the various solutions I’d looked at, such as Word, Grammarly, ProWritingAid, LibreOffice, and some others. Not an exhaustive hunt by any means, but enough for me to learn what I want.
- I want it to make suggestions to me, and never, ever make changes itself.
- I want it to analyse the whole manuscript, not as we go along. In other words, Jern has no analysis on her iPad as she writes, I will do all the analysis on my Mac when she passes control over to me.
- I want it to remember my custom words, custom phrases, and things I have chosen to ignore within a single book.
- I want it to be highly performant on a 100,000 word novel.
- I want it to be transferrable across machines, so I can use my desktop or notebook computer.
- I want it to be consistent over time, so that the same rules are applied today as will be applied in 5 years’ time.
- I want it to be free, or at least non-subscription since we don’t want this eating into the challenging sales of indie fiction.
The problems I found with the commercial options ranged from clunky interfaces, to sluggish performance on a Mac, to losing track of what I had chosen to ignore, to struggling to maintain a custom dictionary (let alone on a per-book basis), to freebies pushing their upsells (which I don’t blame them for, I just prefer to avoid that).
I also knew that AI was not going to be my friend here. By demanding consistency and control, I am almost by definition turning away from an opinionated AI that could start hallucinating at the drop of a hat.
With those niggles and a penchant for learning new tools, I decided to make my own checker. And that was the start of ETNA, Eddie Tonkoi’s Narrative Analysis tool.
Engineering Design
I like designing things, and so I looked into what tools might be available to help me out. I already had an idea that some of this might be possible using regex to find problems with punctuation and typographic errors. I’d also looked at using LibreOffice and seen how it can use a free extension for something called LanguageTool to power up its spelling and grammar checker.
So, I had the nugget of an idea and a subscription to ChatGPT. I put it into Deep Research mode and asked it to come up with a plan for turning my Mac into an editor’s leviathan.
To be clear, Jern continues to write on her iPad in Scrivener unencumbered by any of this. In fact, I do not see a line of her book until she has finished it.
Our process is that she writes in Scrivener, with the file held in Dropbox. This also syncs to my Mac, so that my Mac can back up the Scrivener file every hour and every night. Once she finishes writing, I remove the file from Dropbox and place it solely on my Mac (with backups still running, of course), meaning the ownership has transferred to me. I then read, run ETNA, and do my editor things. I then move it back to Dropbox and Jern does her edits.
The important part for today’s tale is that I will be running my spelling and grammar checker on the completed manuscript. What I decided to do was work through the manuscript in a very structured manner, asking:
- Is it ready for release?
- Are there any structural problems?
- Are there any spelling mistakes?
- Is the grammar how we want it to be?
- Have all style questions been resolved?
Script 1: Chunking and Artefacts
The first task is to take Jern’s manuscript, which has been exported as a Word DOCX, and break it into chapters, or chunks if there are no real chapters. This is not strictly necessary, but it can make life easier when looking for patterns. These get stored as plain text documents, making it easy for tools to read them.
This is done using a small Python routine that reads the DOCX file with python-docx, walks through the paragraph styles, and slices the text wherever it finds a heading that looks like a chapter marker. If no such markers exist, it simply divides the manuscript into evenly sized chunks so later scripts can operate consistently. The end result is a folder full of clean .txt files—one per chapter—ready for the next stage of processing. By stripping out formatting and keeping only the raw text, I make life much easier for the analysis tools that follow.
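If you’re curious what that looks like, here is a minimal sketch of the idea, assuming python-docx is installed. The file names, folder name, and the use of heading styles as chapter markers are my illustrative choices; ETNA’s real script is more defensive and also handles the no-chapters case.

# A minimal sketch of the chunking step, assuming python-docx
# (pip install python-docx). Style names and paths are illustrative.
from pathlib import Path
from docx import Document

doc = Document("manuscript.docx")
out_dir = Path("chapters")
out_dir.mkdir(exist_ok=True)

chapters = [[]]  # a list of chapters, each a list of paragraph strings
for para in doc.paragraphs:
    # Start a new chapter whenever we hit a heading-style paragraph.
    if para.style.name.startswith("Heading") and chapters[-1]:
        chapters.append([])
    chapters[-1].append(para.text)

# Write each chapter out as plain text, one file per chapter.
for number, paragraphs in enumerate(chapters, start=1):
    (out_dir / f"chapter_{number:02d}.txt").write_text(
        "\n".join(paragraphs), encoding="utf-8"
    )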
Whilst doing this, I put in a few checks that report on the state of the Word document and tell me whether it is ready for publication. For example, it looks for highlighted, coloured, struck-through, or underlined text, square brackets, and straight quotes, since Jern uses smart quotes throughout.
That one about the square brackets is interesting. Since implementing ETNA, this rule has given Jern and me a great, robust, simple method to communicate. That’s not to say we don’t ever talk in our household, but it does mean that if either of us wants to leave a comment in the manuscript, we just need to put it in square brackets (technically, just one opening or closing bracket is enough), and this script will catch it. This makes it virtually impossible that I’ll ship a book that still has editorial comments in it, and that is priceless.
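For the curious, those checks boil down to walking the formatted runs inside each paragraph. Here is a hedged sketch using python-docx; ETNA’s real script checks more and writes to a report rather than printing.

# A sketch of the readiness checks, assuming python-docx;
# the checks and the snippet lengths are simplifications.
from docx import Document

issues = []
for num, para in enumerate(Document("manuscript.docx").paragraphs, start=1):
    for run in para.runs:
        if run.font.highlight_color is not None:
            issues.append((num, "highlighted text", run.text))
        if run.font.strike or run.font.underline:
            issues.append((num, "strike-through or underline", run.text))
    if "[" in para.text or "]" in para.text:
        issues.append((num, "square bracket (editorial comment?)", para.text[:60]))
    if '"' in para.text or "'" in para.text:
        issues.append((num, "straight quote", para.text[:60]))

for num, kind, snippet in issues:
    print(f"paragraph {num}: {kind}: {snippet!r}")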
At this stage, I have the manuscript in chapters and I know the state of the manuscript. If this is my first look, I’ll pretty much ignore that report as it is going to be throwing up lots of concerns right now; we’re just starting editing, after all.
Script 2: Structure
My next step is to clean up most of the low-hanging fruit of the editing stage. The way Jern writes, she has already gone through multiple drafts, either in her head or in writing, but there will always be things that get through, always. Things like double spaces between words, inconsistent em dash spacing, malformed ellipses, and such.
Everything I described there is a structural error that follows a pattern. Let’s take the double-spaces error as an example.
- Two words should have a single space between them: "word word"
- An error is having two spaces: "word  word"
- Or even four spaces: "word    word"
There is a pattern that is correct, and there are multiple patterns that are incorrect. It’s clear, it’s predictable, and it’s exactly the sort of thing regex excels at.
If you’ve never encountered regex before, think of it as a hyper-efficient pattern engine. Instead of searching for one specific phrase, you describe the shape of the thing you’re looking for—“one or more spaces”, “a word repeated twice”, “an ellipsis used incorrectly”—and let the engine sweep through the entire manuscript looking for any place where that shape appears. Regex isn’t clever in a literary sense; it has no idea what your sentence means. But it is unrivalled when the rule you want to enforce can be expressed as a pattern. And, crucially for ETNA, it delivers identical results every single time, regardless of context or mood.
To make this fast and manageable in Python, I use the built-in re module, which provides all the regex functionality. Most of my patterns are wrapped with re.compile(). That simply means Python parses and optimises the pattern once, then reuses that compiled version throughout the scan. On a 100,000-word manuscript, this makes the analysis effectively instantaneous. It also allows me to store each compiled pattern alongside a human-readable label, so when a report says “EM_DASH_SPACING violation”, I know exactly which rule fired and why. It’s a small piece of engineering discipline that pays off in clarity and maintainability.
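As an illustration, that pattern table might look something like this. The rule names below are mine, and ETNA’s real table is longer; treat this as a sketch of the shape of the thing.

# A sketch of labelled, pre-compiled patterns; rule names are illustrative.
import re

RULES = {
    "DOUBLE_SPACE": re.compile(r"  +"),
    "EM_DASH_SPACING": re.compile(r"\s—\s|\s—|—\s"),
    "REPEATED_WORD": re.compile(r"\b(\w+)\s+\1\b"),
}

def scan(text, source):
    # Report every rule violation with a little surrounding context.
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            start = max(match.start() - 20, 0)
            context = text[start:match.end() + 20].replace("\n", " ")
            print(f"{source}: {label} violation: ...{context}...")

scan("He stopped — briefly — then said the the line.", "chapter_01.txt")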
Here are some of the regex lines I use:
" +" # double-spacing
"\s—\s|\s—|—\s" # em dash spacing.
"\b(\w+)\s+\1\b" # repeated words.
\.\.\. # three-dots (not an ellipsis)
| [.]… # dot before ellipsis
| …[.] # dot after ellipsis
| …{2,} # two or more ellipsis characters
| …\s*…+ # ellipsis separated by optional space(s)
| (?<=\s)…(?=\s) # ellipsis with a space before AND after
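Those ellipsis lines are really alternatives within a single pattern. In Python you can keep the comments right inside the pattern using re.VERBOSE; this is a sketch, not necessarily how ETNA assembles it.

# Combining the ellipsis alternatives into one pattern with re.VERBOSE,
# which ignores whitespace and allows comments inside the pattern itself.
import re

ELLIPSIS_ERRORS = re.compile(r"""
      \.\.\.            # three dots, not a true ellipsis character
    | [.]…              # dot before ellipsis
    | …[.]              # dot after ellipsis
    | …{2,}             # two or more ellipsis characters
    | …\s*…+            # ellipses separated by optional spaces
    | (?<=\s)…(?=\s)    # ellipsis with a space before AND after
""", re.VERBOSE)

print(bool(ELLIPSIS_ERRORS.search("He paused... then spoke.")))  # True
print(bool(ELLIPSIS_ERRORS.search("He paused… then spoke.")))    # False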
The thing about using regex like this is that it is totally reliable, totally repeatable, and exceedingly fast. Scanning the whole manuscript takes a fraction of a second. On the downside, it is inscrutable, or seems that way at first, second, and even twenty-fourth glance. However, AI assistants understand how to create regex strings, and it’s so fast to run that you can iterate quickly to get the regex string that does the job. With time, I’m starting to understand some of the language in regex, but I don’t really need to. Let’s look at a couple of them, though, to see what’s going on.
"  +" (double-spacing)
There are two spaces then a plus sign in this regex pattern. The two spaces are literal, meaning it looks for two actual spaces, not tabs, and the + means “one or more of the preceding character”, so the pattern as a whole matches two or more consecutive spaces. Whenever the manuscript contains multiple spaces between words, whether two, three, or ten, this regex will flag it. It’s a classic example of a structural slip that’s easy to miss by eye but trivial for a pattern engine to detect.
\b(\w+)\s+\1\b (repeated words)
This one is a bit more clever. It captures a word using \w+ and remembers it using parentheses. Then it looks for one or more whitespace characters with \s+, followed by the same word again, written as \1, which refers back to that first capture. The \b markers make sure we’re matching whole words rather than fragments. So if the text says “the the” or “and and”, this regex catches it instantly. But it will not catch “the theme” or “grand and”.
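A quick way to convince yourself of that last claim:

# The repeated-word pattern in action; note the word-boundary behaviour.
import re

repeated = re.compile(r"\b(\w+)\s+\1\b")
print(repeated.search("It was the the best of times"))   # matches "the the"
print(repeated.search("It was the theme of the times"))  # None: \b blocks it
print(repeated.search("A grand and glorious day"))       # None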
My process, then, is to export the manuscript as a DOCX file, run these first two scripts, and go through fixing errors. The scripts are all designed to give me a bit of context in the report, so I just open up the generated report, copy some text showing the error, open the manuscript, and use the Find function to locate the mistake. After I’ve made all the fixes, which I always do manually, I can export as DOCX again and run the scripts to check if I missed anything.
After 2-3 passes, the manuscript will be structurally sound.
You might be wondering how regex knows what the correct use of em dashes is. Well, it doesn’t, because there isn’t a fixed rule for this. However, readers tend to like consistency, so we have decided on House Rules; for example, all em dashes in our manuscripts will be closed, that is, without spaces on either side. Other people, perfectly correctly, put a space on both sides. That’s fine, but we decided differently. And that sums up one of the key strengths of building something like ETNA: by controlling the software, we are free to impose any rules we want, and to stick to them.
Script 3: Spelling
Now that the structure is sound, we move on to the next-lowest-hanging fruit: spelling. Much of this is going to be black-and-white, right-or-wrong, so it’s a satisfying step that moves the manuscript along nicely. It’s also something that readers will really notice, so it is good to do this early and do it often.
The core of this script is Hunspell, used as a base dictionary tool. Hunspell is the open-source spell-checking engine used by LibreOffice, Firefox, Thunderbird, and a host of other tools. It works from large dictionaries plus a set of affix rules that describe how words can be inflected: plural forms, verb endings, possessives, and so on. ETNA calls the Hunspell library directly, passing each tokenised word for a simple yes/no answer: is this an acceptable spelling in the selected language? It doesn’t judge style, and it doesn’t care about context; its job is simply to recognise whether a spelling exists in the language. If it is a correctly spelled word, it is accepted. If it is the wrong spelling for the context, such as “I want a peace of cake”, it is still accepted. Don’t worry, we’ll catch that later, but that narrow focus is exactly what makes Hunspell so robust and predictable.
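Here is a hedged sketch of that yes/no loop, assuming the pyhunspell binding and a British English dictionary; the paths and the tokeniser are illustrative, and ETNA’s version does rather more.

# A sketch of the core spelling pass, assuming pyhunspell
# (pip install hunspell) and system dictionary files; paths are illustrative.
import re
import hunspell

checker = hunspell.HunSpell("/usr/share/hunspell/en_GB.dic",
                            "/usr/share/hunspell/en_GB.aff")

text = open("chapters/chapter_01.txt", encoding="utf-8").read()
text = text.replace("’", "'")  # normalise smart apostrophes for the dictionary

for word in sorted(set(re.findall(r"[A-Za-z']+", text))):
    if not checker.spell(word):
        print(f"unknown spelling: {word}")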
This is really fast, and the only downside is that there are a lot of words it doesn’t know, partly because Jern has made some of them up. To help with this, I have a lovely little system of Custom Dictionaries, and I have set up lots of these.
The first is a Global Dictionary, which contains words I know are fine but Hunspell does not recognise. As an example, “glutes” is not in Hunspell, but we’ll freely use it in any novel, so it is in the Global Dictionary. Other words are okay for a specific novel, but not in every novel, so every book also has its own Book Dictionary, with entries such as “meself” being allowed in Murder in Treggan Bay, since it is set in the Devonshire countryside. I can also put proper nouns that Hunspell doesn’t know in here, like “Bermondsey” or “Dawlish Warren”.
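Loading those can be as simple as teaching the checker each word for the current run. Continuing the sketch above, and assuming one word per line and invented file names:

# Teach the checker our custom words for this run only; add() affects the
# in-memory dictionary, never the dictionary files on disk.
for dict_path in ("dictionaries/global.txt",
                  "dictionaries/murder_in_treggan_bay.txt"):
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                checker.add(word)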
I even have another file that stores flag words, which are words that Hunspell would allow but I want flagged in case they are incorrect for us. This includes entries like “yogurt => yoghurt” and “curb => kerb”. These are not necessarily wrong, since curb is a perfectly good word, but I want it flagged in case I meant the British English edge of a pavement, which is kerb.
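The flag-word file can then drive a simple lookup that runs alongside the Hunspell pass. A sketch, with the file format borrowed from the entries above; ETNA’s real format may differ.

# A sketch of the flag-word check; file name and format are illustrative.
import re

flag_words = {}
with open("dictionaries/flag_words.txt", encoding="utf-8") as f:
    for line in f:
        if "=>" in line:
            suspect, preferred = (part.strip() for part in line.split("=>", 1))
            flag_words[suspect.lower()] = preferred

text = open("chapters/chapter_01.txt", encoding="utf-8").read()
for word in re.findall(r"[A-Za-z']+", text):
    if word.lower() in flag_words:
        print(f"flagged: {word} (house spelling is {flag_words[word.lower()]})")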
After several passes of this, I should get an empty report because the reported items have either been fixed in the manuscript, or added to one of the custom dictionaries.
Scripts 4-6: Spelling Tidy-up
With the major spelling sorted, I now have three quick scripts that often show nothing.
- On occasion, like many people whose first language is an Asian language, Jern can swap L and R in words. One little script picks those out if they happen.
- Eddie and Eddy are both common names, but what if someone’s name drifts, or switches, partway through the novel? This script identifies all proper nouns, those being words that start with a capital letter (ignoring words at the start of a sentence). It then checks these against one another using a fuzzy-match technique; there’s a sketch of this check after the list. The tool, RapidFuzz, can tell that Eddie is similar to Eddy, and that Eddie appeared 5 times and Eddy only once. The report presents this information to me, so I can judge if this is intentional or not. Again, let me reiterate that I never have ETNA changing the manuscript. All changes are manual, based on the reports provided by ETNA.
- This script looks for a list of compound words I supplied and makes sure I am being consistent. So, I’ve decided that “timeline” should be a single word, without a hyphen or a space. If the script finds “time-line” or “time line” in the text, it gets reported.
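Here is the promised sketch of the name-drift check, assuming RapidFuzz; the proper-noun heuristic and the similarity threshold are deliberate simplifications of what ETNA does.

# A sketch of the name-drift check, assuming RapidFuzz (pip install rapidfuzz).
import re
from collections import Counter
from itertools import combinations
from rapidfuzz import fuzz

text = open("chapters/chapter_01.txt", encoding="utf-8").read()

# Crude proper-noun heuristic: a capitalised word preceded by a lowercase
# letter or a comma plus a space, so it cannot be starting a sentence.
names = Counter(re.findall(r"(?<=[a-z,] )[A-Z][a-z]+", text))

# Compare every pair of names and flag near-matches for a human decision;
# the threshold is set so Eddie vs Eddy (roughly 67) gets reported.
for (name_a, count_a), (name_b, count_b) in combinations(names.items(), 2):
    score = fuzz.ratio(name_a, name_b)
    if score >= 65:
        print(f"possible drift: {name_a} x{count_a} vs {name_b} x{count_b} "
              f"(similarity {score:.0f})")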
Script 7 and beyond
The manuscript is in great shape now, so it is time to move on to grammar. Grammar is a whole beast unto itself, so I will break off now and pick up how I use the fabulous LanguageTool to do the grammar check for me in my next segment.
Recap
The core purpose of ETNA, my analysis tool, is to free up Jern so that she can focus on her writing, on the creative process. We have a system that means she can turn off spelling- and grammar-checking on her iPad. No squiggly red lines, no assistants suggesting other ways to write things. She can focus and create writing that is true to her vision. Or true to the characters’ vision, anyway.
The core methodology of ETNA is to provide focused, precise reports about potential errors so that I can make a judgment, and then I can manually make corrections to the manuscript. There is zero chance of the computer messing things up because the computer never writes to the manuscript. Which means that any mistakes you find in a novel are my fault.
Beyond those core ideas, ETNA is a collection of tools: Python scripts that iterate over the chapters of the manuscript and identify increasingly subjective items, from structural errors to typography slips, to spelling, to grammar, and a touch of style. In this segment, I went through the first, more objective, portion of the tool set, with its focus on regex, Hunspell, and a touch of RapidFuzz matching. In my next segment, I will look at the more subjective half, with its focus on LanguageTool.
If you want to know about ETNA, Eddie Tonkoi’s Narrative Analysis, come and ask me over in the Slack community at podfeet.com/slack, where I and all the other lovely NosillaCastaways enjoy friendly, positive online conversations. Feel free to message me, Eddie Tonkoi, if you have any thoughts, questions, or techniques you’re using. It would be nice to share ideas.
You can also find our work at jerntonkoi.com, where you’ll find Jern’s character-driven queer love stories, the audiobooks I produce for them, and bonus material for our subscribers.
I’ll be back soon to talk through some more of my workflow but, for now, happy editing, and happy reading.
