[Image: Eddie’s very silly Certificate of Success, listing all of the elements he describes in these articles]

Building a Fiction Editing Pipeline Part 2 – Grammar — Eddie Tonkoi

Intro

In my last article (Building a Fiction Editing Pipeline with Regex and Python — Part 1 by Eddie Tonkoi), I introduced my pipeline, including the rationale behind my Narrative Analysis toolkit, ETNA. This time, I’ll be looking at the more subjective side, starting with grammar.

The problem with grammar

Unlike syntax and spelling, grammar is a fairly subjective ruleset, especially in fiction. There are plenty of rules that look plain wrong when broken, but others are more like guidelines, some depend on region, and still others are broken willy-nilly in dialogue.

Last time, I said that one of my requirements was:

  • I want it to remember my custom words, custom phrases, and things I have chosen to ignore within a single book.

This is one of the reasons why I didn’t want to go with Grammarly or other solutions that offered some control, but not enough. Or at least, not on the free tier. Not that I blame them.
Instead, I came across LanguageTool via OpenOffice and a conversation with ChatGPT.

LanguageTool (LT)

LanguageTool started life back in 2003 as a university project by Daniel Naber and has since grown into a widely used, open-source grammar and style checker. At its heart it’s a rule-based, language-independent engine designed to support many languages, especially those that don’t have expensive commercial tools, and it’s now maintained by a community of volunteer contributors. No Thai yet, by the looks of it, though. In recent years it’s added optional cloud-based, AI-powered checks, but the core engine remains free, offline-capable, and fully customisable. That combination—transparent rules, strong grammar coverage, and the ability to run my own local server and tweak what it flags—makes it a perfect fit for building a bespoke fiction-editing pipeline.

LT Pipeline

LT starts by segmenting the text into sentences and tokens, which means it breaks the text into words and phrases following a stringent set of rules. It is then able to tag these with Parts-of-Speech (POS), for example, identifying verbs, nouns, adjectives, etc. It’s also able to group related words, knowing that children is the plural form of child, for example.

Now, with sentences abstracted away into Parts-of-Speech rather than specific words, it can run its extensive suite of rules over each sentence or phrase to see whether anything breaks.

If something is found to break a rule, LT returns the suspect along with an explanation of the rule and a suggestion for what it should be. Depending on the mistake, the wording used could be forceful (Sentences must…) or gentle (You might have meant…).

Running a server

LT can be run from a CLI, a command-line interface, and that lets me set it up as a local server for my Python scripts to talk to.

At a high level, the process is:

  • Install Java so you can run LanguageTool on your machine.
  • Download the LanguageTool package that includes the server JAR.
  • Start the local HTTP server on a chosen port (I use 8081).
  • Confirm it’s responding by hitting one of its test endpoints in a browser or with a simple request.
  • Point your Python script at that local server and run checks using the en-GB ruleset (see the sketch after this list).
  • Shut down the server cleanly on exit.
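
For the curious, here is a minimal sketch of those steps in Python. It is not my actual script, and it assumes the server JAR is called languagetool-server.jar and sits in the working folder; adjust the path for your own install.

# Minimal sketch: start a local LanguageTool server, run one check, shut down.
# Assumption: languagetool-server.jar is in the working folder.
import atexit
import subprocess
import time
import requests

LT_JAR = "languagetool-server.jar"

# Start the local HTTP server on port 8081.
server = subprocess.Popen(
    ["java", "-cp", LT_JAR, "org.languagetool.server.HTTPServer", "--port", "8081"]
)
atexit.register(server.terminate)  # shut the server down cleanly on exit
time.sleep(3)  # crude wait; polling until the server responds is nicer

# Run a check against the en-GB ruleset.
resp = requests.post(
    "http://localhost:8081/v2/check",
    data={"language": "en-GB", "text": "He walked toward the door."},
)
for match in resp.json()["matches"]:
    print(match["rule"]["id"], "-", match["message"])
    print("  suggestions:", [r["value"] for r in match["replacements"]])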

That’s LT, and it runs beautifully by itself, but I’ve added some extra suppressions and house rules.

Suppressions and House Rules

Once LanguageTool is running, I add two layers on top to make it behave like a fiction editor rather than a schoolteacher.

Suppressions (keep the signal, drop the noise)

First, I load a small suppression list for this book. It contains:

  • ignored phrases—things that are intentionally “wrong” in context: character voice, dialect spellings, invented names, recurring wording, and stylistic tics that are part of the narrative; and
  • disabled rule IDs—specific LanguageTool checks I don’t want in this project, either because they don’t suit fiction, or because I’ve decided they’re no longer worth spending attention on for this book.

That way, the report stays readable, and I’m not re-litigating the same deliberate choices chapter after chapter, time after time.

This matters because I want to clear every single issue. Say there are 15 instances of LT warning me about the use of “he shoulda known better”. I check each one, and they are all in dialogue, all idiomatically correct. By adding the rule that flags “shoulda” to my ignore list, they all stop showing, which reduces noise and lets me see clearly. I think this method greatly reduces the chance of overlooking a real error.
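
To give a flavour of how simple this layer can be, here is a hypothetical sketch. The file name and JSON structure are illustrative assumptions, not my real format; the idea is just to drop matches for disabled rules and deliberately “wrong” phrases before they reach the report.

# Hypothetical suppression layer; file name and JSON keys are illustrative.
import json

def load_suppressions(path):
    # e.g. {"disabled_rules": ["SOME_RULE_ID"],
    #       "ignored_phrases": ["shoulda"]}
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def filter_matches(matches, text, supp):
    kept = []
    for m in matches:
        flagged = text[m["offset"]:m["offset"] + m["length"]]
        if m["rule"]["id"] in supp["disabled_rules"]:
            continue  # check disabled for this project
        if flagged in supp["ignored_phrases"]:
            continue  # deliberate wording for this book
        kept.append(m)
    return kept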

House Rules (teach LT my style)

Then come my actual House Rules: custom LanguageTool XML rules that encode the way Jern writes in this series—preferred forms, banned forms, and recurring patterns I want caught consistently. These aren’t “ignore it when it happens”; they’re “flag it every time, because it matters”.

Put together, suppressions stop the tool wasting my time, and House Rules turn it into a repeatable, book-specific style checker.

One of my simplest and most important House Rules is a British style preference: use “towards” rather than “toward”, and more generally prefer “-wards” adverbs like backwards, downwards, inwards.

LanguageTool will flag the shorter forms and suggest the “-s” version—but the rule is smart enough not to fire when the word is acting like an adjective before a noun. So it will catch “He stumbled backward into the chair” → backwards, but it won’t nag about “backward compatibility” or “an upward trend”.

The importance of this is that “backwards” and “backward” are both technically correct, so LT wouldn’t normally flag either, but I’ve decided that I want to be consistent across all of our books, so I wanted a clear rule. LT allowed me to create my own.

Technical detail

Now, if you’ll forgive me, I’ll spend a couple of minutes describing that rule a bit more, then get back to the overview.

In LanguageTool, a “House Rule” is basically a little XML pattern-matcher. Each rule has a unique id (so I can turn it on/off later), a human-readable name, and then a <pattern> made of one or more <token> elements.

A token is what it sounds like: one word (or sometimes punctuation), but crucially it can be case-insensitive and regex-based. So my “toward → towards” rule is just a single-token pattern that matches toward regardless of case, then a <message> explaining why it’s flagged and a <suggestion> offering the replacement.

The slightly clever bit is where it becomes POS-aware. In the -ward → -wards rule, the token matches a whole list of words using regexp="yes" (so one rule covers backward, upward, inward…). Then I add an <exception scope="next" postag="NN.*" postag_regexp="yes"/>, which means “don’t fire if the next token is a noun”—because backward compatibility is adjectival, not an adverb.

For the suggestion, I’m not hard-coding every replacement; I use <match … regexp_replace=…> to rewrite the matched word, so (.*)ward becomes $1wards. That’s the general pattern: match tokens, optionally exempt based on neighbouring POS tags, then generate a suggestion, with a couple of <example> lines so I can sanity-check it later.
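
Put together, the whole thing looks roughly like the snippet below. Treat it as a reconstruction for illustration: the rule ID and message match my report, but the word list is shortened and the category name is simply a plausible label, not a copy-paste from my grammar file.

<category id="ETNA_WARD_RULES" name="ETNA direction rules">
  <rule id="ETNA_WARD_ADD_S" name="Prefer the '-wards' form">
    <pattern>
      <token regexp="yes">backward|downward|inward|onward|outward|toward|upward
        <exception scope="next" postag="NN.*" postag_regexp="yes"/>
      </token>
    </pattern>
    <message>Use the '-wards' form for consistency.</message>
    <suggestion><match no="1" regexp_match="(.*)ward" regexp_replace="$1wards"/></suggestion>
    <example correction="backwards">He stumbled <marker>backward</marker> into the chair.</example>
    <example>The new version keeps backward compatibility.</example>
  </rule>
</category>

Note that the single <exception> line carries all of the POS awareness.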

The output in the report, designed and generated by my Python code, has been formatted to provide that information back to me in a meaningful way. Here is an example:

🧩 Rule: 'ETNA_WARD_ADD_S'
Message: Use the '-wards' form for consistency.
Category: ETNA_WARD_RULES
Issue type: uncategorized

Occurrences: 12

📍 Chunk: 'chapter_004.txt'
Suggestion:
  onwards

Context:
...rry on, unaffected. The golden one zips onward, bouncing off invisible edges, zigzaggi...

Notice how the report groups everything by rule ID first—that’s deliberate. It means I can make one decision and apply it consistently: fix all instances, suppress a noisy phrase, or disable the rule entirely. Under that, it tells me the human explanation LanguageTool provides, the category I filed it under (in this case my ETNA rules), and the issue type, which in theory helps me triage what’s “real grammar” versus “style preference”, but which in practice I’ve ignored so far. Then for each hit I get the suggestion and a short context snippet showing exactly where it fired, plus which chapter chunk it came from—enough to judge whether it’s narration, dialogue, or a deliberate voice choice before I accept the change.
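
For what it’s worth, the grouping step itself is only a few lines of Python; simplified here:

# Simplified: group LT matches by rule ID so each rule becomes one decision.
from collections import defaultdict

def group_by_rule(matches):
    groups = defaultdict(list)
    for m in matches:
        groups[m["rule"]["id"]].append(m)
    return groups

def print_report(groups):
    # Busiest rules first, so the biggest single decisions come up top.
    for rule_id, hits in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        print(f"🧩 Rule: '{rule_id}'")
        print(f"Message: {hits[0]['message']}")
        print(f"Occurrences: {len(hits)}")
        for m in hits:
            print(f"Context: {m['context']['text']}")
        print()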

If necessary, I can quickly copy some text from the Context and use Find in Scrivener to see it in full context and decide whether to make changes.

As I have said before, I will make all changes myself manually, and I never let the tool change the text.

Checking the grammar

LanguageTool is an amazing piece of software that does superb grammar analysis, handling higher-level issues such as:

  • grammar
  • homophone confusion
  • idiom errors
  • punctuation misuse
  • British vs American forms
  • my custom direction rules (“towards”, etc.)

This has overlap with previous tools, but also does a whole lot more. LanguageTool is the closest thing to a human editor—but it’s not perfect, and it can take quite some time to go through all of its suggestions, choosing which to take on board. That’s why I implemented some of the previous checks, so that by the time I get to this script with LanguageTool, a lot of the low-hanging fruit has already been dealt with in a calm, systematic manner.

I will run this script countless times throughout the process, picking things off bit by bit, starting with the certain mistakes and leaving any style choices until the book has cleared Jern’s final edit. This script’s report always has the most entries and is always the last to be cleared, but it really does help me polish the novel.

I have a few more small scripts still, with specific tasks.

Duplicate Phrase Drift

This Python script uses fuzzy matching to detect lines that should match exactly but don’t. In Murder in Treggan Bay, there’s a shop sign that recurs in the book. This script was able to tell me that the phrase written on it was incorrect in one instance:

  • “Please help yourself. Be back in 15 minutes.”
    vs
  • “Please help yourself. Back in 15 minutes.”

Again, it is using RapidFuzz, but with quite tight settings, including a fuzzy score ≥ 95, so that we don’t get swamped by false positives.
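
In case you want to try the same trick, this is the general shape of it, not the script itself:

# Sketch of the drift check: report pairs of lines that score at least 95
# on RapidFuzz's ratio but are not identical, i.e. near-misses.
from itertools import combinations
from rapidfuzz import fuzz

def find_drift(lines, threshold=95):
    pairs = []
    for a, b in combinations(sorted(set(lines)), 2):
        score = fuzz.ratio(a, b)
        if threshold <= score < 100:
            pairs.append((score, a, b))
    return sorted(pairs, reverse=True)

for score, a, b in find_drift([
    "Please help yourself. Be back in 15 minutes.",
    "Please help yourself. Back in 15 minutes.",
]):
    print(f"{score:.1f}\n  {a}\n  {b}")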

I was so impressed when this actually worked!

Finding out what else RapidFuzz can do for me is on my to-do list.

Simile & Crutch Word Analysis

This script gives me a 40,000 ft view of stylistic patterns, to see if anything stands out. It does a few things that I thought might be useful, though I have yet to see if they will bring up anything of substance (see the sketch after this list):

  • “like a [noun]” similes, to look for over-reliance on like,
  • repeated head nouns (“like a cat”, “like a ghost”), to find whether the writing falls back on the same simile, though of course it cannot tell if this was intentional,
  • crutch words (“just”, “really”, “suddenly”), in case they repeat too often,
  • cliché detection, looking for clichés and counting how often they are used, which so far has been surprisingly little.
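
The counting behind those checks is nothing exotic. Here is a simplified sketch, with illustrative word lists rather than my real configuration:

# Simplified simile and crutch-word counter; the word lists are illustrative.
import re
from collections import Counter

CRUTCH_WORDS = {"just", "really", "suddenly"}

def analyse(text):
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    crutch = Counter(w for w in words if w in CRUTCH_WORDS)
    # "like a <noun>" similes, tallied by their head noun
    similes = Counter(re.findall(r"\blike an? (\w+)", lowered))
    return crutch, similes

crutch, similes = analyse("He moved like a cat, really quietly. Just like a ghost.")
print("Crutch words:", crutch.most_common())
print("Simile head nouns:", similes.most_common())

The Gesture & Intensifier script in the next section works in much the same way, just with different word lists.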

Once again, the goal here is to offer insight into voice and habits without prescribing solutions. This and the next script were added as final additions when I asked ChatGPT if there was anything I could, or rather should, add to my pipeline without moving to AI.

Gesture & Intensifier Repetition

This final script tracks habitual gestures & intensifiers:

  • shrug, sigh, glance, stare
  • slowly, gently, suddenly
  • very, really, just

I’ve only just added this one and haven’t really tried it yet. The idea is to see whether these gestures or adverbs occur too frequently, and to flag them if so.

Recap

To provide consistent, customisable spelling and grammar checking for Jern’s novels, I created a pipeline based around Python scripts that uses regex, Hunspell, and LanguageTool to generate reports for different potential issues.

This gives me confidence that I can work through a myriad of potential issues without missing any, and it allows me to customise on a per-book basis.

  • Regex checks let me look for structural errors.
  • Hunspell gives the power of custom dictionaries for each book.
  • LanguageTool allows customised grammar-checks.

With these reports, I am able to choose what is an error and what was deliberate or just noise, and then manually make changes to either my custom rules or the manuscript.
As I do this, the reports get shorter until they are totally clear. I have now arranged for the pipeline to award an official Certificate of Correctness to the novel when the full run completes without any flagged errors.

If you’d like to know more about how this editing pipeline works, or you have other solutions for these problems, please come over to podfeet.com/slack, where you can find me and all the other lovely NosillaCastaways enjoying a chat. I’ve also started trying out Mastodon, where I’m @EddieTonkoi.

To find out more about Jern’s writing you can head over to jerntonkoi.com, which has detailed information about all her books, or follow her on Instagram where she is @tonkoibooks. You can also find out how I created that website on previous segments for the NosillaCast.

Happy editing, and happy reading.
