{"id":35063,"date":"2026-01-01T08:03:06","date_gmt":"2026-01-01T16:03:06","guid":{"rendered":"https:\/\/www.podfeet.com\/blog\/?p=35063"},"modified":"2026-01-01T08:03:06","modified_gmt":"2026-01-01T16:03:06","slug":"building-a-fiction-editing-pipeline-part-2-grammar-eddie-tonkoi","status":"publish","type":"post","link":"https:\/\/www.podfeet.com\/blog\/2026\/01\/building-a-fiction-editing-pipeline-part-2-grammar-eddie-tonkoi\/","title":{"rendered":"Building a Fiction Editing Pipeline Part 2 &#8211; Grammar \u2014 Eddie Tonkoi"},"content":{"rendered":"<h2>Intro<\/h2>\n<p>In my last article, (<a href=\"https:\/\/www.podfeet.com\/blog\/2025\/12\/editing-pipeline-structur-spelling-eddie-tonkoi\/\">Building a Fiction Editing Pipeline with Regex and Python \u2014 Part 1 by Eddie Tonkoi<\/a>), I introduced my pipeline, including the rationale behind my Narrative Analysis toolkit, ETNA. This time, I\u2019ll be looking at the more subjective part, starting with grammar.<\/p>\n<h2>The problem with grammar<\/h2>\n<p>Unlike syntax and spelling, grammar is a fairly subjective ruleset, especially in a fiction novel. There are plenty of rules that just look plain wrong if they are broken, but there are also others that are more like guidelines, some that are dependent on region, and others that are broken willy-nilly in dialogue.<\/p>\n<p>Last time, I said that one of my requirements was,<\/p>\n<ul>\n<li>I want it to remember my custom words, custom phrases, and things I have chosen to <em>ignore<\/em> within a single book.<\/li>\n<\/ul>\n<p>This is one of the reasons why I didn&#8217;t want to go with Grammarly or other solutions that offered some control, but not enough. Or at least, not on the free tier. 
Not that I blame them.<br \/>\nInstead, I came across LanguageTool via OpenOffice and a conversation with ChatGPT.<\/p>\n<h2>LanguageTool (LT)<\/h2>\n<p><a href=\"https:\/\/languagetool.org\">LanguageTool<\/a> started life back in 2003 as a university project by Daniel Naber and has since grown into a widely used, open-source grammar and style checker. At its heart it\u2019s a rule-based, language-independent engine designed to support many languages, especially those that don\u2019t have expensive commercial tools, and it\u2019s now maintained by a community of volunteer contributors. No Thai yet, by the looks of it, though. In recent years it\u2019s added optional cloud-based, AI-powered checks, but the core engine remains free, offline-capable, and fully customisable. That combination\u2014transparent rules, strong grammar coverage, and the ability to run my own local server and tweak what it flags\u2014makes it a perfect fit for building a bespoke fiction-editing pipeline.<\/p>\n<h3>LT Pipeline<\/h3>\n<p>LT starts by segmenting the text into sentences and tokens; that is, it breaks the text into words and phrases following a stringent set of rules. It is then able to tag these with Parts-of-Speech (POS), for example, identifying verbs, nouns, adjectives, etc. It&#8217;s also able to group related forms by knowing that <em>children<\/em> is the plural of <em>child<\/em>.<\/p>\n<p>Now, with sentences abstracted away into Parts-of-Speech rather than worrying about what the words are, it can run its extensive suite of rules on the sentence or phrase to see if any of them are broken.<\/p>\n<p>If something is found to break a rule, LT returns the suspect along with an explanation of the rule and a suggestion for what it should be. 
Depending on the mistake, the wording used could be forceful (Sentences must\u2026), or gentle (You might have meant\u2026).<\/p>\n<h3>Running a server<\/h3>\n<p>LT can be run with a CLI, a command-line interface, which lets me set it up for use with my Python scripts.<\/p>\n<p>At a high level, the process is:<\/p>\n<ul>\n<li>Install Java so you can run LanguageTool on your machine.<\/li>\n<li>Download the LanguageTool package that includes the server JAR.<\/li>\n<li>Start the local HTTP server on a chosen port (I use 8081).<\/li>\n<li>Confirm it\u2019s responding by hitting one of its test endpoints in a browser or with a simple request.<\/li>\n<li>Point your Python script at that local server and run checks using the en-GB ruleset.<\/li>\n<li>Shut down the server cleanly on exit.<\/li>\n<\/ul>\n<p>That&#8217;s LT, and it runs beautifully by itself, but I&#8217;ve added some extra suppressions and house rules.<\/p>\n<h3>Suppressions and House Rules<\/h3>\n<p>Once LanguageTool is running, I add two layers on top to make it behave like a fiction editor rather than a schoolteacher.<\/p>\n<h4>Suppressions (keep the signal, drop the noise)<\/h4>\n<p>First, I load a small suppression list for this book. It contains:<\/p>\n<ul>\n<li>ignored phrases\u2014things that are intentionally \u201cwrong\u201d in context: character voice, dialect spellings, invented names, recurring wording, and stylistic tics that are part of the narrative; and<\/li>\n<li>disabled rule IDs\u2014specific LanguageTool checks I don\u2019t want in this project, either because they don\u2019t suit fiction, or because I\u2019ve decided they\u2019re no longer worth spending attention on for this book.<\/li>\n<\/ul>\n<p>That way, the report stays readable, and I\u2019m not re-litigating the same deliberate choices chapter after chapter.<\/p>\n<p>This matters because I want to clear every single issue. 
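To make the last two steps concrete, here is a minimal Python sketch of how a script might talk to that local server and then apply a suppression list. The endpoint and response fields follow LanguageTool's documented HTTP API; the list names and the filtering helper are my own illustration, not the actual ETNA code.

```python
# Hedged sketch, not my actual script: calls the local LanguageTool server
# and drops suppressed matches. The /v2/check endpoint, "text"/"language"
# parameters, and the "matches" JSON shape come from LT's documented HTTP
# API; IGNORED_PHRASES and DISABLED_RULES are illustrative inventions.
import json
import urllib.parse
import urllib.request

LT_URL = "http://localhost:8081/v2/check"  # local server on port 8081, as above

# Illustrative per-book suppression list
IGNORED_PHRASES = {"he shoulda known better"}
DISABLED_RULES = {"MORFOLOGIK_RULE_EN_GB"}  # example rule ID

def check_text(text: str, language: str = "en-GB") -> list[dict]:
    """POST a chunk of text to the local LT server and return its matches."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode()
    with urllib.request.urlopen(LT_URL, data=data) as resp:
        return json.load(resp)["matches"]

def filter_matches(matches: list[dict]) -> list[dict]:
    """Drop matches whose rule ID is disabled or whose flagged text is ignored."""
    kept = []
    for m in matches:
        ctx = m["context"]
        flagged = ctx["text"][ctx["offset"] : ctx["offset"] + ctx["length"]].lower()
        if m["rule"]["id"] in DISABLED_RULES or flagged in IGNORED_PHRASES:
            continue  # suppressed: deliberate choice already made for this book
        kept.append(m)
    return kept
```

The real script would, of course, load the suppression list from a per-book file rather than hard-coding it.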
Let&#8217;s say there are 15 instances of LT warning me about the use of &#8220;he shoulda known better&#8221;. I checked each one, and they were all in dialogue, all idiomatically correct. By adding the rule that checks for &#8220;shoulda&#8221; to my ignore list, these all stop showing, which reduces noise and lets me see clearly. I think this method greatly reduces the chance of overlooking an error.<\/p>\n<h4>House Rules (teach LT my style)<\/h4>\n<p>Then come my actual House Rules: custom LanguageTool XML rules that encode the way Jern writes in this series\u2014preferred forms, banned forms, and recurring patterns I want caught consistently. These aren\u2019t \u201cignore it when it happens\u201d; they\u2019re \u201cflag it every time, because it matters\u201d.<\/p>\n<p>Put together, suppressions stop the tool wasting my time, and House Rules turn it into a repeatable, book-specific style checker.<\/p>\n<p>One of my simplest and most important House Rules is a British style preference: use \u201ctowards\u201d rather than \u201ctoward\u201d, and more generally prefer \u201c-wards\u201d adverbs like backwards, downwards, inwards.<\/p>\n<p>LanguageTool will flag the shorter forms and suggest the \u201c-s\u201d version\u2014but the rule is smart enough not to fire when the word is acting like an adjective before a noun. So it will catch \u201cHe stumbled backward into the chair\u201d \u2192 backwards, but it won\u2019t nag about \u201cbackward compatibility\u201d or \u201can upward trend\u201d.<\/p>\n<p>The importance of this is that &#8220;backwards&#8221; and &#8220;backward&#8221; are both technically correct, so LT wouldn&#8217;t normally flag either, but I&#8217;ve decided that I want to be consistent across all of our books, so I wanted a clear rule. 
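For the curious, a House Rule of this kind is a small snippet of LanguageTool's rule XML. Here is a hedged sketch of what the simple toward → towards case might look like; the id, name, and message wording are my own illustration, not the actual rule.

```xml
<!-- Illustrative sketch in LanguageTool's grammar.xml format; the id,
     name, and wording are my own, not the shipped or actual ETNA rule. -->
<rule id="ETNA_TOWARD_ADD_S" name="Prefer 'towards' (British style)">
  <pattern>
    <!-- tokens match case-insensitively by default -->
    <token>toward</token>
  </pattern>
  <message>Use the British form <suggestion>towards</suggestion> for consistency.</message>
  <example correction="towards">She walked <marker>toward</marker> the door.</example>
  <example>She walked towards the door.</example>
</rule>
```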
LT allowed me to create my own clear tool.<\/p>\n<h4>Technical detail<\/h4>\n<p>Now, if you&#8217;ll forgive me, I&#8217;ll spend a couple of minutes describing that rule a bit more, then get back to the overview.<\/p>\n<p>In LanguageTool, a \u201cHouse Rule\u201d is basically a little XML pattern-matcher. Each rule has a unique id (so I can turn it on\/off later), a human-readable name, and then a <strong>&lt;pattern&gt;<\/strong> made of one or more <strong>&lt;token&gt;<\/strong> elements.<\/p>\n<p>A token is what it sounds like: one word (or sometimes punctuation), but crucially it can be case-insensitive and regex-based. So my \u201ctoward \u2192 towards\u201d rule is just a single-token pattern that matches toward regardless of case, then a <strong>&lt;message&gt;<\/strong> explaining why it\u2019s flagged and a <strong>&lt;suggestion&gt;<\/strong> offering the replacement.<\/p>\n<p>The slightly clever bit is where it becomes POS-aware. In the -ward \u2192 -wards rule, the token matches a whole list of words using regexp=\"yes\" (so one rule covers backward, upward, inward\u2026). Then I add an <strong>&lt;exception scope=\"next\" postag=\"NN.*\" postag_regexp=\"yes\"\/&gt;<\/strong>, which means \u201cdon\u2019t fire if the next token is a noun\u201d\u2014because backward compatibility is adjectival, not an adverb.<\/p>\n<p>For the suggestion, I\u2019m not hard-coding every replacement; I use <strong>&lt;match \u2026 regexp_replace=\u2026&gt;<\/strong> to rewrite the matched word, so (.*)ward becomes $1wards. That\u2019s the general pattern: match tokens, optionally exempt based on neighbouring POS tags, then generate a suggestion, with a couple of <strong>&lt;example&gt;<\/strong> lines so I can sanity-check it later.<\/p>\n<p>The output in the report, designed and generated by my Python code, has been formatted to provide that information back to me in a meaningful way. 
Here is an example:<\/p>\n<pre><code>\ud83e\udde9 Rule: 'ETNA_WARD_ADD_S'\nMessage: Use the '-wards' form for consistency.\nCategory: ETNA_WARD_RULES\nIssue type: uncategorized\n\nOccurrences: 12\n\n\ud83d\udccd Chunk: 'chapter_004.txt'\nSuggestion:\n\u200b onwards\n\nContext:\n...rry on, unaffected. The golden one zips onward, bouncing off invisible edges, zigzaggi...\n<\/code><\/pre>\n<p>Notice how the report groups everything by <strong>rule ID<\/strong> first\u2014that\u2019s deliberate. It means I can make one decision and apply it consistently: fix all instances, suppress a noisy phrase, or disable the rule entirely. Under that, it tells me the <strong>human explanation<\/strong> LanguageTool provides, the <strong>category<\/strong> I filed it under (in this case my ETNA rules), and the <strong>issue type<\/strong>, which in theory helps me triage what\u2019s \u201creal grammar\u201d versus \u201cstyle preference\u201d, though in practice I&#8217;ve ignored it so far. Then for each hit I get the <strong>suggestion<\/strong> and a short <strong>context snippet<\/strong> showing exactly where it fired, plus which chapter chunk it came from\u2014enough to judge whether it\u2019s narration, dialogue, or a deliberate voice choice before I accept the change.<\/p>\n<p>If necessary, I can quickly copy some text from the <em>Context<\/em> and use <strong>Find<\/strong> in Scrivener to see it in full context and decide whether to make changes.<\/p>\n<p>As I have said before, I will make all changes myself manually, and I never let the tool change the text.<\/p>\n<h3>Checking the grammar<\/h3>\n<p>LanguageTool is an amazing piece of software that does superb grammar analysis, handling higher-level issues such as:<\/p>\n<ul>\n<li>grammar<\/li>\n<li>homophone confusion<\/li>\n<li>idiom errors<\/li>\n<li>punctuation misuse<\/li>\n<li>British vs American forms<\/li>\n<li>my custom direction rules (&#8220;towards&#8221;, etc.)<\/li>\n<\/ul>\n<p>This has overlap with previous 
tools, but also does a whole lot more. LanguageTool is the closest thing to a human editor\u2014but it&#8217;s not perfect, and it can take quite some time to go through all of its suggestions, choosing which to take on board. That&#8217;s why I implemented some of the previous checks, so that by the time I get to this script with LanguageTool, a lot of the low-hanging fruit has already been dealt with in a calm, systematic manner.<\/p>\n<p>I will run this script countless times over the course of the process, picking things off bit by bit, starting with the certain mistakes and leaving any style choices until the book has cleared Jern&#8217;s final edit. This script&#8217;s report always has the most entries and is always the last to be cleared, but it really does help me polish the novel.<\/p>\n<p>I still have a few more small scripts, each with a specific task.<\/p>\n<h2>Duplicate Phrase Drift<\/h2>\n<p>This Python script uses fuzzy matching to detect lines that should match exactly but don\u2019t. In <strong>Murder in Treggan Bay<\/strong>, there&#8217;s a shop sign that recurs in the book. This script was able to tell me that the phrase written on it was incorrect in one instance:<\/p>\n<ul>\n<li>\u201cPlease help yourself. Be back in 15 minutes.\u201d<br \/>\nvs<\/li>\n<li>\u201cPlease help yourself. Back in 15 minutes.\u201d<\/li>\n<\/ul>\n<p>Again, it is using <strong>RapidFuzz<\/strong>, but with quite tight settings, including a fuzzy score \u2265 95, so that we don&#8217;t get swamped by false positives.<\/p>\n<p>I was so impressed when this actually worked!<\/p>\n<p>Finding out what else RapidFuzz can do for me is on my todo list.<\/p>\n<h2>Simile &amp; Crutch Word Analysis<\/h2>\n<p>This script gives me a 40,000 ft view of stylistic patterns, to see if anything stands out. 
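Before moving on, the near-duplicate idea from the Duplicate Phrase Drift section can be sketched with the standard library alone. Here difflib's SequenceMatcher ratio stands in for RapidFuzz's scoring (the real script uses RapidFuzz with a cutoff of 95); the function name is my own illustration.

```python
# Rough sketch of the near-duplicate check using only the standard library.
# difflib's ratio (0.0 to 1.0) stands in for RapidFuzz's 0-100 score here;
# the real script uses RapidFuzz with a score cutoff of 95.
from difflib import SequenceMatcher
from itertools import combinations

def drift_pairs(lines: list[str], threshold: float = 0.95) -> list[tuple[str, str]]:
    """Return pairs of lines that are very similar but not identical."""
    suspects = []
    for a, b in combinations(lines, 2):
        if a == b:
            continue  # exact repeats are fine; we want near-misses
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            suspects.append((a, b))
    return suspects
```

Run on the shop-sign example above, this flags the pair that differs only by "Be", while unrelated lines score far below the threshold.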
It does a few things that I thought might be useful, but I have yet to see if it will bring up anything of substance:<\/p>\n<ul>\n<li>\u201clike a [noun]\u201d similes, looking for over-reliance on <strong>like<\/strong> for similes;<\/li>\n<li>repeated head nouns (\u201clike a cat\u201d, \u201clike a ghost\u201d), finding whether the writing falls back on the same simile, though of course it cannot tell if this was intentional;<\/li>\n<li>crutch words (\u201cjust\u201d, \u201creally\u201d, \u201csuddenly\u201d), just in case they repeat very often;<\/li>\n<li>clich\u00e9 detection, looking for clich\u00e9s and counting how often they are used, which has happened surprisingly little.<\/li>\n<\/ul>\n<p>Once again, the goal here is to offer insight into voice and habits without prescribing solutions. This and the next script were added when I asked ChatGPT if there was anything I could, or rather should, add to my pipeline without moving to AI.<\/p>\n<h2>Gesture &amp; Intensifier Repetition<\/h2>\n<p>This final script tracks habitual gestures &amp; intensifiers:<\/p>\n<ul>\n<li>shrug, sigh, glance, stare<\/li>\n<li>slowly, gently, suddenly<\/li>\n<li>very, really, just<\/li>\n<\/ul>\n<p>I&#8217;ve only just added this one and not really tried it yet. 
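As a rough illustration of what such a tracker might do (the word lists here are my own minimal stand-ins, not the script's actual lists):

```python
# Minimal sketch of a gesture/intensifier counter. The word lists and
# function name are illustrative assumptions, not the actual ETNA code.
import re
from collections import Counter

GESTURES = {"shrug", "shrugged", "sigh", "sighed", "glance", "glanced", "stare", "stared"}
INTENSIFIERS = {"slowly", "gently", "suddenly", "very", "really", "just"}

def habit_counts(text: str) -> Counter:
    """Count how often each tracked gesture or intensifier appears."""
    words = re.findall(r"[a-z']+", text.lower())
    watched = GESTURES | INTENSIFIERS
    return Counter(w for w in words if w in watched)
```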
The idea is to see if these gestures or adverbs occur frequently and to flag them.<\/p>\n<h1>Recap<\/h1>\n<p>To provide consistent, customisable spelling and grammar checking for Jern&#8217;s novels, I created a pipeline based around Python scripts that uses regex, Hunspell, and LanguageTool to generate reports for different potential issues.<\/p>\n<p>This gives me confidence that I can work through a myriad of potential issues without missing any, and it allows me to customise on a per-book basis.<\/p>\n<ul>\n<li>Regex checks let me look for structural errors.<\/li>\n<li>Hunspell gives the power of custom dictionaries for each book.<\/li>\n<li>LanguageTool allows customised grammar checks.<\/li>\n<\/ul>\n<p>With these reports, I am able to choose what is an error and what was deliberate or just noise, and then manually make changes to either my custom rules or the manuscript.<br \/>\nAs I do this, the reports get shorter until they are totally clear, at which point I&#8217;ve arranged for the pipeline to provide an official Certificate of Correctness, awarded to the novel when the full run completes without any flagged errors.<\/p>\n<p>If you&#8217;d like to know more about how this editing pipeline works, or you have other solutions for these problems, please come over to <a href=\"https:\/\/podfeet.com\/slack\">podfeet.com\/slack<\/a>, where you can find me and all the other lovely NosillaCastaways enjoying a chat. I&#8217;ve also started trying out Mastodon, where I&#8217;m <a href=\"https:\/\/mastodon.social\/@eddietonkoi\">@EddieTonkoi<\/a>.<\/p>\n<p>To find out more about Jern&#8217;s writing you can head over to <a href=\"jerntonkoi.com\">jerntonkoi.com<\/a>, which has detailed information about all her books, or follow her on Instagram where she is <a href=\"https:\/\/www.instagram.com\/tonkoibooks\/\">@tonkoibooks<\/a>. 
You can also find out how I created that website on previous segments for the <a href=\"https:\/\/www.podfeet.com\/blog\/?s=tonkoi\">NosillaCast<\/a>.<\/p>\n<p>Happy editing, and happy reading.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Intro In my last article, (Building a Fiction Editing Pipeline with Regex and Python \u2014 Part 1 by Eddie Tonkoi), I introduced my pipeline, including the rationale behind my Narrative Analysis toolkit, ETNA. This time, I\u2019ll be looking at the more subjective part, starting with grammar. The problem with grammar Unlike syntax and spelling, grammar [&hellip;]<\/p>\n","protected":false},"author":34,"featured_media":35149,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[147],"tags":[1105,1107,1934,7739,1899],"class_list":["post-35063","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog-posts","tag-book","tag-editing","tag-grammar","tag-language-tool","tag-open-source"],"jetpack_featured_media_url":"https:\/\/www.podfeet.com\/blog\/wp-content\/uploads\/2025\/12\/Eddie-very-silly-certificate-of-success-listing-all-of-the-elements-he-has-described-1040x520-1.jpg","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/35063","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/users\/34"}],"replies":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/w
p-json\/wp\/v2\/comments?post=35063"}],"version-history":[{"count":16,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/35063\/revisions"}],"predecessor-version":[{"id":35143,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/posts\/35063\/revisions\/35143"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/media\/35149"}],"wp:attachment":[{"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/media?parent=35063"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/categories?post=35063"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.podfeet.com\/blog\/wp-json\/wp\/v2\/tags?post=35063"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}