CCATP #649 – Dr. Helma van der Linden on Creating a Book with Open Source Software

This week our guest is Dr. Helma van der Linden from the Netherlands here to talk about how she created the Taming the Terminal book using all Open Source software.

On the NosillaCast I talked a lot about the book but I should probably give an explanation for the Chit Chat audience. Bart Busschots and I created the Taming the Terminal podcast and for this series on learning the macOS (and Linux) command line, Bart wrote a spectacular set of tutorial shownotes on his website at bartb.ie.

I had a dream for many years to make Taming the Terminal into a book as a surprise for Bart, but my few attempts to do it failed spectacularly. Around Easter, I mentioned my dream to Helma and she said, “I bet I could do that!”

I don’t expect anyone to learn from this discussion how to do what Helma did, but rather to learn what’s possible and how cool it was that Helma put these pieces together. Below are the rough shownotes we used for our discussion, included so that you have the links to all of the tools she explained in the episode.

Downloads of the book can be found at podfeet.com/tttbook and the GitHub project for Taming the Terminal can be found at github.com/bartificer/taming-the-terminal. You can communicate with Helma through our Slack community at podfeet.com/slack where her handle is @Helma.


Helma and I worked on this as a GitHub project where we both posted issues for ourselves and for each other and worked on it in tandem. I was merely the editor of the book, but what Helma did behind the scenes was actually a mystery to me until last week.

Q – Before we start, is it a true statement that every tool you used to create this book was open source or at least free (Grammarly)?

HL: Yes

Let’s start walking through how this happened.

Text of TTT was in html on bartb.ie
What format did you originally want to convert the text into?

HL: It started off with the conversion of the PBS HTML into Markdown. I wanted to help out by converting some of the older episodes to Markdown by using some nifty grep statements in shell scripts. I quickly realised that that was a very time-consuming task. So I thought ‘somebody must have solved this problem already’. After some googling I found a NodeJS package called Turndown (github.com/domchristie/turndown) that could convert HTML to Markdown. It’s extendable and configurable so I wrote pbsconvert (github.com/hepabolu/pbsconvert) and managed to convert all episodes to Markdown.
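To give a feel for what an HTML-to-Markdown converter does, here is a minimal sketch in Python. It is not Turndown and not pbsconvert; it only handles a handful of tags (headings, paragraphs, emphasis, links), and the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser


class MiniMarkdown(HTMLParser):
    """Convert a tiny subset of HTML (h1-h3, p, em, strong, a) to Markdown."""

    def __init__(self):
        super().__init__()
        self.out = []      # collected Markdown fragments
        self.href = None   # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")          # block elements end a paragraph
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "a":
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Collapse runs of whitespace, as a browser does for normal text.
        self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    blocks = [block.strip() for block in "".join(parser.out).split("\n\n")]
    return "\n\n".join(block for block in blocks if block) + "\n"
```

A real converter has to deal with lists, tables, images, nested markup, and escaping, which is why reusing an existing, configurable package was the faster route.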

For TTT I first wanted to turn the HTML straight into an ePub. I found an article online that explained how to do it but I quickly learned 2 things:

The HTML consists of more sections than just the main content. So each file had to be stripped of the menu and other irrelevant parts.
The article I found involved a lot of manual work. I really didn’t want to convert all 37 episodes by hand, find a mistake, and have to start all over again.

So I changed the plan: convert the episodes to Markdown with a slightly modified pbsconvert and then generate an ePub out of those Markdown files. I created a new little program called tttconvert (in the GitHub repo github.com/bartificer/taming-the-terminal/tree/master/tttconvert) that was basically a copy of pbsconvert with a few tweaks.

Q – How did you deal with the images?

For pbsconvert I had already decided that I was not going to download all images and all zip files by hand, so I added a little script: whenever it detected an image tag or a link to a zip file, it would download the file and put it in a matching folder. So all I had to do was use that same functionality in tttconvert.
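The idea of detecting assets and sorting them into a matching folder can be sketched as follows. This is an illustration in Python, not the actual pbsconvert code (which is Node.js); the `assets/<episode>/` folder layout and all names are assumptions.

```python
import os
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve


class AssetFinder(HTMLParser):
    """Collect the URLs of images and linked zip files in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(".zip"):
            self.urls.append(urljoin(self.base_url, attrs["href"]))


def plan_downloads(html, base_url, episode):
    """Return (url, local_path) pairs, one folder per episode (hypothetical layout)."""
    finder = AssetFinder(base_url)
    finder.feed(html)
    finder.close()
    plan = []
    for url in finder.urls:
        name = PurePosixPath(urlparse(url).path).name
        plan.append((url, f"assets/{episode}/{name}"))
    return plan


def download_all(plan):
    """Fetch every planned asset (needs network access)."""
    for url, path in plan:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urlretrieve(url, path)
```

Splitting the planning step from the downloading step makes it easy to check the folder names before anything is fetched.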

Inconsistency in the content was problematic – can you explain what tools you used to make the book self-consistent?

The inconsistency caused some problems in the naming of the files and the folders holding the assets such as the images. So I changed tttconvert to name them more consistently. Later on I manually fixed some inconsistencies so that the text and the links in the audio section would look the same in each episode.

Syntax highlighting is really important when showing code blocks – allows people to more easily interpret it. What tool did Bart use and what did you use?

Bart uses Crayon, a plugin for WordPress (wordpress.org/plugins/crayon-syntax-highlighter/), on his website to create highlighted code blocks. The Crayon plugin converts the code block into a table with the highlighted code and hides the actual textarea that contains the original code as Bart entered it. All this is wrapped in divs that mark where the Crayon CSS needs to be applied. So in the pbsconvert code (and therefore tttconvert) I had to find the outer wrapper div and, inside it, a reference to which language it was. That was rather straightforward in tttconvert, but there was more variation in pbsconvert.

Q – You wanted to preserve white space for indents so the code stayed readable – how did you do that?

Crayon splits up every line into sections with spans and color codes them accordingly. That is way too intricate to revert to simple text AND preserve the indent. So in the div I had to search for the textarea with the original content. That was great, until I found that the Turndown code stripped out all the whitespace such as new lines and indent spaces. A nicely indented piece of code would end up as a single line. After hours of searching, I found the reason and finally hacked the Turndown code to make it work for the Crayon code.
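The trick of ignoring the highlighted table and reading the hidden textarea instead can be sketched like this. It is a Python illustration, not the Turndown hack itself, and the class names in the sample HTML are assumptions about what the wrapper looks like.

```python
from html.parser import HTMLParser


class PlainCodeExtractor(HTMLParser):
    """Pull the untouched source text out of each <textarea>."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one string per textarea found
        self._buffer = None   # None means "not inside a textarea"

    def handle_starttag(self, tag, attrs):
        if tag == "textarea":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)   # keep newlines and indents exactly

    def handle_endtag(self, tag):
        if tag == "textarea" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_code_blocks(html):
    parser = PlainCodeExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def to_fenced(code, language):
    """Wrap code in a Markdown fence without touching its whitespace."""
    fence = "`" * 3
    return f"{fence}{language}\n{code}\n{fence}"
```

The important part is that nothing between the opening and closing textarea tags is trimmed or collapsed, so the indentation survives the conversion.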

Q – Once you started working with Markdown and tried to publish the book, what problems did you run into?

Markdown doesn’t support definition lists and Bart LOVES them.

Q – What’s a definition list for those unfamiliar with the term?

Markdown is built for long-form text such as articles and blog posts. All the standard markup for such texts is available. But other kinds of texts, such as technical documentation, require much more markup. That’s why different extensions of Markdown were created, each solving different problems. GitHub’s version of Markdown is also an extension, called GitHub Flavored Markdown or GFM. This was the version I wanted to use for pbsconvert, because Bart uses it for his newer episodes, and therefore also for tttconvert.

But Bart uses definition lists a lot and they are not supported in the original Markdown nor GFM.

A definition list is a special HTML markup for lists that hold a word and the explanation of that word. So each element of a definition list consists of a combination of the word (or phrase or whatever) and its explanation. It’s typically used for glossaries and other dictionary-like lists.

So I had to turn Bart’s definition lists into markdown tables, by hand, to build something that more or less looked like the original list. Thank goodness for source code editors like VSCode that support multiple cursors so the work didn’t take many hours to complete.
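For readers curious what that conversion amounts to, here is a rough Python sketch that turns an HTML definition list into a two-column Markdown table. Helma did this by hand in her editor; the code below is only an illustration, and the column headers are invented.

```python
import re
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect (term, definition) pairs from <dl><dt>..</dt><dd>..</dd></dl>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished [term, definition] pairs
        self._field = None   # "dt", "dd", or None while outside both
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._field = tag
            self._text = []

    def handle_data(self, data):
        if self._field:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._field:
            return   # inline tags such as </code> inside a term
        text = re.sub(r"\s+", " ", "".join(self._text)).strip()
        if tag == "dt":
            self.rows.append([text, ""])
        elif self.rows:
            self.rows[-1][1] = text
        self._field = None


def dl_to_markdown_table(html, headers=("Term", "Meaning")):
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()

    def esc(cell):
        return cell.replace("|", "\\|")   # a bare pipe would split the cell

    lines = [f"| {headers[0]} | {headers[1]} |", "| --- | --- |"]
    lines += [f"| {esc(term)} | {esc(meaning)} |" for term, meaning in parser.rows]
    return "\n".join(lines)
```

A table only approximates a definition list: it works for short explanations but gets awkward when a definition runs to several paragraphs.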

I found a tool called Pandoc (github.com/jgm/pandoc) that would be able to convert the Markdown into an ePub, but I couldn’t get it to generate 1 ePub out of 37 different files. It also only supported Pygments (pygments.org/) as source code highlighter and Pygments didn’t support a lot of the built-in commands of Bash. I found a code snippet that basically created a new plugin for Pandoc and extended Pygments to support the built-in commands, but Pandoc is written in Haskell and the plugins in Lua. Both are languages I’ve only heard of and I couldn’t get it to work.

Also, the only way to check if the conversion was similar to the original was to build an ePub, open it in an ePub reader, wait for the pages to render and compare it with the original pages on Bart’s website. Rinse and repeat for each error found. A very tedious and time-consuming process.

I had a brief (say 1 hour) idea of using iBooks Author for the project, but quickly found that it didn’t support the links to external MP3 files (the podcast audio files) so that was a dead end. (Note, iBooks Author is also deprecated as of July 1, 2020.)

I looked around for a suitable example of a book that was written in Markdown and transformed into an ePub and came across the Pro Git book (github.com/progit/progit). This is a highly praised book on the use of git. They used Calibre (calibre-ebook.com) to convert the book to ePub. I tried that, but I couldn’t get it to work.

Q – If Markdown wasn’t going to work, how did you discover a different solution?

The README file of the Pro Git repository explained that I was looking at the archived version 1 of the book and that they had created a version 2 (github.com/progit/progit2). The README of the second version explained that they had switched from Markdown to AsciiDoc. I followed the link and learned that AsciiDoc was written in Python in 2002 (en.wikipedia.org/wiki/AsciiDoc). So it predates Markdown by 2 years (Markdown was written in 2004). In 2013 a new and extended implementation was written in Ruby and named Asciidoctor (github.com/asciidoctor/asciidoctor) (asciidoctor.org/docs/asciidoc-asciidoctor-diffs/).

According to Wikipedia, O’Reilly uses AsciiDoc or Asciidoctor, and the Pro Git book version 2 was also written in Asciidoctor, so it must be serious stuff.

Q – Can you explain what AsciiDoctor actually is, what is it like, is it like Markdown?

Asciidoctor, like Markdown, is a text-based lightweight markup language, but it was written to also support more technical documents and books. So it supports definition lists and has all kinds of macros (shorthand codes for often-used HTML) for things like keyboard codes and menu paths.

They have written their own extensive user manual in Asciidoctor, so they eat their own dog food. Asciidoctor supports variables, so I could hide the file server URL of the MP3 files in a variable. This allows me to quickly change the URL of the files should Allison decide that the MP3 files need to be relocated. Asciidoctor also supports if-statements, so I could have different content in different versions, like an embedded audio player in the regular ePub but not in the Apple Book Store version.

For me the decision was clear. I would convert the documents to AsciiDoctor. I just hoped that Bart would approve my decision.

Q – How did you convert the Markdown to AsciiDoctor?

The syntax of Markdown and Asciidoctor is quite similar. The major differences are the support for various blocks in Asciidoctor. A block holds any kind of text that should be offset from the main text such as source code blocks but also asides.

The Asciidoctor project supports a conversion tool called kramdoc (matthewsetter.com/technical-documentation/asciidoc/convert-markdown-to-asciidoc-with-kramdoc/) that can convert Kramdown (a flavor of Markdown) to Asciidoctor. So I used that to do the initial conversion and then went manually through the Asciidoctor files to fix all kinds of issues like the aside blocks, the code blocks, and the definition lists.

Q – You ran the pages through the free Grammarly tool to look for spelling errors (even though Allison was going to proofread)

Somewhere along the line I ran all the text through Grammarly. I can’t remember if it was the Markdown version or the Asciidoctor version, but each and every episode was copied to Grammarly and all the typos I was pretty sure about were fixed. The rest of them I left to Allison to proofread. 😀 She upped the task by also checking each and every command.

Allison suggested adding QR Codes for the link to the episodes so you could listen on your phone when your ebook reader doesn’t support the built-in audio player. I thought it would be a daunting task, but I quickly found a command-line utility in Node.js that could generate a QR Code, given the URL, the size and the color of the output image. I wrapped it in a shell script that ran over all the MP3 file links and I had the QR codes.

It took quite some time before I had the audio block working, because that’s the one with the most variations between the various output formats of the ebook. PDFs and Kindle books don’t support an embedded audio player, so I had to use various if statements to put it in or leave it out based on what format I was building.

Q – Explain the Spine

According to the ePub specifications, a spine is a document that defines all the ‘chapters’ of an ebook and the order in which they should appear. It’s a kind of ToC, but with all the metadata such as book title, author, which source-code highlighter to use, etc.
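As a rough picture of such a document, the Python sketch below generates a master AsciiDoc file with a header (title, author, highlighter setting) followed by one include directive per chapter. It assumes the spine is a plain AsciiDoc master file; the chapter file names and the function name are made up and are not taken from the actual repository.

```python
def build_master_document(title, author, chapters, highlighter="rouge"):
    """Return AsciiDoc text that pulls every chapter in, in order."""
    lines = [
        f"= {title}",                          # document title
        author,                                # author line, directly under the title
        f":source-highlighter: {highlighter}", # header attribute (metadata)
        "",                                    # blank line ends the header
    ]
    for chapter in chapters:
        lines.append(f"include::{chapter}[]")
        lines.append("")   # blank line keeps chapters from running together
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical chapter files, listed in reading order.
    print(build_master_document(
        "Taming the Terminal",
        "Bart Busschots",
        ["ttt01.adoc", "ttt02.adoc"],
    ))
```

Because the order of the include lines is the order of the chapters, reordering or dropping a chapter is a one-line change.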

When studying the Pro Git 2 repository I noticed they use a Rakefile (Ruby makefile) to build the various output formats of their book. That sounded like a good thing, because then I would not only have an ePub, but also a PDF and an HTML version of the book. At first I commented out the Kindle version. So copy, paste and modify to get my version working. I was treating the project as a software project, rather than a book publishing project.

This also allowed me to check my progress in the HTML version and then do a quick check in the PDF and ePub version. This works much faster.

Allison didn’t feel like installing all the bits and pieces I had installed just to be able to run the build script. So I decided that I needed an automated build. We live in different time zones and it would be slow and tedious if she had to wait for me to wake up and build a version she could use to check her commits.

So I was looking at GitHub Actions on how to do that. I found an example using Vale (github.com/errata-ai/vale), a linter that can do spell checking. A linter is a tool that can analyze source code for bugs, suspicious code and style errors. I thought that spell checking would be a good idea anyway, because it could help in the first round of typo spotting. So I created a workflow that installed Vale and had it run on every commit. And it worked!

So next up was creating a workflow that would be able to run the build script. So I got that working too. Such a workflow actually installs a new virtual machine like Ubuntu and follows the steps you’ve defined in the workflow like installing the various tools, running the build script and then uploading the output formats as a release, so it’s downloadable for everyone. Finally the virtual machine is thrown away.

A workflow file is written in YAML (https://en.wikipedia.org/wiki/YAML), which is a data-oriented language that, like Python, defines its structure through whitespace. YAML and JSON are very similar and can most of the time be converted from one into the other without data loss.
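To make the YAML and JSON relationship concrete, here is a small Python sketch that holds a workflow as nested data and writes it out as JSON, with the equivalent YAML shown in a comment. The workflow itself is illustrative only: the job name, the pinned checkout version, and the build command are assumptions, not the project's actual workflow.

```python
import json

# Equivalent YAML, where indentation expresses the nesting:
#
#   name: Build book
#   on: [push]
#   jobs:
#     build:
#       runs-on: ubuntu-latest
#       steps:
#         - uses: actions/checkout@v2
#         - name: Build all formats
#           run: bundle exec rake book:build   # hypothetical build command
#
workflow = {
    "name": "Build book",
    "on": ["push"],
    "jobs": {
        "build": {
            "runs-on": "ubuntu-latest",
            "steps": [
                {"uses": "actions/checkout@v2"},
                {"name": "Build all formats", "run": "bundle exec rake book:build"},
            ],
        }
    },
}

# Round trip: mappings, lists, and strings survive unchanged.
as_json = json.dumps(workflow, indent=2)
round_tripped = json.loads(as_json)
```

Mappings, lists, and strings map one-to-one between the two formats; YAML-only features such as comments and anchors are the kind of thing that gets lost in a conversion.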

When I had the build script running I wanted to fix the line numbers in the source code blocks. Bart used line numbers, so I wanted to do that too. But the implementation was a bit odd, so I tried to hack the code in Asciidoctor. Ahem, I added code to the Rouge highlighter to change the way the line numbers were generated. I almost had it working, even the unit tests worked, but I broke an important feature, so finally I decided to drop it and remove the line numbers from most of the longer source code blocks.

I found I could highlight lines of code, like with a yellow marker, so I went through all of Bart’s original pages and looked for any indication of highlighting and added the correct markup in the Asciidoctor version.

I wanted to fix the theme of the ebook, but that meant 3 different themes: one for the PDF, one for the ePub, and one for the HTML. I decided to start with the PDF. That would be the easiest to check. I got it to work for the headers, with the same font and color as Bart used, but I had to revert to the standard base font for the body text because all kinds of weird issues would arise.

The ePub was even harder to theme. Then Allison pointed out that most ePub readers allow the reader to select his or her own font and font size, so I gave up on that. And the HTML I never got round to styling.

Final Thoughts on the Asciidoctor project

Super helpful group. I’ve had earlier experiences with other software projects where I asked for help and was basically told ‘if you don’t understand the underlying technique, start learning that first’. Not these guys. They did kindly refer me to the user manual, which is VERY elaborate, but it didn’t have the answer to my problem. After I explained that I had already followed their advice, they chipped in and finally pointed me to the problem. The lead developer even went as far as offering to have a look at my code to see if there were other problems I might run into.

The lead developer actually checked out the code and advised me to update asciidoctor-pdf to a more recent version. The developer of asciidoctor-epub was very helpful when I asked when version alpha.16 would be released. I became a beta tester and today it was released.

I asked if an error message could be fixed and 2 hours later it was fixed in the next release. When the release was made, it was dedicated to me!

1 thought on “CCATP #649 – Dr. Helma van der Linden on Creating a Book with Open Source Software”

Claus - August 14, 2020

I really loved this episode, because it shows that building an eBook can be done, it can be done well and it can be automated to some extent by utilizing some of the skills of software development. But maybe most importantly, it doesn’t take a big publishing company to do it, but “just” a few very talented and dedicated people. So why is that so cool?

There is tremendous potential, if this knowledge was more readily applied by teachers and university lecturers. The Open Textbook initiative (see https://en.wikipedia.org/wiki/Open_textbook) might get many new books. Add a healthy dose of collaboration and it might make those books even better than they already are.

Thank you so much Helma and Allison for this podcast episode and for creating the book. Consider making this episode the “afterword” in the book, I’d think it’d be fitting!
