Monday, 1 April 2019

Website Updates, Part 1

I finished mapgen4 last week and wanted to "sharpen my tools" before I started the next project. My website is over 20 years old and spans 5 domains. The build process has grown over time, supporting all the different tools I've used over the decades.

There are a few features I've wanted to add to my static site generator:

  1. Versioning of js, css resources so that I can increase the expiration cache time. This should improve load time. I've been doing this manually and want to automate it.
  2. Dependency tracking, so that if something includes another thing, and that thing changes, the first thing gets rebuilt. This should reduce content mismatch errors.
  3. Custom macros for each project that get expanded at build time. This should make it easier to write new pages.
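
To make item 2 concrete, here is a minimal sketch of what dependency tracking could look like in Python. The file names and the `includes` mapping are hypothetical; a real build would have to record that mapping while processing each page.

```python
from collections import defaultdict, deque

def files_to_rebuild(includes, changed):
    """Return the changed files plus everything that includes them,
    directly or indirectly.

    `includes` maps each file to the set of files it includes
    (hypothetical data for this sketch).
    """
    # Invert the mapping: for each included file, which files include it?
    included_by = defaultdict(set)
    for page, deps in includes.items():
        for dep in deps:
            included_by[dep].add(page)

    # Breadth-first walk from the changed files up through their includers.
    dirty = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for page in included_by[current]:
            if page not in dirty:
                dirty.add(page)
                queue.append(page)
    return dirty

includes = {
    "index.html": {"header.xml", "footer.xml"},
    "article.html": {"header.xml", "diagram.js"},
    "header.xml": {"nav.xml"},
}
print(sorted(files_to_rebuild(includes, ["nav.xml"])))
```

Changing `nav.xml` marks `header.xml` as dirty, and through it both pages that include the header.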

The current system is mostly based on XHTML with my own custom tags. For example, <x:published> lets me mark a publish date at the top of the page, <x:draft> lets me mark the page as a draft/beta, and <x:nocomments> lets me turn off comments on a page. I use regular expressions and XSLT for transforming the XHTML into HTML, expanding the custom tags and also adding custom transforms for existing tags. I'd like to add project-specific tags to this.

I wanted to replace the sed + xsltproc bash script with a Python program. Once I have the XHTML read into Python, I can add versioning, record dependencies, and add custom macros. In Python, I can also manage the site build, including generating sitemaps for search engines, determining what to rebuild, and uploading to the server.

I decided to start with the code for processing a single XHTML file:

  1. A bunch of sed commands to perform regular expression substitutions.
  2. SmartyPants.pl for converting regular quotes to "smart quotes".
  3. The xsltproc command.

What are the Python alternatives to each of these?

(1) The regular expression substitutions are easy to do in Python.
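
As a rough sketch, a chain of sed substitutions can be expressed as a list of compiled patterns applied in order. The two rules below are made up for illustration (including the expansion of `<x:nocomments>`); they are not the actual sed commands from the build script.

```python
import re

# Hypothetical (pattern, replacement) pairs, each standing in for one
# sed "s/pattern/replacement/g" expression.
RULES = [
    (re.compile(r"[ \t]+$", re.MULTILINE), ""),                    # strip trailing whitespace
    (re.compile(r"<x:nocomments\s*/>"), "<!-- comments off -->"),  # made-up expansion
]

def apply_rules(text, rules=RULES):
    """Apply each substitution in order, like a chain of sed -e expressions."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("<p>Hello</p>   \n<x:nocomments />\n"))
```

Compiling the patterns once keeps the per-file cost low when the same rules run over many pages.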

(2) For SmartyPants.pl, I evaluated the Python port, smartypants.py, which would allow me to run this step in memory instead of invoking a separate process. I like the library, and it offers many useful options that would be an improvement over the Perl script, including being able to exclude my custom tags. Unfortunately it doesn't produce the same output. Two examples:

  • instead of <tag>Name</tag>'s (close) it produces <tag>Name</tag>'s (open)
  • instead of "<tag>...</tag>" (open,close) it produces "<tag>...</tag>" (open,open)

When the character before the quote crosses a tag, it seems to not take that into account when deciding open or close quotes, and it ends up outputting an open quote when it should output a close quote. I looked through the Perl and Python code and couldn't figure out how the Perl code was handling these cases.

I decided to look through existing issues on the project page, and file one if there weren't any existing ones matching this situation. Unfortunately the repository has been deleted. The documentation says the maintainer is looking for someone to take over. So it seems like the project is abandoned.

I also started wondering if I could reimplement the algorithm myself. Before going down that rabbit hole, I decided to abandon this change. This means I'll have to pipe everything through an external process, and I won't be able to handle custom tags.
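
Piping text through an external filter from Python might look like the sketch below. The `pipe_through` helper is a hypothetical name, and the filter used here is a stand-in (a tiny Python one-liner) so the example runs anywhere; a real build would pass the command for the Perl script instead.

```python
import subprocess
import sys

def pipe_through(command, text):
    """Send text to an external filter on stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the filter exits non-zero
    )
    return result.stdout

# Stand-in filter that uppercases its input. A real build would use
# something like ["perl", "SmartyPants.pl"] (invocation and path assumed).
uppercase_filter = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
print(pipe_through(uppercase_filter, "hello"))
```

Each call starts a new process, which is the per-file overhead that an in-memory library would have avoided.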

(3) The next step was to look at an alternative to xsltproc. Python has a library called lxml that can do lots of nice things, including XSLT transforms. That means it should be able to take my existing XSLT file and apply it to the XHTML documents, producing HTML documents. But can it?

Yes! It worked beautifully. And it gave me back a parse tree which would allow me to easily add versioning and dependency tracking, the main features I wanted to add. Here's how I ran it:

    import sys
    from lxml import etree

    # Load the XSLT, and allow it to use xi:include
    xsl = etree.parse(open(xslt_filename, 'r'))
    xsl.xinclude()

    # Load the document, and allow it to use xi:include
    bxml = etree.parse(open(document_filename, 'r'))
    bxml.xinclude()

    # Apply the XSLT to the document
    html = etree.XSLT(xsl)(bxml)
    sys.stdout.write(str(html))

The output was exactly the same as using xsltproc. I'm very happy with lxml! It's easy to use and allows me to add the features I wanted.
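
To illustrate why having a parse tree helps, here is a sketch of resource versioning. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra packages; lxml provides a largely compatible element API. The `?v=` query-string scheme, the 8-character hash, and the `read_resource` callback are all assumptions for this example, and a namespaced XHTML tree would need qualified tag names.

```python
import hashlib
import xml.etree.ElementTree as ET

def add_versions(root, read_resource):
    """Append ?v=<content hash> to script src and link href attributes.

    `read_resource(url)` is a hypothetical callback returning the
    resource's bytes.
    """
    targets = [("script", "src"), ("link", "href")]
    for tag, attr in targets:
        for element in root.iter(tag):
            url = element.get(attr)
            if not url or "//" in url:
                continue  # skip missing attributes and external URLs
            digest = hashlib.sha256(read_resource(url)).hexdigest()[:8]
            element.set(attr, f"{url}?v={digest}")
    return root

page = ET.fromstring(
    '<html><head><link rel="stylesheet" href="style.css"/>'
    '<script src="app.js"></script></head><body/></html>'
)
fake_files = {"style.css": b"body {}", "app.js": b"console.log(1)"}
add_versions(page, fake_files.__getitem__)
print(ET.tostring(page, encoding="unicode"))
```

Because the version comes from the file contents, the URL changes only when the resource does, which is what makes a long cache expiration safe.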

Except …

Switching from sed + xsltproc slowed down the builds quite a bit, from 0.2s to 0.7s for a single file. That means it no longer feels instantaneous. The full build went from 65s to 200s. I'm sure I can speed up the full build, but I don't have an easy way to speed up the development builds. The startup is too long, and I didn't find any easy ways to fix that.

So I decided to abandon this change.

It's always hard to abandon things you spent time and energy designing and implementing. But it's sometimes the best thing to do.

I stepped back and looked at my main goals:

  1. Versioning of js, css resources
  2. Dependency tracking
  3. Custom macros for each project

How important are these? Can I implement them a different way, while keeping the development builds fast?

  1. I don't need versioning for development builds.
  2. I might not need dependency tracking for development builds.
  3. I haven't needed custom macros so far and can live without them.

My new plan is to keep the sed + xsltproc for development builds, and inject version numbers for the production build only. I'm undecided about how I'll implement dependency tracking. In part 2 I'll describe how I implemented these. I will leave custom macros for another time.