New Markup or Output Formats

All the other chapters in the "Extending Laika" section of the manual deal with a customization option that at some point becomes part of an ExtensionBundle. As Anatomy of the API showed, the ExtensionBundle is one of the major API hooks for building a transformer:

[Diagram: Anatomy of the API]

While an ExtensionBundle is about enhancing functionality for existing input and output formats, this chapter is about adding new formats. It therefore deals with the other two types in the diagram: MarkupFormat and RenderFormat[FMT].

Implementing a Markup Format

Laika currently supports two markup formats out of the box: Markdown and reStructuredText. Having two formats from the beginning greatly helped in shaping a document model that is not tied to the specifics of a particular text markup language.

Making it as straightforward as possible to add support for more formats like AsciiDoc or Textile was one of Laika's initial design goals.

Parser Prerequisites

The content of this chapter builds on concepts introduced in other chapters, so it is recommended to read those first.

First, all parsers build on top of Laika's Parser Combinators. Having its own implementation helps with keeping all functionality tightly integrated and adding some optimizations useful for the specific use case of markup parsing right into the base parsers.

Second, since block and span parsers that together provide the markup implementation all produce AST nodes, it might help to get familiar with The Document AST first.

Finally, Writing Parser Extensions covers a lot of ground relevant for adding a new format, too. It walks through the sample implementation of a span parser.

The main difference is merely that a block or span parser serving as an extension for existing markup languages will be registered with an ExtensionBundle, while the parsers for a new format need to be registered in a MarkupFormat.

The MarkupFormat Trait

The contract a markup implementation has to adhere to is captured in the following trait:

import laika.api.bundle.{ BlockParserBuilder, ExtensionBundle, SpanParserBuilder }
import laika.api.format.MarkupFormat.MarkupParsers

trait MarkupFormat {

  def fileSuffixes: Set[String]

  def blockParsers: MarkupParsers[BlockParserBuilder]
  
  def spanParsers: MarkupParsers[SpanParserBuilder]

  def extensions: Seq[ExtensionBundle]
  
}

These are the four abstract methods each markup format has to implement.

Finally, there are three concrete methods that may be overridden if required:

def description: String = toString

def escapedChar: Parser[String] = TextParsers.oneChar

def createBlockListParser (parser: Parser[Block]): Parser[Seq[Block]] = 
  (parser <~ opt(blankLines)).rep

Parser Precedence

Parser precedence is determined by the order in which you specify the parsers. This means they will be "tried" on input in that exact order. The second parser in the list will only be invoked on a particular input when the first fails, and so on.

This is the logical execution model only. The exact runtime behaviour may differ due to performance optimizations, but never in a way that breaks the guarantee of respecting the order you specify.

Normally the syntax of different markup constructs differs enough that the precedence does not matter. But in some cases extra care is needed. If, for example, you provide parsers for spans between double asterisks ** and between single asterisks *, the former must be specified first. Otherwise the single asterisk parser would completely shadow the double one and consume all matching input itself, unless it contains a guard against it.
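The effect of ordering can be illustrated with a minimal, self-contained sketch. It deliberately does not use Laika's parser combinators: the SpanParser type alias and the delimited and firstOf helpers are hypothetical stand-ins that only model the "try in order, first success wins" behaviour described above.

```scala
// Toy model, NOT Laika's parser combinators: a span parser takes the
// remaining input and, on success, returns the rendered span plus the rest.
object PrecedenceSketch {

  type SpanParser = String => Option[(String, String)]

  // Hypothetical helper: parses text enclosed in `delim` and wraps it in <tag>.
  def delimited(delim: String, tag: String): SpanParser = input =>
    if (!input.startsWith(delim)) None
    else {
      val body = input.drop(delim.length)
      val end  = body.indexOf(delim)
      if (end < 0) None
      else Some((s"<$tag>${body.take(end)}</$tag>", body.drop(end + delim.length)))
    }

  val strong: SpanParser = delimited("**", "strong")
  val em: SpanParser     = delimited("*", "em")

  // Ordered choice: tries each parser in the given order, first success wins.
  def firstOf(parsers: Seq[SpanParser]): SpanParser = input =>
    parsers.iterator.map(_(input)).collectFirst { case Some(res) => res }

  val correctOrder: SpanParser = firstOf(Seq(strong, em)) // double asterisks first
  val wrongOrder: SpanParser   = firstOf(Seq(em, strong)) // single asterisk first
}
```

In this sketch, correctOrder applied to "**bold** rest" yields the strong span and leaves " rest" as remaining input. With wrongOrder the single asterisk parser succeeds first, producing an empty em span and leaving "bold** rest", so the double asterisk parser never gets a chance to run.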

Implementing a Render Format

Laika currently supports several output formats out of the box: HTML, EPUB, PDF, XSL-FO and AST. XSL-FO mostly serves as an interim format for PDF output, but can also be used as the final target format. AST is a renderer that provides a formatted output of the document AST for debugging purposes.

Making it as straightforward as possible to add support for additional output formats, potentially as a 3rd-party library, was one of Laika's initial design goals.

Renderer Prerequisites

The content of this chapter builds on concepts introduced in other chapters, so it is recommended to read those first.

First, since renderers have to pattern match on the AST nodes the engine passes to them, it might help to get familiar with The Document AST first.

Second, Overriding Renderers shows examples for how to override the renderer of a particular output format for one or more specific AST node types only.

The main difference is that a renderer serving as an override for existing output formats will be registered with an ExtensionBundle, while the renderers for a new format need to be registered in a RenderFormat.

The second difference is that a RenderFormat naturally has to deal with all potential AST nodes, not just a subset like an override.

The RenderFormat Trait

A renderer has to implement the following trait:

import laika.ast.Element
import laika.api.format.Formatter

trait RenderFormat[FMT] {
  
  def fileSuffix: String
  
  def defaultRenderer: (FMT, Element) => String
  
  def formatterFactory: Formatter.Context[FMT] => FMT

}

The Render Function

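The defaultRenderer function should usually adhere to a few rules. Most importantly, it should render only the element it is given and delegate the rendering of child elements to the formatter, and it should provide a fallback for element types it does not know.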

Let's look at a minimal excerpt of a hypothetical HTML render function:

import laika.ast._
import laika.api.format.TagFormatter

def renderElement (fmt: TagFormatter, elem: Element): String = {

  elem match {
    case p: Paragraph => fmt.element("p", p)
    
    case e: Emphasized => fmt.element("em", e)
    
    /* [other cases ...] */
    
    /* [fallbacks for unknown elements] */
  }   
}

As you see, the function never deals with children (the content attribute of many node types) directly. Instead it passes them to the Formatter API which delegates to the composed render function. This way user-specified render overrides can kick in on every step of the recursion.

In the context of HTML this means that in most cases your implementation renders only one tag before delegating. In the example above those are the <p> and <em> tags respectively.
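Why this delegation matters can be shown with a small, self-contained sketch. It uses stand-in types rather than Laika's AST or Formatter API: ToyFormatter, the three element classes and the render helper are all hypothetical and exist only for illustration. The assumption modelled here is that the formatter holds the composed render function (user overrides first, default renderer as fallback) and uses it for every child.

```scala
// Toy model with stand-in types, NOT Laika's AST or Formatter API.
object RenderSketch {

  sealed trait Element
  case class Text(content: String)             extends Element
  case class Emphasized(content: Seq[Element]) extends Element
  case class Paragraph(content: Seq[Element])  extends Element

  // Stand-in formatter: holds the composed render function and uses it for all children.
  class ToyFormatter(renderChild: (ToyFormatter, Element) => String) {
    def children(elems: Seq[Element]): String = elems.map(renderChild(this, _)).mkString
    def element(tag: String, elems: Seq[Element]): String = s"<$tag>${children(elems)}</$tag>"
  }

  type Override = PartialFunction[(ToyFormatter, Element), String]

  // Default renderer: one tag per node, children are delegated to the formatter.
  def defaultRenderer(fmt: ToyFormatter, elem: Element): String = elem match {
    case Paragraph(content)  => fmt.element("p", content)
    case Emphasized(content) => fmt.element("em", content)
    case Text(content)       => content
  }

  // Composes user overrides with the default renderer; overrides are tried first.
  def render(root: Element, overrides: Override): String = {
    val composed: (ToyFormatter, Element) => String =
      (fmt, elem) => overrides.lift((fmt, elem)).getOrElse(defaultRenderer(fmt, elem))
    composed(new ToyFormatter(composed), root)
  }

  val doc: Element = Paragraph(Seq(Text("Hello "), Emphasized(Seq(Text("world")))))

  // A user override for a single node type; it never touches Paragraph.
  val italics: Override = {
    case (fmt, Emphasized(content)) => fmt.element("i", content)
  }
}
```

Rendering doc without overrides gives <p>Hello <em>world</em></p>. With the italics override the result is <p>Hello <i>world</i></p>: the override applies to the nested node even though the paragraph is still rendered by the default function, because the paragraph case passes its children back through the formatter instead of rendering them itself.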

Choosing a Formatter API

Depending on the target format your renderer may use the Formatter or TagFormatter APIs, which are explained in The Formatter APIs.

Alternatively, it may create its own API, but keep in mind that this API will also be used by users overriding renderers for specific nodes. It should therefore be convenient and straightforward to use, and well documented (e.g. full scaladoc).

Even when creating your own formatter it's probably most convenient to at least extend Formatter, which contains base logic for indentation and delegating to child renderers.

Costs of Avoiding Side Effects

As you have seen, the render function returns a String value, which the engine will then build up recursively to represent the final output. It can therefore be implemented as a pure, fully referentially transparent function.

Earlier Laika releases had a different API which was side-effecting and returned Unit. Renderers wrote directly to the output stream, only hidden behind a generic, side-effecting delegate API.
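The difference between the two styles can be sketched with a toy example. None of the names below come from any Laika release: Node, Tagged, renderTo and render are hypothetical, and the StringBuilder merely stands in for an output stream.

```scala
// Toy model with stand-in types, NOT the API of any Laika release.
object PuritySketch {

  sealed trait Node
  case class Text(content: String)                   extends Node
  case class Tagged(tag: String, content: Seq[Node]) extends Node

  // Side-effecting style: writes to a shared sink and returns Unit.
  def renderTo(out: StringBuilder, node: Node): Unit = node match {
    case Text(content) =>
      out.append(content)
      ()
    case Tagged(tag, content) =>
      out.append(s"<$tag>")
      content.foreach(renderTo(out, _))
      out.append(s"</$tag>")
      ()
  }

  // Pure style: returns the rendered String, results are combined by the caller.
  def render(node: Node): String = node match {
    case Text(content)        => content
    case Tagged(tag, content) => s"<$tag>${content.map(render).mkString}</$tag>"
  }

  val doc: Node = Tagged("p", Seq(Text("a"), Tagged("em", Seq(Text("b")))))
}
```

Both functions produce the same markup for doc, but only the pure variant can be called repeatedly and tested by comparing return values, without setting up and inspecting a mutable sink.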

Version 0.12 in 2019 then introduced full referential transparency and one of the necessary changes was the change of the render function signature. These changes (taken together, not specifically that for the Render API) caused a performance drop of roughly 10%. It felt reasonable to accept this cost given how much it cleaned up the API and how it lifted the library to meet expectations of developers who prefer a purely functional programming style.

The decent performance of Laika stems mostly from a few radical optimizations on the parser side, which led to much better performance compared to some older combinator-based Markdown parsers.

The alternative would have been to build on top of a functional streaming API like fs2, as this might have preserved both the old performance characteristics and full referential transparency. But it would still have complicated the API and introduced a dependency that is not needed anywhere else in the laika-core module, which does not even require cats-effect.