Adding Syntax Highlighters

Laika has its own built-in syntax highlighters (based on its parser combinators).

This manual itself is a showcase for this functionality. All code samples shown are highlighted by Laika's own syntax support.

This chapter explains the general design goals as well as the basic building blocks for adding your own highlighters.

Design Goals

Laika's syntax highlighting support does not aim to get close to full lexers and parsers for each language. Instead it favors a mostly declarative assembly of reusable building blocks, trading perfect accuracy in rare edge cases for simplicity and ease of extension.

Prerequisites

The content of this chapter builds on top of concepts introduced in other chapters.

First, even though Laika's highlighting support defines a lot of higher-level building blocks that allow for a mostly declarative assembly of the supported constructs of a language, some custom parsers may still be necessary, depending on the complexity of the target language. In that case it's recommended to use Laika's Parser Combinators.

Second, highlighters produce a list of CodeSpan nodes, which is one of the node types of Laika's document AST. Even though this is usually the only node type you deal with, it might help to get familiar with The Document AST first.
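To make this concrete, here is a minimal sketch of the kind of output a highlighter produces. The exact CodeSpan constructor overloads are an assumption and may vary by Laika version:

```scala
import laika.ast.CodeSpan
import laika.parse.code.CodeCategory

// A highlighted snippet is just a flat sequence of categorized spans.
// For the input `def foo` a highlighter might produce:
val spans: Seq[CodeSpan] = Seq(
  CodeSpan("def", CodeCategory.Keyword),
  CodeSpan(" "),                           // a span without a category renders as plain text
  CodeSpan("foo", CodeCategory.Identifier)
)
```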

The SyntaxHighlighter Trait

These are the two abstract methods a new highlighter needs to implement:

def language: NonEmptyList[String]

def spanParsers: Seq[CodeSpanParser]

Let's start by looking at how you can assemble these span parsers by using some of the building blocks for common language constructs.

We later look at how to add other constructs based on custom parsers.

Using Existing Building Blocks

The building blocks shown in this section are meant to be used as shortcuts. Whenever they do not provide the exact syntax option that you need, you can always fall back to Defining Custom Parsers.

Our code examples in this section define highlighters for a Scala-ish language, but omit many details for the sake of brevity. You can examine the source code of Laika's built-in highlighters for more complete examples.

String Literals

There are shortcuts for defining single- and multi-line strings together with escape sequences:

import laika.parse.code.CodeSpanParser
import laika.parse.code.common._

val spanParsers: Seq[CodeSpanParser] = Seq(
  StringLiteral.multiLine("\"\"\""),
  StringLiteral.singleLine('"').embed(
    StringLiteral.Escape.unicode ++ StringLiteral.Escape.char
  )
)

The multiLine and singleLine builders both require you to specify the delimiters. When the start and end delimiters are identical, you can use the single-argument overload.

For a single line string we also define valid escape sequences which will be used to detect nested spans. These are existing building blocks, too.

There are additional escape options like octal and hex, as well as building blocks to define substitutions, e.g. Scala's s"some $ref" syntax.
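For illustration, such an interpolated string literal could be assembled as sketched below. The names Escape.octal, Escape.hex and Substitution.between are assumptions based on the description above; check the builders available in your Laika version:

```scala
import laika.parse.code.CodeSpanParser
import laika.parse.code.common._

// Hypothetical sketch of an interpolated string literal:
// several kinds of escapes plus a ${...} substitution as embedded spans.
val interpolatedString: CodeSpanParser =
  StringLiteral.singleLine('"').embed(
    StringLiteral.Escape.unicode ++
      StringLiteral.Escape.char ++
      StringLiteral.Escape.octal ++
      StringLiteral.Escape.hex,
    StringLiteral.Substitution.between("${", "}")
  )
```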

Character Literals

Character literals have fewer options than string literals, but are otherwise quite similar:

import laika.parse.code.CodeSpanParser
import laika.parse.code.common._

val spanParsers: Seq[CodeSpanParser] = Seq(
  CharLiteral.standard.embed(
    StringLiteral.Escape.unicode ++ StringLiteral.Escape.char
  )
)

The standard builder uses single quotes as delimiters, but you can specify a different delimiter if required. As with our string literal example, we define two kinds of escape sequences that can occur in a character literal.

Numeric Literals

There are shortcuts for the most common types of decimal, hex, octal and binary number literals:

import laika.parse.code.CodeSpanParser
import laika.parse.code.common._

val spanParsers: Seq[CodeSpanParser] = Seq(
  NumberLiteral.hex
    .withUnderscores
    .withSuffix(NumberLiteral.suffix.long),
  NumberLiteral.decimalInt
    .withUnderscores
    .withSuffix(NumberLiteral.suffix.long | NumberLiteral.suffix.float)
)

Here we are defining two types of number literals: hexadecimal and decimal integers.

In both cases we allow the use of underscores as separators (e.g. 1_000_000).

We also define which kinds of type indicators we allow as a suffix, again using existing parsers for widely used syntax (NumberLiteral.suffix.long matches L or l, for example).

Identifiers

This is how the identifier category is defined for the Scala highlighter:

import laika.parse.code.common.Identifier

val identifier = Identifier.alphaNum
  .withIdStartChars('_','$')
  .withCategoryChooser(Identifier.upperCaseTypeName)

It expects any alphanumerical identifier that is also allowed to have _ or $ as a start character.

We then pass a function as a "category chooser" that assigns a category based on the parsed identifier. It is a very simple function that serves as a relatively useful approximation: it picks CodeCategory.TypeName when the identifier starts with an uppercase letter, otherwise it assigns CodeCategory.Identifier.

This is obviously not always correct, but good enough for most use cases. As discussed in the design goals, this highlighting utility does not aim to get close to defining full lexers and parsers, and assigning the category correctly in all cases would require full context of the parsed construct. Many highlighters you might be using with other tools are equally pragmatic.
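If the built-in chooser does not fit your language, you can pass your own function. The sketch below simply reimplements the logic described above; the custom function itself is an illustration, not part of Laika's API:

```scala
import laika.parse.code.CodeCategory
import laika.parse.code.common.Identifier

// Custom category chooser: identifiers starting with an
// uppercase letter are treated as type names.
val categoryChooser: String => CodeCategory = id =>
  if (id.headOption.exists(_.isUpper)) CodeCategory.TypeName
  else CodeCategory.Identifier

val identifierWithChooser = Identifier.alphaNum
  .withIdStartChars('_', '$')
  .withCategoryChooser(categoryChooser)
```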

Keywords

Keywords can simply be listed as string literals:

import laika.parse.code.common.Keywords

val keywords =
  Keywords("abstract", "case", "catch", "class", "def")

The parser detects word boundaries, so it will not match on sub-strings.

By default the parser will assign CodeCategory.Keyword to any matching string. If you want to use literal matches, but with a different code category, there is an alternative constructor:

import laika.parse.code.CodeCategory

Keywords(CodeCategory.BooleanLiteral)("true", "false")

Comments

Shortcuts exist for defining single- and multi-line comments:

import laika.parse.code.CodeSpanParser
import laika.parse.code.common.Comment

val spanParsers: Seq[CodeSpanParser] = Seq(
  Comment.singleLine("//"),
  Comment.multiLine("/*", "*/")
)

The code should be self-explanatory here: we simply define the delimiters for the comment syntax of our language. The single-line variant terminates at the end of the line.

Defining Custom Parsers

The basic building blocks presented in the previous section go a long way toward reducing boilerplate, but in most cases you will also need at least a handful of hand-written parsers.

For a little example, let's implement a highlighter for Scala's backtick identifiers (e.g. `tag-name`) which is quite straightforward, but not included in the reusable builders since it's not a very common construct:

import laika.parse.builders._
import laika.parse.code.CodeCategory

val backtickId: CodeSpanParser = CodeSpanParser(CodeCategory.Identifier) {
  (oneOf('`') ~ anyNot('\n', '`') ~ oneOf('`')).source
}

In the first line of the definition we specify the category that should be assigned to the result of the parser.

The next line is a standard parser definition using some of the building blocks of Laika's Parser Combinators. We simply expect a backtick as the start and end delimiter and allow anything in-between apart from newline or backtick characters.

The call to .source is a convenient shortcut that discards the results of the three parsers and instead provides the source consumed by those three parsers as the result. This is often preferable when all we would otherwise do with the three results is concatenate them back into one string.

The parser we just defined can then be registered in our spanParsers sequence like in all the other examples.

The entry point above is just one option. It's only appropriate if we provide a string parser and want the same category assigned to the entire string.

There is an alternative factory method, where we specify a PrefixedParser[Seq[CodeSpan]] where we have full control over how we slice the input into multiple categories.
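As a sketch of this second option, the following parser highlights an annotation like @deprecated as two differently categorized spans. The construct and CodeCategory.Annotation are assumptions for illustration; check the combinators and categories available in your Laika version:

```scala
import laika.ast.CodeSpan
import laika.parse.builders._
import laika.parse.code.{ CodeCategory, CodeSpanParser }
import laika.parse.text.PrefixedParser

// Sketch: slice `@name` into two spans with different categories.
val annotation: PrefixedParser[Seq[CodeSpan]] =
  (oneOf('@') ~ someNot(' ', '\n')).map { case at ~ name =>
    Seq(
      CodeSpan(at, CodeCategory.Annotation),
      CodeSpan(name, CodeCategory.Identifier)
    )
  }

// wrap it with the alternative CodeSpanParser factory:
val annotationParser: CodeSpanParser = CodeSpanParser(annotation)
```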

Embedding other Languages

In some cases a language or format may have regions that require a highlighter for a different syntax. A classic example is HTML which may have CSS syntax inside a <style> tag and JavaScript syntax inside a <script> tag.

Laika comes with a parser builder that helps with the creation of such a parser.

Let's pick CSS in HTML as an example and show a simplified definition for such a style tag:

import laika.ast.CodeSpan
import laika.parse.builders._
import laika.parse.code.CodeCategory
import laika.parse.code.common.EmbeddedCodeSpans
import laika.parse.code.languages.CSSSyntax
import laika.parse.text.PrefixedParser

val styleTagParser: PrefixedParser[Seq[CodeSpan]] = {
  val cat = CodeCategory.Tag.Name
  val bodyAndEndTag = 
    EmbeddedCodeSpans.parser(delimitedBy("</style>"), CSSSyntax)
    
  (literal("<style>") ~> bodyAndEndTag).map { css =>
    CodeSpan("<style>", cat) +: css :+ CodeSpan("</style>", cat)
  }
}

The key here is the definition of bodyAndEndTag. We define a region to parse with delimitedBy("</style>"), which consumes text until it sees the specified end delimiter (see Delimiters for details), and we specify the syntax to switch to until we reach that delimiter, in this case the existing CSSSyntax.

The result (the css param) is of type Seq[CodeSpan], containing the spans detected by the CSS highlighter. We then wrap the code spans for the start tag and end tag around it for the final result.

One of the simplifications to keep the example simple is that we assign the category CodeCategory.Tag.Name to the entire tag, including the angle brackets. Normally you would keep them separate under CodeCategory.Tag.Punctuation.

Parser Precedence

The precedence rules are identical to those for declaring markup parsers.

Parser precedence is determined by the order in which you specify the parsers. They will be "tried" on the input in that exact order: the second parser in the list is only invoked on a particular input when the first fails, and so on.

This describes the logical execution model only; the exact runtime behaviour may differ due to performance optimizations, but without breaking the guarantee that the order you specify is respected.

In many cases the syntax of different constructs is distinct enough that precedence does not matter. But in some cases extra care is needed.

If, for example, you provide a parser for a span delimited by one or two occurrences of the same character, the two-character version must be specified first, as otherwise the parser with the single-character delimiter would always shadow the double one and consume all matching input itself, unless it contains a guard against this.

Similarly, parsers like those for identifiers which do not have a limited set of start delimiters usually need to occur near the bottom of the list of parsers you register.
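Putting these rules together, a registration order for our Scala-ish language could look like the sketch below, reusing definitions from the previous sections (the exact set of parsers is illustrative):

```scala
import laika.parse.code.CodeSpanParser
import laika.parse.code.common._

val spanParsers: Seq[CodeSpanParser] = Seq(
  Comment.multiLine("/*", "*/"),
  Comment.singleLine("//"),
  StringLiteral.multiLine("\"\"\""),  // triple-quote delimiter must come before ...
  StringLiteral.singleLine('"'),      // ... the single-quote variant
  Keywords("abstract", "case", "catch", "class", "def"),
  NumberLiteral.hex,                  // 0x... must be tried before plain decimals
  NumberLiteral.decimalInt,
  Identifier.alphaNum                 // open-ended parsers go last
)
```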

Registering a Highlighter

First you have to assemble all the parsers in a SyntaxHighlighter implementation:

import cats.data.NonEmptyList
import laika.api.bundle.SyntaxHighlighter
import laika.parse.code.CodeSpanParser

object FooHighlighter extends SyntaxHighlighter {

  val language: NonEmptyList[String] = NonEmptyList.one("foo")
  
  val spanParsers: Seq[CodeSpanParser] = Seq(
    // all your definitions here
  )

}

Finally, like all other types of extensions, the highlighter needs to be registered with an ExtensionBundle:

import laika.api.bundle.{ ExtensionBundle, ParserBundle }

case object MyExtensions extends ExtensionBundle {
  
  val description: String = "Foo Syntax Highlighter"
  
  override def parsers: ParserBundle = new ParserBundle(
    syntaxHighlighters = Seq(
      FooHighlighter
    )
  )
}

You can bundle multiple highlighters in a single instance.

Finally, you can register your extension together with any built-in extensions you may use.

When using the sbt plugin:

import laika.format.Markdown

laikaExtensions := Seq(
  Markdown.GitHubFlavor,
  MyExtensions
)

When using the library API:

import laika.api._
import laika.format._

val transformer = Transformer
  .from(Markdown)
  .to(HTML)
  .using(Markdown.GitHubFlavor)
  .using(MyExtensions)
  .build