Language Feature

Metaprogramming in OCaml: Extension Points and PPX

Illustrated by Julia Hanke

Metaprogramming is a technique in which programs modify themselves at compile time or at runtime. This can be achieved in various ways, for example through reflection (C#, Java, Ruby, Smalltalk), templates (C++, D, Template Haskell), or macros (Lisp, Elixir).

The OCaml language offers a distinct approach to compile-time code generation in the form of extension points and AST rewriters, which we'll explore in this article.

Abstract and Concrete Syntax Trees

In computer science, the term "abstract syntax tree" refers to a tree representation of a program's structure. In these trees, nodes are language constructs, like function invocations or operators, with leaf nodes generally being either variables, constants, or literals. As an example, a possible AST for the expression 1 + 2 * 3 could look like this:

Abstract Syntax Tree representation
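To make the idea concrete in OCaml, here is a minimal sketch of such a tree as a variant type, together with a small evaluator. The type expr, the value example, and the function eval are illustrative names chosen for this sketch; they are not part of any library or of the compiler's actual parse tree.

```ocaml
(* A toy AST for arithmetic expressions; all names are illustrative only. *)
type expr =
  | Int of int          (* leaf node: an integer literal *)
  | Add of expr * expr  (* inner node: addition *)
  | Mul of expr * expr  (* inner node: multiplication *)

(* 1 + 2 * 3: multiplication binds tighter, so it sits deeper in the tree. *)
let example = Add (Int 1, Mul (Int 2, Int 3))

(* Evaluate a tree by recursing into its children. *)
let rec eval = function
  | Int n -> n
  | Add (l, r) -> eval l + eval r
  | Mul (l, r) -> eval l * eval r

let () = Printf.printf "%d\n" (eval example)
```

Note that the tree has no node for parentheses or whitespace: operator precedence is captured purely by the nesting.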

These tree structures are called abstract because they may not contain every detail of the program's code. For example, comments would generally not be part of the AST. This is in contrast to concrete syntax trees, which are more commonly called "parse trees." These are usually built by a programming language's parser during program interpretation/compilation and include additional syntactic detail about the different nodes:

Concrete syntax tree representation

Note that in common usage the term "AST" is often loosely used to include both abstract and concrete syntax trees, and I'll be using it much the same way throughout the rest of this text.

Why are programs turned into tree data structures in the first place? One important reason is that they can easily be programmatically modified. This can happen during compilation for optimization or other purposes. Some languages also expose this functionality to developers, most famously Lisp in the form of macros. Macros can be thought of as similar to templates, which transform parts of an AST into a different form.

For example, the macro when functions like an if without an else clause:

(when (> 1 2) (format t "~s" "true"))

Such a macro could be defined in the following way:

(defmacro when (condition &rest body)
  `(if ,condition (progn ,@body)))

When the compiler encounters this macro, it will replace it with the content of the macro definition, a process known as "expansion":

(macroexpand '(when (> 1 2) (format t "~s" "true")))
; (IF (> 1 2) (PROGN (FORMAT T "~s" "true")))

Some constructs need to be "unquoted" (interpolated into the resulting code), which is achieved by prefixing them with a comma, like ,condition and ,@body. The @ sign "splices" a list into the enclosing list, like ... in JavaScript arrays.

It's important to note that it would not be possible to write when as a function since all the arguments would be evaluated and "true" would be printed before the condition gets checked.
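The same limitation can be observed in OCaml, which also evaluates function arguments before the call. The following sketch uses two hypothetical helpers, when_eager and when_lazy (when itself is a reserved word in OCaml), and records side effects in a buffer so the difference is visible.

```ocaml
(* Hypothetical helpers showing why a plain function cannot replace the macro. *)
let log = Buffer.create 16

(* The body is an ordinary argument, so it has already run before we get here. *)
let when_eager (_condition : bool) (_body : unit) = ()

(* Wrapping the body in a function (a thunk) delays it until we call it. *)
let when_lazy condition body = if condition then body ()

let () =
  when_eager (1 > 2) (Buffer.add_string log "eager;");
  when_lazy (1 > 2) (fun () -> Buffer.add_string log "lazy;");
  (* Only the eager version produced a side effect. *)
  print_endline (Buffer.contents log)
```

A function can only get the macro's behavior if every caller wraps the body in a thunk by hand, which is exactly the boilerplate a macro or syntax extension removes.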

OCaml Extension Points and PPX

While OCaml doesn't have macros, it offers a different code rewriting mechanism in the form of extension points and AST rewriters commonly referred to as PPX (short for PreProcessor eXtensions). These were introduced in version 4.02 of the language, and they replaced the older and more complex camlp4 as the extension API.

Attributes and Extension Nodes

The extension point API introduced a notation for attributes and extension nodes, which allow tools and libraries like PPX rewriters to modify the parse tree of an OCaml program.

Attributes are ignored by the type checker, so if there's no extension to handle them, they'll be silently discarded. They are generally used to add new AST nodes and are prefixed by an @ character, for example, @attribute.

let x = 1 [@attr]

Extension nodes, on the other hand, are like placeholders in the source code and are meant to be "expanded" (like in our Lisp macro example) by a PPX rewriter, replacing the existing AST nodes in the process. The type checker will raise an error if it encounters an unexpanded node. They are prefixed by a % character (%extension).

let y = [%ext]

Both attributes and extension nodes can receive optional payloads, which can be thought of as arguments to the extension function:

type t = | A [@id 1] | B [@id 2]
let user = [%getenv "USER"]
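As a quick check of the difference between the two, the sketch below should compile with a plain OCaml compiler and no PPX at all, because the attributes are simply ignored. The attribute names attr and id and the function describe are made up for illustration.

```ocaml
(* Attributes without a matching rewriter are ignored by the compiler. *)
let x = 1 [@attr]

type t =
  | A [@id 1]
  | B [@id 2]

let describe = function
  | A -> "A"
  | B -> "B"

let () = Printf.printf "%d %s %s\n" x (describe A) (describe B)

(* By contrast, an unexpanded extension node such as
     let y = [%ext]
   is rejected by the compiler, so it is left out of this sketch. *)
```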

Payloads

There are three forms of payloads representing various parts of OCaml's object and type languages. A space character denotes a payload that's a module item, and a colon is used for type expressions or specifications. Lastly, a question mark denotes a payload that is a pattern. Both attributes and extension nodes support all three payload types:

let x = [%ext "module item"]
let y = [%ext: "type expression or specification"]
let z = [%ext? "pattern"]

Alternative Forms

While @attr and %ext are used for attaching to expressions, module expressions, class expressions, and everything else OCaml's documentation refers to as "algebraic categories," there are alternative @@attr (attached to "blocks" such as type declarations and class fields) and %%ext (used as items in structures and signatures) forms. Additionally, attributes support a stand-alone form (@@@attr), which is not attached to any specific AST node.

Last, but certainly not least, attributes and extension nodes offer an infix syntax. While this is pretty rare for the former, it's heavily used for the latter:

let%ext x = 1 in ...
(* [%ext let x = 1 in ...] *)

match%ext x with ...
(* [%ext match x with ...] *)

try%ext f x with _ -> ...
(* [%ext try f x with _ -> ...] *)

Writing a PPX

Now that we know what an OCaml PPX can do, it's time to write our own. But before we do that, let's get a feeling for what we are dealing with. To see the AST of OCaml expressions, we have several options. One is starting utop with the -dparsetree flag, which will print the AST of each expression evaluated at the toplevel:

utop # 1 + 1;;

Ptop_def
  [
    structure_item (//toplevel//[1,0+0]..[1,0+3])
      Pstr_eval
      expression (//toplevel//[1,0+0]..[1,0+3])
        Pexp_apply
        expression (//toplevel//[1,0+1]..[1,0+2])
          Pexp_ident "+" (//toplevel//[1,0+1]..[1,0+2])
        [
          <arg>
          Nolabel
            expression (//toplevel//[1,0+0]..[1,0+1])
              Pexp_constant PConst_int (1,None)
          <arg>
          Nolabel
            expression (//toplevel//[1,0+2]..[1,0+3])
              Pexp_constant PConst_int (1,None)
        ]
  ]

- : int = 2

Another way to inspect OCaml parse trees is the ppx_tools package, which lets us dump the AST of an expression with ocamlfind ppx_tools/dumpast -e '1 + 1':

1 + 1
==>
{pexp_desc =
  Pexp_apply
   ({pexp_desc = Pexp_ident {txt = Lident "+"}; pexp_loc_stack = []},
   [(Nolabel,
     {pexp_desc = Pexp_constant (Pconst_integer ("1", None));
      pexp_loc_stack = []});
    (Nolabel,
     {pexp_desc = Pexp_constant (Pconst_integer ("1", None));
      pexp_loc_stack = []})]);
 pexp_loc_stack = []}
=========

dumpast can also be used to display the parse tree of a whole file. For example, if we have a file called test.ml with the following content,

let x = 1 in x + 1

we can inspect the AST with ocamlfind ppx_tools/dumpast test.ml:

test.ml
==>
[{pstr_desc =
   Pstr_eval
    ({pexp_desc =
       Pexp_let (Nonrecursive,
        [{pvb_pat = {ppat_desc = Ppat_var {txt = "x"}; ppat_loc_stack = []};
          pvb_expr =
           {pexp_desc = Pexp_constant (Pconst_integer ("1", None));
            pexp_loc_stack = []}}],
        {pexp_desc =
          Pexp_apply
           ({pexp_desc = Pexp_ident {txt = Lident "+"}; pexp_loc_stack = []},
           [(Nolabel,
             {pexp_desc = Pexp_ident {txt = Lident "x"}; pexp_loc_stack = []});
            (Nolabel,
             {pexp_desc = Pexp_constant (Pconst_integer ("1", None));
              pexp_loc_stack = []})]);
         pexp_loc_stack = []});
      pexp_loc_stack = []},
    ...)}]
=========

A Basic PPX

It's time to write our own small PPX! We will use ppxlib for this, a project that merged several older PPX libraries into a comprehensive new one. The goal is to write a minimal extension called "uc" (for "uppercase"), which will replace a string by its uppercase version. It will be used as follows:

let reading = [%uc "human readable"] in ()

which will get expanded to the following code:

let reading = "HUMAN READABLE" in ()

While this is not terribly useful, it's enough to get a basic understanding of how ppxlib works, and we'll build on this knowledge in the next example.

Let's look at our extension's full code first before discussing it in more detail:

open Ppxlib

(* 2 *)
let name = "uc"

(* 4 *)
let expand ~loc ~path:_ s =
  let uc = String.uppercase_ascii s in
  (* 5 *)
  [%expr [%e Ast_builder.Default.estring uc ~loc]]
;;

(* 3 *)
let ext =
  Extension.declare
    name
    Extension.Context.expression
    Ast_pattern.(single_expr_payload (estring __))
    expand
;;

(* 1 *)
let () = Driver.register_transformation name ~extensions:[ ext ]

This code is easier to understand when reading it "backward." In (1) we register our transformations with ppxlib's driver. For this, we need a name (here "uc", defined at (2)) as well as a list of extensions, which in our case consists of the single element ext. This extension is defined in (3) with the Extension.declare function, which takes four arguments:

  1. A name ("uc", again provided by the value defined at (2)).
  2. A context of type Ppxlib.Extension.Context.t, which defines what type of AST node we will be replacing. In our example, that's Ppxlib.Extension.Context.expression, as we'll be replacing an expression (i.e., the right-hand side of a let statement). Other possible options include Pattern for pattern match clauses or Module_expr for module expressions.
  3. An AST pattern of type Ppxlib.Ast_pattern.t, which specifies what type of AST node we want to replace and what values we want to extract from it. Here we specify that we're going to replace a single expression (single_expr_payload), which consists of a string (estring). The __ is used as a placeholder for the captured value.
  4. An expander function, which is the heart of our extension and returns the replacement AST node.

In this example, the expander function is called expand and defined at (4). This function is called with several arguments: ~loc is the location of our expansion point (a Ppxlib.Location.t) and ~path contains the full path to the expanded node, including modules (for example test.ml.TestModule). Last, but not least, we will receive the extracted string (the __ from the Ast_pattern discussed earlier) in an argument named s.

The function itself is easy: we convert the string to uppercase with String.uppercase_ascii and then return a new AST node. Constructing these nodes manually is both tedious and error-prone, so we use ppxlib's metaquot plugin, which provides an easier way to write AST fragments as source code. Here %expr signifies that we're trying to generate an expression node and %e is what metaquot calls "anti-quotation," which allows us to include dynamically generated values in our AST nodes. So the snippet

[%e Ast_builder.Default.estring uc ~loc]

essentially says "create a string node with the content of the variable uc and insert it at the location specified by ~loc."
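To see the overall shape of such a rewrite without installing ppxlib, here is a toy model that works on a hand-rolled tree instead of the real parse tree. The type expr and the function expand_uc are invented for this sketch and do not correspond to ppxlib's API; the real driver also takes care of traversal, locations, and error reporting.

```ocaml
(* A toy model of extension expansion; none of these names come from ppxlib. *)
type expr =
  | Str of string                (* a string literal *)
  | Extension of string * expr   (* [%name payload] *)
  | Let of string * expr * expr  (* let name = e1 in e2 *)
  | Unit                         (* () *)

(* Replace every [%uc "..."] node and leave everything else unchanged. *)
let rec expand_uc = function
  | Extension ("uc", Str s) -> Str (String.uppercase_ascii s)
  | Extension (name, payload) -> Extension (name, expand_uc payload)
  | Let (name, bound, body) -> Let (name, expand_uc bound, expand_uc body)
  | (Str _ | Unit) as e -> e

(* let reading = [%uc "human readable"] in () *)
let before = Let ("reading", Extension ("uc", Str "human readable"), Unit)

let after = expand_uc before

let () =
  match after with
  | Let (_, Str s, Unit) -> print_endline s
  | _ -> print_endline "unexpected shape"
```

The first match case plays the role of both the AST pattern (a uc node with a single string payload) and the expander (build the replacement string node).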

To build our new PPX we need to include the ppxlib library dependency and preprocess our source code with metaquot. This can be achieved with the following Dune stanza:

(library
 (name ppx_uc)
 (kind ppx_rewriter)
 (libraries ppxlib)
 (preprocess
  (pps ppxlib.metaquot)))

In programs that want to use our new PPX, we will need to include it (and ppxlib) as dependencies and register it as a preprocessor:

(executable
  (name uc_test)
  (libraries ppx_uc ppxlib)
  (preprocess
    (pps ppx_uc)))

A More Interesting PPX

While the previous example wasn't particularly exciting, it's not much more difficult to make something at least slightly useful. Our next PPX, called env, will receive the name of an environment variable and a default value as a tuple and then expand to the variable's value, or to the default if the variable is unset:

let read = [%env "READ", "Human Readable"] in ()
(* let read = "Human Readable" in () *)
let editor = [%env "EDITOR", "nano"] in ()
(* let editor = "vim" in () *)

The source code should look familiar at this point. Take a moment to see if you can figure out how this works:

open Ppxlib

let name = "env"

let expand ~loc ~path:_ var default =
  let value =
    (* 2 *)
    match Stdlib.Sys.getenv_opt var with
    | Some s -> s
    | None -> default
  in
  (* 3 *)
  [%expr [%e Ast_builder.Default.estring value ~loc]]

let ext =
  Extension.declare
    name
    Extension.Context.expression
    (* 1 *)
    Ast_pattern.(single_expr_payload (pexp_tuple ((estring __)^::(estring __)^::nil)))
    expand

let () = Driver.register_transformation name ~extensions:[ ext ]

This looks pretty similar to our previous PPX, doesn't it? The main difference is the pattern we are using as the third argument to Extension.declare (1). Instead of a single string, we're now dealing with a tuple expression (pexp_tuple) consisting of two strings (estring). The ^:: operator chains them into a pattern for a two-element list, and the two captured values are passed as separate parameters (var and default) to the expand function. The expander itself is straightforward again: at (2) it tries to retrieve the value of the environment variable and falls back to the provided default value if the variable is unset. At (3) we again construct a string AST node with the resulting value.
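The lookup-with-fallback logic can be tried on its own, outside of any PPX. In the sketch below, env_or_default is a hypothetical helper name and the variable name is assumed to be unset on your machine. Keep in mind the key difference: the PPX performs this lookup while the program is being compiled, whereas this helper runs when the program is executed.

```ocaml
(* Hypothetical helper mirroring the lookup done inside the expander. *)
let env_or_default var default =
  match Sys.getenv_opt var with
  | Some value -> value
  | None -> default

let () =
  (* Assumption: this variable is not set in the current environment. *)
  print_endline (env_or_default "PPX_ARTICLE_UNSET_VARIABLE" "nano")
```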

While these examples offer a first glimpse at what ppxlib can do, the library has a lot more to offer. For example, patterns can specify alternatives via the alt function (or ||| operator) and more complex patterns can be constructed to match more than a single OCaml expression. I encourage you to play around with the examples provided here and see what useful extensions you can come up with!

PPX in the Wild

Extensions

After this brief introduction to developing PPX rewriters, let's explore some of the ones commonly found in OCaml projects.

Lwt, a library for concurrent I/O based on promises, is not only widely used, but also makes heavy use of syntax extensions defined in the Ppx_lwt module. One such example is the infix version of bind, which attaches callbacks to promises. The expression

let%lwt ch = get_char stdin in
...

will get rewritten to

bind (get_char stdin) (fun ch -> ...)

This becomes especially useful for nested binds, where the let%lwt form reduces nesting and the need for parentheses:

Lwt_main.run begin
  let%lwt () = Lwt_unix.sleep 1. in
  let%lwt () = Lwt_io.printl "One second passed" in
  let%lwt () = Lwt_unix.sleep 1. in
  Lwt_io.printl "One more second passed"
end

This is much more readable than the expanded form:

Lwt_main.run begin
 Lwt.bind (Lwt_unix.sleep 1.) (fun () ->
   Lwt.bind (Lwt_io.printl "One second passed") (fun () ->
     Lwt.bind (Lwt_unix.sleep 1.) (fun () ->
       Lwt_io.printl "One more second passed")))
end
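The flattening effect can be reproduced without Lwt installed. The sketch below swaps promises for the option type and uses OCaml's built-in binding operators (available since 4.08) rather than a PPX, so it is an analogy and not Lwt's API; add_nested and add_flat are made-up names.

```ocaml
(* Option stands in for Lwt promises here; this is not Lwt's API. *)
let bind = Option.bind

(* A binding operator (OCaml 4.08+) gives let-style syntax for bind. *)
let ( let* ) = bind

(* Nested form: every step adds a level of parentheses. *)
let add_nested a b =
  bind (int_of_string_opt a) (fun x ->
    bind (int_of_string_opt b) (fun y ->
      Some (x + y)))

(* Flat form: reads top to bottom, like the let%lwt version. *)
let add_flat a b =
  let* x = int_of_string_opt a in
  let* y = int_of_string_opt b in
  Some (x + y)

let () =
  assert (add_nested "1" "2" = add_flat "1" "2");
  assert (add_flat "1" "oops" = None)
```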

A similar inline expander is available for exception catching:

try%lwt
  f x
with
  | Failure msg ->
      prerr_endline msg;
      return ()

This expands to the following form and conveniently adds the generic exception case at the end automatically:

catch (fun () -> f x)
  (function
    | Failure msg ->
        prerr_endline msg;
        return ()
    | exn ->
        Lwt.fail exn)

Another useful PPX rewriter is ppx_inline_test, which allows for a very compact test syntax:

let is_even n = n mod 2 = 0

let%test _ = is_even 2
let%test _ = not (is_even 1)

The _ here takes the place of the test's name and defines an anonymous test.

This PPX also makes good use of attributes, which can, for example, be used to exclude certain tests when compiling OCaml to JS via js_of_ocaml:

let%test "native only" [@tags "no-js"] ...

Derivers

Extensions that add new nodes based on attributes instead of replacing existing AST nodes are often referred to as "derivers." These are particularly useful for repetitive or error-prone tasks or things that can be automated, like generating a different representation of existing data.

ppx_deriving

The ppx_deriving library is one of the best-known PPX rewriters in the OCaml ecosystem. It provides type-based code generation, similar to deriving clauses for Haskell type classes, and ships with several plugins for different use cases, some of which we are going to explore now.

The show plugin derives a function to inspect and pretty-print a value, providing extra insight into the structure of values compared to other pretty-printers:

type t = [ `A | `B of int ] [@@deriving show]

The [@@deriving show] attribute will generate two functions for us, pp and show:

val pp : Format.formatter -> [< `A | `B of int ] -> unit = <fun>
val show : [< `A | `B of int ] -> string = <fun>

Using the generated show function produces an informative string:

show (`B 1);;
- : string = "`B (1)"
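To get a feeling for the kind of code the plugin saves us from writing, here is a hand-written approximation of the two functions using only the standard library. It is a sketch, not the code ppx_deriving actually generates, which handles formatting boxes and many more type shapes.

```ocaml
type t = [ `A | `B of int ]

(* Hand-written approximation of the derived printer, not the generated code. *)
let pp fmt (v : t) =
  match v with
  | `A -> Format.pp_print_string fmt "`A"
  | `B n -> Format.fprintf fmt "`B (%d)" n

(* show renders a value into a string by going through pp. *)
let show (v : t) = Format.asprintf "%a" pp v

let () = print_endline (show (`B 1))
```

Every constructor added to the type would need another case here, which is the kind of repetitive upkeep a deriver takes over.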

Equally useful are the eq and ord plugins, which are often used together. The first one provides an equality function, while the latter defines an ordering function like Stdlib.compare (formerly Pervasives.compare) or Ruby's "spaceship operator" (<=>).

Reusing the type from the earlier example,

type t = [ `A | `B of int ] [@@deriving eq, ord]

the following two functions will be added:

val equal : [> `A | `B of int ] -> [> `A | `B of int ] -> bool = <fun>
val compare : [ `A | `B of int ] -> [ `A | `B of int ] -> int = <fun>

equal `A `A;;
- : bool = true
equal `A (`B 1);;
- : bool = false
compare `A `A;;
- : int = 0
compare (`B 1) (`B 2);;
- : int = -1
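For comparison, a hand-written version of these two functions might look like the sketch below. It is an approximation, not the generated code; in particular, ordering `A before `B follows declaration order, which is an assumption of this sketch.

```ocaml
type t = [ `A | `B of int ]

(* Structural equality, written out case by case. *)
let equal (a : t) (b : t) =
  match a, b with
  | `A, `A -> true
  | `B x, `B y -> Int.equal x y
  | _ -> false

(* Ordering: assumed here to follow declaration order, `A before `B. *)
let compare (a : t) (b : t) =
  match a, b with
  | `A, `A -> 0
  | `B x, `B y -> Int.compare x y
  | `A, `B _ -> -1
  | `B _, `A -> 1

let () =
  assert (equal `A `A);
  assert (not (equal `A (`B 1)));
  assert (compare (`B 1) (`B 2) < 0)
```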

ppx_sexp_conv

sexplib is an OCaml library for parsing and pretty-printing Lisp-like s-expressions. The corresponding ppx_sexp_conv library offers a convenient way of converting OCaml values to and from such s-expressions, which can be useful for serialization or domain-specific languages like the one used in Dune's build files.

The following code will generate two functions, sexp_of_int_pair and int_pair_of_sexp, which can be used to convert such a pair to or from an s-expression:

type int_pair = (int * int) [@@deriving sexp]
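To illustrate what such a pair of converters does, the following sketch defines a small s-expression type locally and writes both functions by hand. The sexp type here is a stand-in modeled on atoms and lists, and the error handling is simplified; the real generated code relies on sexplib's own types and conversion errors.

```ocaml
(* A minimal s-expression type, defined locally for this sketch. *)
type sexp =
  | Atom of string
  | List of sexp list

type int_pair = int * int

(* Hand-written stand-ins for the derived converters. *)
let sexp_of_int_pair ((a, b) : int_pair) : sexp =
  List [ Atom (string_of_int a); Atom (string_of_int b) ]

let int_pair_of_sexp (s : sexp) : int_pair =
  match s with
  | List [ Atom a; Atom b ] -> (int_of_string a, int_of_string b)
  | _ -> failwith "int_pair_of_sexp: expected a list of two atoms"

let () =
  (* Converting to an s-expression and back yields the original pair. *)
  let round_trip = int_pair_of_sexp (sexp_of_int_pair (1, 2)) in
  assert (round_trip = (1, 2))
```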

Summary

OCaml's extension points offer a powerful and flexible way to add new functionality to the language. While most developers may never write their own PPX, they'll almost certainly interact with them through popular libraries like Lwt or ppx_deriving. However, if you ever find yourself in a situation where a language extension seems like the best solution, ppxlib is an excellent library that greatly aids with writing one.

Michael Kohl

author

Michael's love affair with Ruby started around 2003. He also enjoys writing and speaking about the language and co-organizes Bangkok.rb and RubyConf Thailand.

Julia Hanke

illustrator

Julia Hanke is an illustrator living in Warsaw, Poland. She has worked in creative agencies and currently works as a full-time freelance illustrator, mainly making illustrations for animations and web design. She is now shifting her focus to editorial and children's book illustrations. You can follow her on Instagram @julia_hanke.