Introducing Rust Sitter

Write fast and robust parsers with ease

Shadaj Laddad · August 2022

tl;dr: Rust Sitter is a new approach to writing parsers in Rust. By combining macros with an external compile-time tool, it can automatically generate efficient parsers with Tree Sitter and bindings that extract results into high-level user structures.

At Berkeley, I'm a PhD student on the Hydro project, which is focused on designing new languages, compilers, and runtime systems that make it easy for anyone to write efficient distributed systems. Recently, we've started to build out compilers at lower levels of the Hydro stack, allowing Datalog-like programs that describe a computation graph to be compiled to Hydroflow, our low-level dataflow runtime. There are a lot of challenging type checking and optimization problems to solve in this project, yet one of the most painful parts ends up being last thing on anyone's mind: the parser.

Parsers have a variety of constraints corresponding to their wide range of uses:

  • they must be fast, since developers need quick iteration cycles to be productive
  • they must have strong support for error handling and recovery, so that the compiler can guide new users with helpful error messages
  • they should output parsing results in a semantically meaningful way, so that downstream logic such as typechecking can operate on the results
  • ideally, they should be incremental, so that they can be used in IDE tooling

Meeting these constraints typically involves creative use of parser generators like Bison or requires the developer to create a parser from scratch. And even after managing to cobble together a parser, it will often still lack features such as incremental updates. Sufficiently frustrated by this process when working on Hydro, I set out to search for a better approach. In this post, I introduce Rust Sitter: a new tool for writing parsers which makes it easy to leverage the power of modern parsing algorithms without leaving the comfort of your existing Rust codebase.

Tree Sitter has recently emerged as a parser generator that is capable of satisfying most the constraints we have laid out. By combining advanced parsing algorithms with an intuitive API for defining grammars and a laser-focus on incremental parsing and embeddability, Tree Sitter is excellent for a wide range of use cases. It's already in production use as a part of GitHub's semantic analyses and Atom's syntax highlighting. Plus, the community around Tree Sitter has proven its flexibility by implementing a variety of grammars for popular languages.

There's one part of using Tree Sitter that's still a bit tricky: writing the bindings that extract the parsing results into semantically meaningful types in your favorite language (in our case, Rust). In other parsing tools for Rust, like nom, this is handled by combining low-level primitives into parsing rules for high-level structures through regular Rust function calls. This means that the parsing results are immediately available in the types you want, but because the grammar is scattered across several functions it misses out on end-to-end analyses such as detecting conflicts and ambiguities at compile time.

Rust Sitter aims to solve this by making it possible to define Tree Sitter grammars with annotations on Rust types. At compile time, Rust Sitter automatically compiles an efficient parser implementation with Tree Sitter and generates the bindings to it. With Rust Sitter, developers can utilize the power of Tree Sitter without having to leave the Rust codebase they are already working in or spending hours writing boilerplate.

Tree Sitter Today

Before we dive into Rust Sitter, let's see what it's like to use Tree Sitter by itself.

The first step to using Tree Sitter is defining the grammar, which formally defines the language we are trying to parse. Tree Sitter grammars are defined in the popular context-free style, where the language is defined as a set of rules that define how a non-terminal (corresponding to a node in the AST) can be constructed from a concatenation of other non-terminals and terminals (typically defined as regular expressions).

Tree Sitter grammars can be defined in a JavaScript file with a simple API. To resolve grammar ambiguities, such as associativity of binary operators, Tree Sitter provides a variety of built-in annotations for defining precedence levels and left/right associativity. For example, a simple grammar for arithmetic expressions looks like this:

{
    rules: {
        // defines the root node of the language
        source_file: $ => $.expression,

        // a non-terminal
        expression: $ => choice(
            $.number,
            $.expression_plus,
            $.expression_times
        ),

        expression_plus: $ => prec.left(1,
            seq($.expression, '+', $.expression)
        ),
        expression_times: $ => prec.left(2,
            seq($.expression, '*', $.expression)
        ),

        // a terminal
        number: $ => /\d+/
    }
}

Once you have a grammar written, you can compile it to a dependency-free C parser using the Tree Sitter CLI. The C APIs can then be imported to Rust as an extern using the tree-sitter crate, which defines the native types used by the underlying parser.

extern "C" {
    fn tree_sitter_grammar() -> Language;
}

pub fn language() -> Language {
    unsafe { tree_sitter_grammar() }
}

But we are not quite done yet. We need to define the semantically-meaningful AST that we want to extract from the parsing results. Tree Sitter itself just provides us an untyped tree (effectively nested maps) with strings for leaves, which is difficult to use in downstream logic. For the above grammar, an ideal AST would look like this:

#[derive(Debug)]
pub enum Expression {
    Number(i32),
    Add(Box<Expression>, Box<Expression>),
    Multiply(Box<Expression>, Box<Expression>),
}

We need to write a fair amount of logic to convert the nodes returned by Tree Sitter into this Rust type. Most of this involves navigating the tree and extracting the data we want. For example, at when extracting an Expression, we have to use Tree Sitter APIs to identify which choice was taken. We also have to write further logic to either recursively parse more expressions or to extract the number as an i32 from a leaf.

fn extract_expression(node: Node, source: &[u8]) -> Expression {
    let first_child = node.child(0).unwrap();
    match first_child.kind() {
        "number" => Expression::Number(first_child.utf8_text(source).unwrap().parse().unwrap()),
        "expression_plus" => {
            let left = extract_expression(first_child.child(0).unwrap(), source);
            let right = extract_expression(first_child.child(2).unwrap(), source);
            Expression::Add(Box::new(left), Box::new(right))
        },
        "expression_times" => {
            let left = extract_expression(first_child.child(0).unwrap(), source);
            let right = extract_expression(first_child.child(2).unwrap(), source);
            Expression::Multiply(Box::new(left), Box::new(right))
        },
    }
}

This type of logic has to be written for every node in the AST, and quickly adds up to a lot of boilerplate. It's also easy to make mistakes when writing this extraction logic, especially if the grammar is changed but the extraction logic is not kept up to date.

Using Rust Sitter

This is where Rust Sitter comes in. With Rust Sitter, we can skip defining the grammar in a separate file and instead simply annotate the existing AST types we have created to map them to the language we want to parse.

Let's start with our existing expression example. A Rust Sitter grammar is defined as a module annotated with the #[rust_sitter::grammar(...)] proc macro. Exactly one type in this module should be annotated as #[rust_sitter::language] to mark it as the entrypoint for parsing:

#[rust_sitter::grammar("arithmetic")]
mod grammar {
    #[rust_sitter::language]
    pub enum Expression {
        Number(i32),
        Add(Box<Expression>, Box<Expression>),
        Multiply(Box<Expression>, Box<Expression>),
    }
}

Next, we need to define how the types in our AST correspond to the strings we want to parse. Let's start with the Number node, which can be described using a regular expression:

Number(
    #[rust_sitter::leaf(
        pattern = r"\d+",
        transform = |v| v.parse().unwrap()
    )]
    i32
)

In a single annotation, we specify that:

  1. The Number node is a leaf node, which means that it directly corresponds to a token (consecutive sequence of characters) in the source.
  2. A Number token is any sequence of characters consisting of one or more digits (specified by the pattern regular expression).
  3. Once we have extracted the token corresponding to the number (v: &str), we can parse it into an i32 using the specified transform.

Next, let's annotate the Add and Multiply nodes. This is a bit more complicated, since in our existing types there is no field in the Add or Multiply variants that correspond to the "+" or "*" characters. So we can augment these variants with a new field, and annotate them to match the "+" and "*" strings, respectively. Rust Sitter automatically identifies the references to Expression and wires up the recursive grammar appropriately.

pub enum Expression {
    ...,
    Add(
        Box<Expression>,
        #[rust_sitter::leaf(text = "+")]
        (),
        Box<Expression>
    ),
    Multiply(
        Box<Expression>,
        #[rust_sitter::leaf(text = "*")]
        (),
        Box<Expression>
    ),
}

We are almost done! We can define a build.rs that will automatically generate a Tree Sitter grammar based on our Rust Sitter definitions and compile it to native code.

use std::path::PathBuf;

fn main() {
    rust_sitter_tool::build_parsers(&PathBuf::from("src/main.rs"));
}

But if we compile, we'll get an error:

Unresolved conflict for symbol sequence:

    Expression  Expression_Add_1  Expression  •  Expression_Add_1  …

  Possible interpretations:

    1:  (Expression_Add  Expression  Expression_Add_1  Expression)  •  Expression_Add_1  …
    2:  Expression  Expression_Add_1  (Expression_Add  Expression  •  Expression_Add_1  Expression)

  Possible resolutions:

    1:  Specify a left or right associativity in `Expression_Add`
    2:  Add a conflict for these rules: `Expression_Add`

This is an error generated by Tree Sitter, and it tells us that there is an ambiguity in the grammar. In this case, Tree Sitter does not know if an expression like 123 + 456 + ... should be parsed as (123 + 456) + ... or 123 + (456 + ...). We can resolve this by specifying an associativity for Add (and Multiply, because it has the same issue) to group the terms on the left together first (corresponding to a left-to-right evaluation order):

#[rust_sitter::prec_left(1)]
Add(...),
#[rust_sitter::prec_left(2)]
Multiply(...),

You might notice that we pass a number as a parameter to the prec_left annotation; this specifies the precedence level in the grammar. If we have an expression like 123 + 456 * 789 without the precedence levels, there is another ambiguity: should it be parsed as (123 + 456) * 789 or 123 + (456 * 789)? By specifying a higher precedence level for the Multiply rule, we tell Tree Sitter that we want that to bind more tightly than the Add rule.

Now that we have our grammar defined, we can parse a string!

dbg!(grammar::parse("1 + 2 * 3 + 4"));
/*
Ok(Expression::Add(
    Expression::Add(
        Expression::Number(1),
        Expression::Multiply(
            Expression::Number(2),
            Expression::Number(3)
        )
    ),
    Expression::Number(4)
))
*/

Because we use Tree Sitter under the hood, we also automatically get high-quality error messages when we parse invalid strings. With a little big of logic to format errors with the codemap-diagnostic crate, we can get easy to understand messages. Note that Tree Sitter continues to attempt to parse the string even after encountering the first error!

> 1 + 2 - 3 / 4

error[S000]: Unexpected token: "-"
 --> <input>:1:7
  |
1 | 1 + 2 - 3 / 4
  |       ^ unexpected "-"

error[S000]: Unexpected token: "/"
 --> <input>:1:11
  |
1 | 1 + 2 - 3 / 4
  |           ^ unexpected "/"

And that's it for our quick tour of Rust Sitter! There are a lot more features that Rust Sitter supports (see the docs!):

  • optional / repeated elements with Option and Vec (including annotations for defining delimiters)
  • defining "extra" rules that can be used to match strings anywhere, useful for whitespace and comments
  • additional annotations for specifying more complex grammars

But for now, we know enough to dive into the internals of Rust Sitter.

How it Works

Rust Sitter has two key roles:

  1. extract a Tree Sitter grammar definition
  2. generate Rust bindings that extract parsing results to the user-defined types

User Sources
Rust Compiler
Rust Sitter Tool
Rust Sitter Macro
Tree Sitter JSON
Rust Bindings
C Parser

the two-phase architecture of Rust Sitter

Fundamentally, both of these operate in a similar way: parse the Rust code within an annotated module and generate either Rust or JSON. When we generate Rust code, we must use a macro since we want the emitted logic to be immediately compiled. On the other hand, the JSON is used to generate the C parser implementation, which must then be compiled and linked into the final binary by Cargo, so it must exist as a separate tool that can be invoked in a build.rs.

This split mirrors Diplomat, a project I started last summer at Google with ManishEarth. Diplomat is focused on making it easier to call Rust APIs from other languages, like C++ and JavaScript. There, the macro generates the Rust layer that is exported over FFI, and the tool generates the binding logic for every target language. In Diplomat, common logic analyzes the entire Rust code and generates an intermediate representation that is used to drive both codegen backends. In Rust Sitter, which requires simpler analysis, we take the lightweight approach of sharing some common utilities but not creating an entire IR.

The Tool: Tree Sitter JSON

The first step in the Rust Sitter pipeline is to generate Tree Sitter grammar definitions based on the annotated Rust modules. Since the tool component exists outside of the Rust compilation process (not a macro), we need to parse the source code ourselves to identify grammar definitions.

The core analysis in both the tool and macro uses the syn crate, which provides a Rust AST definition that we can use to navigate the program. In the macro, we can directly parse an AST from the tokens provided by the Rust compiler, but for the external tool we use the syn-inline-mod crate to parse the root file (main.rs or lib.rs) of the project and recursively extract all defined modules. Then, we can recursively explore the AST to find all Rust Sitter modules.

pub fn generate_grammars(root_file: &Path) -> Vec<String> {
    let root_file_items = syn_inline_mod::parse_and_inline_modules(root_file).items;
    let mut out = vec![];
    root_file_items
        .iter()
        .for_each(|i| generate_all_grammars(i, &mut out));
    out
}

Once we have found a grammar definition, the fun begins. As we discussed before, Tree Sitter typically takes in a JavaScript grammar definition, but this adds an additional compile-time dependency on Node. Instead, we can directly generate the JSON that the JavaScript emits and pass it to Tree Sitter to generate the C parser.

We need to iterate through all types defined in the module and create corresponding non-terminals in the JSON grammar definition. For each element, we invoke different logic based on whether it is a enum or struct definition, since those have different corresponding rules.

For enums, we generate a "CHOICE" node that recursively references non-terminals created for each of the variants.

match item {
    Item::Enum(e) => {
        e.variants.iter().for_each(|v| {
            gen_struct_or_variant(
                // like "Expression_Number"
                format!("{}_{}", e.ident, v.ident),
                v.attrs.clone(),
                v.fields.clone(),
                &mut rules_map,
            )
        });

        let rule = json!({
            "type": "CHOICE",
            "members": e.variants.iter().map(|v| {
                json!({
                    // a "SYMBOL" references another non-terminal
                    "type": "SYMBOL",
                    "name": format!(
                        "{}_{}",
                        e.ident.clone(), v.ident
                    )
                })
            }).collect::<Vec<_>>()
        });
        ...
    }
    ...
}

For structs, we use similar logic to generate a "SEQ" node that contains "FIELD" nodes referencing a non-terminal for each field. When generating the grammar definitions for each field, we need additional logic to handle leaf annotations, which take a variety of named parameters such as pattern and transform. Luckily, syn gives us a flexible approach to treat the parameters to the attribute as just a series of tokens that we can choose to parse as key-value pairs.

By defining a custom struct (NameValueExpr) that implements the Parse trait, we can parse named parameters by scanning for an identifier, then the = character, and then using syn's built-in Expr type to parse the entire expression. By leveraging existing AST definitions in syn, we can use this to parse even complex parameters such as transform, which takes in a closure expression.

pub struct NameValueExpr {
    pub path: Ident,
    pub eq_token: Token![=],
    pub expr: Expr,
}

impl Parse for NameValueExpr {
    fn parse(input: ParseStream) -> syn::Result<Self> {
        Ok(NameValueExpr {
            path: input.parse()?,
            eq_token: input.parse()?,
            expr: input.parse()?,
        })
    }
}

// ...

let leaf_attr = field
    .attrs
    .iter()
    .find(|attr| attr.path == syn::parse_quote!(rust_sitter::leaf));

let leaf_params = leaf_attr.and_then(|a| {
    a.parse_args_with(
        Punctuated::<NameValueExpr, Token![,]>::parse_terminated
    ).ok()
});

With similar logic for the other annotations, such as prec_left, and additional rules for generating "REPEAT" nodes when the Rust types involve Vecs, we can now generate the entire grammar definition. With this JSON, we then turn to Tree Sitter to generate an efficient C implementation of the parser. Because the Tree Sitter CLI is written in Rust and its core APIs are publicly exported in the tree-sitter-cli crate, we can directly invoke this logic from Rust without having to shell out!

use tree_sitter_cli::generate::generate_parser_for_grammar;
let (name, c_source) = generate_parser_for_grammar(grammar).unwrap();

With the C sources in hand, we can finally compile the parser into a static library that is linked into the final binary.

let dir = TempDir::new("grammar").unwrap();
let grammar_file = dir.path().join("parser.c");
let mut f = std::fs::File::create(grammar_file).unwrap();

f.write_all(grammar_c.as_bytes()).unwrap();
drop(f);

cc::Build::new()
    .include(&dir)
    .file(dir.path().join("parser.c"))
    .compile(&name);

The exported symbols from this parser will then need to be consumed by the Rust binding logic, which we discuss next!

The Macro: Rust Bindings

Once we have our C parser instantiated, we need to generate the runtime logic that navigates the Tree Sitter results and converts each parsed node into the corresponding Rust struct / enum. Similar to before, this involves an AST analysis, but instead of emitting JSON we generate new Rust code.

Rust Sitter comes with an Extract trait which captures the interface by which each node in the AST is converted into a Rust struct/enum. By having a common interface, we can easily perform recursive conversions by jumping between the impls for each type.

pub trait Extract {
    fn extract(node: tree_sitter::Node, source: &[u8]) -> Self;
}

The job of our proc macro is to generate Extract implementations for each type that corresponds to a non-terminal in the grammar. This mostly follows the same structure as generating the JSON for each type. For example, when generating the Extract impl for an enum, we start by generating functions (with the same naming scheme as the JSON) for each variant. Each of these functions will be emitted into the body of the extract function, since we do not need them to be publicly accessible and they will only be used by the remaining logic for the impl.

e.variants.iter().for_each(|v| {
    gen_struct_or_variant(
        // like "Expression_Number"
        format!("{}_{}", e.ident, v.ident),
        v.fields.clone(),
        Some(v.ident.clone()),
        e.ident.clone(),
        // a vector of functions that will be placed inside `extract`
        &mut impl_body,
    )
});

// results in

fn extract_Expression_Number(node: tree_sitter::Node, source: &[u8]) -> Expression {
    // ...
}

Then, we can generate a match expression that will call the appropriate function for each variant. The syn::parse_quote! macro is extremely helpful when writing macro expansion logic, as it allows us to generate Rust trees by writing regular Rust code with snippets mixed in using #.

let match_cases: Vec<Arm> = e.variants.iter().map(|v| {
    let variant_path = format!("{}_{}", e.ident, v.ident);
    let extract_ident =
        Ident::new(&format!("extract_{}", variant_path), Span::call_site());
    syn::parse_quote! {
        #variant_path => #extract_ident(node.child(0).unwrap(), source)
    }
}).collect();

Finally, we can emit the impl for the Extract trait:

syn::parse_quote! {
    impl rust_sitter::Extract for #enum_name {
        #[allow(non_snake_case)]
        fn extract(node: tree_sitter::Node, source: &[u8]) -> Self {
            #(#impl_body)*

            match node.child(0).unwrap().kind() {
                #(#match_cases),*,
                _ => panic!()
            }
        }
    }
}

An interesting trick appears when we handle the transform argument for leaf nodes, which allows users to define a closure that transforms the leaf string into a richer type. If we simply call the closure in IIFE form, we end up with type-inference issues that force the user to manually specify the input type. An expansion of the form:

syn::parse_quote! {
    (#closure)(#leaf_text_expr.unwrap())
}

leads to compile-time errors like this:

error[E0282]: type annotations needed
 --> example/src/arithmetic.rs:6:67
  |
6 |         Number(#[rust_sitter::leaf(pattern = r"\d+", transform = |v| v.parse().unwrap())] i32),
  |                                                                   ^ consider giving this closure parameter a type
  |
  = note: type must be known at this point

But we do know what the type of the input is (&str), so how can we make the user experience a bit nicer and handle this inference? A immediate idea is to store the closure into an explicitly typed variable, but Rust doesn't let us use impl Fn types in that position. What we can do instead is create a local function that has the impl Fn type and wraps around the user provided function (allowing type inference to go through), and then using that function to get the closure we call at runtime:

syn::parse_quote! {
    fn make_transform() -> impl Fn(&str) -> #inner_type {
        #closure
    }
}

// ...

syn::parse_quote! {
    make_transform()(#leaf_text_expr.unwrap())
}

Tricks like these make it much easier to interact with the macro, since users can now use the mental model of calling regular Rust functions when using annotations like rust_sitter::leaf. Furthermore, since we preserve the location information throughout the macro by using syn to handle all the transformations, users can get rich diagnostics when using tools like Rust Analzyer, which makes the experience much more pleasant.

The rest of the Rust Sitter macro follows the same patterns, just for other constructs like structs. With the tool configured in a build.rs and the macro automatically invoked by the Rust compiler, this two-step process enables fully automated use of Tree Sitter from Rust!

What's Next

We've started using Rust Sitter already to write parsers for Datalog and Dedalus in the Hydro project, but there are a lot more pieces of the puzzle that are still missing!

Error Recovery

Right now, Rust Sitter returns a top level Result that either contains a fully parsed AST or a list of errors reported by Tree Sitter. But one of the key features of Tree Sitter is that it can return a partial parse tree if it encounters an error. An exciting challenge is exploring how this can be brought to the Rust level. Perhaps Rust Sitter grammars can use a Result type for AST fields to mark where errors can be collected, similar to React's error boundaries?

Incremental Parsing

A key feature of Tree Sitter is its ability to perform incremental parsing when small edits are made to the input document. But Rust Sitter currently creates a brand new AST from scratch every time we call extract. While this is still quite fast, it's not ideal for large documents or in applications like IDEs where latency is a concern. Supporting incremental parsing will likely require expanding the Extract trait to allow users to keep around a struct that tracks the current parsing state.

Exporting Grammars

Rust Sitter generates grammars solely to be used by the corresponding Rust bindings. But the grammar definition generated under the hood can be useful for other purposes, such as performing syntax highlighting in an editor. In the future, we hope to add APIs to Rust Sitter that allow users to annotate the grammar with rich metadata such as color highlighting rules, which can then be included when exporting the grammar for external use.

Try it Out!

Rust Sitter is available on crates.io today! You can check out (and contribute to) the Rust Sitter repository on GitHub. Rust Sitter is still in the early stages of development, so please let us know if you run into roadblocks or have ideas for new features!

Thanks to Joe Hellerstein, Audrey Cheng, and Ramnivas Laddad for feedback on this post!

Footnotes

  1. Sufficiently difficult to be an entire project in the Berkeley compilers class. Students almost never manage to pass all the correctness tests.

  2. Which unfortunately many modern languages like Rust itself have had to end up doing.

  3. Specifically, Tree Sitter uses the GLR algorithm, which means that it uses similar high-level concepts as the LR(1) parsing algorithm while supporting a wider range of languages. This property makes it much easier to debug grammar conflicts, since they are presented in familiar forms such as shift/reduce.

  4. Which itself is written in Rust! This might be a useful property 👀

  5. This is precisely the difference between an Abstract Syntax Tree and a Concrete Syntax Tree! Rust Sitter grammars parse into something in between, since we support lossy extraction of fields into a () type which means that we can't necessarily convert nodes back into the original string.

  6. This actually ends up being a bit more complicated since Tree Sitter includes extra nodes (for whitespace) as siblings of the repeated elements (as it is emitting a CST). So we have to use "fields" to mark the children we want to extract.

  7. "Immediately Invoked Function Expression", a term from the JavaScript community where such patterns used to be quite common as a way to create new scopes.

  8. This actually will be possible in the near-ish future, as part of the existential type RFC. You can already use impl Trait for variable types with impl_trait_in_bindings in nightly toolchains.