Tree sitter
What's tree-sitter?
Generally speaking, tree-sitter is a parser generator tool, which creates parsers for programming languages and parses source code to parse tree. Tree-sitter also provides incrementally parsing feature, making updating of parse tree efficient.
Tree-sitter aims to be
- General
- Fast
- Robust
- Dependency-free
At the moment(2022/01/23), tree-sitter has been used in many famous projects, such as Neovim, GitHub Code Search, etc.
Language bindings
Tree-sitter runtime is written in pure C, so it's easy to embed it to other languages. Tree-sitter provides many language bindings, you can find a list here: https://tree-sitter.github.io/tree-sitter/#language-bindings.
In this article, We'll use Rust bindings of tree-sitter. If you don't use Rust, don't worry! Other language bindings of tree-sitter should be similar to use.
Using Tree-Sitter Parsers in Rust
The Rust bindings of tree-sitter can be found here. You can see all API docs at https://docs.rs/tree-sitter/latest/tree_sitter/.
The official document also has introductiosn of tree-sitter's raw C API. If you want to use tree-sitter in other languages, C API is a good start to learn.
Getting Started
To use tree-sitter's Rust API, you should add tree-sitter dependency to your Cargo.toml
first:
|
|
You can also build tree-sitter from source, see: https://tree-sitter.github.io/tree-sitter/using-parsers#building-the-library
Basic Conceptions
There are four basic conceptions in tree-sitter:
- language: defines how to parse a particular language
- parser: generates syntax tree from source code
- syntax tree: is a tree representation of an entire source code file
- syntax node: is a single node in syntax tree
In Rust's API, they are called:
An Example
Here is an example in Rust that uses tree-sitter JSON parser. Because tree-sitter maintains parsers of most popular programming languages, we don't have to generate parsers for those languages again. Here is a list of available parsers for now.
To use tree-sitter's official JSON parser, just add tree-sitter-json
crate to your Cargo.toml
:
|
|
The example code of parsing JSON:
|
|
Execute cargo run
, you'll get the following output:
Syntax tree: (document (array (number) (null)))
The output is the S-expression of the JSON string input.
This program uses tree-sitter's Rust API and tree-sitter-json
crate. If you don't want to add tree-sitter-json
to your dependency list, there is a second approach: add cc
to your `[build-dependency] and embed the parser at build stage by add the following code to your build script:
|
|
|
|
Then, you can declare a language and set the language to your parser in your source code:
|
|
The second approach to use tree-sitter is what's in tree-sitter's official document, but anyway, I prefer the first approach.
Basic Parsing
In the section above, we have an example to parse JSON string. In this section, I'll introduce more about basic parsing using tree-sitter.
Providing the code
In the example above, we simply provide source code to the parser. However, in many modern editors and IDEs, the code is stored as other data structures such as rope and piece table(VSCode uses it as TextBuffer!).
To support those custom data structures, tree-sitter also provides parse_with()
method. With it, you can write your own function to read a chunk of text at a given byte_offset
and position
:
|
|
Syntax node
Each node in the parse tree has a type
, which is a string that defined in grammar:
|
|
You can also get all type definitions by
|
|
You can get the position of a syntax node in the form of range in bytes or row/column pair:
|
|
Retrieving nodes
Every parse tree has a root node:
|
|
You can also access children, siblings and parent of a Node
:
|
|
All those functions returns Option<Node>
. The returned value id None
means the required node is not exist. For example, the parent of a root_node is always None
:
|
|
Named nodes
The parse tree generated by tree-sitter is actually concrete syntax tree(CST), which contains all tokens in the source text. But in many cases, we don't need so much information. It's easier to do code analyze on an abstract syntax tree. An abstract syntax tree(AST) is also a parse tree, but has less information compared with a CST. Tree-sitter supports both use cases by using named nodes and anonymous nodes.
Named and anonymous nodes are defined in grammar. Here is an example:
if_statement: ($) => seq("if", "(", $._expression, ")", $._statement);
It's the definition of if_statement
syntax node. This node has 5 children: the condition expression($._expression
), the body expression($._statement
), "if", "(" and ")". In this case, the condition expression and the body expression are named nodes, because they have explicit names defined in the grammar. And "if", "(" and ")" are anonymous nodes, because they are represented as raw string in the grammar.
In the tree-sitter's parse tree, you can check whether a node is a named node using node.is_named()
.
When you traverse the tree, you can also use named variants of the describled methods above to ignore all anonymous nodes:
|
|
Field
To make the parse tree easier to analyze, tree-sitter supports field names of a syntax node in grammar. For example, you can use the field name to access children of a syntax node:
|
|
Field also has an unique id which you can use to access the child:
|
|
You can convert between the string field name and the field id using Language
:
|
|
Similarly, node's type has an id as well and you can use Language
to convert between them:
|
|
That's all about tree-sitter basics
In this article, we've covered most basic tree-sitter APIs. In the next articles, I'll introduce some advanced topic of tree-sitter, like incrementally parsing, pattern matching and grammar writting.