yceffort
When working on JavaScript projects these days, you'll notice there are tons of dependencies in devDependencies. JavaScript transpiling, code minification, CSS pre-processors, eslint, prettier, and so on. While these features don't make it to production code, they handle important tasks during development. And all of these tools operate based on AST processing.
Table of Contents
- What is AST?
- The Process of Creating AST from Code
- Understanding AST Node Types
- Use Case 1: Transpiling (Babel)
- Use Case 2: Automated Code Refactoring (JSCodeShift)
- Use Case 3: Linting (ESLint)
- Use Case 4: Code Formatting (Prettier)
- Use Case 5: Code Visualization
- Summary
What is AST?
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.
Simply put, it's the transformation of code strings into tree-structured data that computers can understand. Each element in the code (variable declarations, function calls, operators, etc.) becomes a node in the tree. An example will make this clearer.
All examples can be verified at https://astexplorer.net/
function square(n) {
return n * n
}
When this code is converted to AST, the tree structure looks roughly like this:
Program
└── FunctionDeclaration (name: "square")
    ├── params
    │   └── Identifier (name: "n")
    └── body (BlockStatement)
        └── ReturnStatement
            └── BinaryExpression (operator: "*")
                ├── left: Identifier (name: "n")
                └── right: Identifier (name: "n")
You can see that every element of the code maps 1:1 to tree nodes. function square(n) becomes a FunctionDeclaration node, and return n * n inside becomes a BinaryExpression under ReturnStatement.
The actual AST JSON generated by parsers is much more verbose because it includes metadata like location information (loc, range, start, end). Here's the core structure extracted:
{
"type": "Program",
"body": [
{
"type": "FunctionDeclaration",
"id": { "type": "Identifier", "name": "square" },
"params": [{ "type": "Identifier", "name": "n" }],
"body": {
"type": "BlockStatement",
"body": [
{
"type": "ReturnStatement",
"argument": {
"type": "BinaryExpression",
"operator": "*",
"left": { "type": "Identifier", "name": "n" },
"right": { "type": "Identifier", "name": "n" }
}
}
]
}
}
]
}
If you're curious about the complete AST JSON, paste the code above into AST Explorer to see it immediately.
The Process of Creating AST from Code
But how does such a tree get created from a code string? It goes through two main stages.
Stage 1: Lexical Analysis
The lexical analyzer (also called scanner or tokenizer) breaks down the code string into token units. Tokens are the smallest meaningful units.
function square(n) {
return n * n
}
Tokenizing the above code produces this result:
[
{ type: 'keyword', value: 'function' },
{ type: 'identifier', value: 'square' },
{ type: 'punctuator', value: '(' },
{ type: 'identifier', value: 'n' },
{ type: 'punctuator', value: ')' },
{ type: 'punctuator', value: '{' },
{ type: 'keyword', value: 'return' },
{ type: 'identifier', value: 'n' },
{ type: 'punctuator', value: '*' },
{ type: 'identifier', value: 'n' },
{ type: 'punctuator', value: '}' },
]
The lexical analyzer reads the code character by character, distinguishing keywords like function, identifiers like square, and punctuation like (. Whitespace and line breaks are removed during this process.
Let's look at how a very simple tokenizer actually works. Below is a mini tokenizer that only handles numbers and arithmetic operations:
function tokenize(code) {
const tokens = []
let i = 0
while (i < code.length) {
const char = code[i]
// Skip whitespace
if (/\s/.test(char)) {
i++
continue
}
// Numbers: treat consecutive digits as one token
if (/[0-9]/.test(char)) {
let value = ''
while (i < code.length && /[0-9]/.test(code[i])) {
value += code[i++]
}
tokens.push({ type: 'number', value })
continue
}
// Operators
if ('+-*/'.includes(char)) {
tokens.push({ type: 'operator', value: char })
i++
continue
}
throw new Error(`Unknown character: ${char}`)
}
return tokens
}
tokenize('12 + 3 * 45')
// [
// { type: 'number', value: '12' },
// { type: 'operator', value: '+' },
// { type: 'number', value: '3' },
// { type: 'operator', value: '*' },
// { type: 'number', value: '45' },
// ]
The core principle is simple. Look at the current character and determine "is this the start of a number, an operator, or whitespace?" then classify it into the appropriate token. Real JavaScript parser tokenizers have to handle much more complex cases like strings ('...', "..."), regular expressions (/.../), template literals (`...`), but the basic principle is the same.
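As a small taste of those extra cases, here is a hedged sketch extending the mini tokenizer above to also recognize identifiers and keywords. The keyword list and character classes are deliberate simplifications, not the real ECMAScript lexical grammar:

```javascript
// Simplified keyword list -- real JavaScript has dozens more
const KEYWORDS = new Set(['function', 'return'])

function tokenizeWithWords(code) {
  const tokens = []
  let i = 0
  while (i < code.length) {
    const char = code[i]
    // Skip whitespace
    if (/\s/.test(char)) {
      i++
      continue
    }
    // Words: consecutive letters become one token, then we check
    // whether the finished word is a keyword or an identifier
    if (/[A-Za-z_$]/.test(char)) {
      let value = ''
      while (i < code.length && /[A-Za-z0-9_$]/.test(code[i])) {
        value += code[i++]
      }
      tokens.push({ type: KEYWORDS.has(value) ? 'keyword' : 'identifier', value })
      continue
    }
    // Numbers: consecutive digits become one token
    if (/[0-9]/.test(char)) {
      let value = ''
      while (i < code.length && /[0-9]/.test(code[i])) {
        value += code[i++]
      }
      tokens.push({ type: 'number', value })
      continue
    }
    // Single-character punctuators and operators
    if ('+-*/(){}'.includes(char)) {
      tokens.push({ type: 'punctuator', value: char })
      i++
      continue
    }
    throw new Error(`Unknown character: ${char}`)
  }
  return tokens
}

tokenizeWithWords('return n * 2')
// [
//   { type: 'keyword', value: 'return' },
//   { type: 'identifier', value: 'n' },
//   { type: 'punctuator', value: '*' },
//   { type: 'number', value: '2' },
// ]
```

The only new idea over the number-and-operator version is the keyword lookup after the word is complete: the tokenizer can't know whether "r" starts return or a variable named result until it has read the whole word.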
Stage 2: Syntax Analysis
The syntax analyzer (parser) takes the token list from above and assembles it into a tree structure according to the language's grammar rules. It applies rules like "the function keyword is followed by an identifier, with parameters in parentheses..." to build the relationships between tokens into a tree. If the code doesn't match the grammar, a SyntaxError is thrown at this stage. The result of this process is the Abstract Syntax Tree.
One of the most interesting aspects of what parsers do is handling operator precedence. Consider 1 + 2 * 3. Simply reading left to right would give (1 + 2) * 3 = 9, but the mathematically correct result is 1 + (2 * 3) = 7. The parser represents this precedence through tree structure.
// AST for 1 + 2 * 3
// Multiplication is at a deeper position, so it's calculated first
BinaryExpression (+)
├── left: NumericLiteral (1)
└── right: BinaryExpression (*)
    ├── left: NumericLiteral (2)
    └── right: NumericLiteral (3)
The * is positioned deeper (further down) in the tree than +. When evaluating the tree from bottom to top, 2 * 3 is calculated first, then 1 is added to that result. The operation order is naturally encoded in the tree structure even without explicitly writing parentheses.
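To make this concrete, here is a minimal from-scratch recursive-descent parser. This is a sketch for illustration, not a real JavaScript parser: it only handles numbers and the four arithmetic operators, and it consumes tokens in the same shape the mini tokenizer above produces. Precedence comes from the function layering: parseTerm (for * and /) sits one call level below parseExpression (for + and -), so multiplication naturally ends up deeper in the tree:

```javascript
function parse(tokens) {
  let pos = 0
  const peek = () => tokens[pos]
  const next = () => tokens[pos++]

  // Lowest precedence: + and -
  function parseExpression() {
    let node = parseTerm()
    while (peek() && (peek().value === '+' || peek().value === '-')) {
      const operator = next().value
      node = { type: 'BinaryExpression', operator, left: node, right: parseTerm() }
    }
    return node
  }

  // Higher precedence: * and / (called from parseExpression, so it binds tighter)
  function parseTerm() {
    let node = parsePrimary()
    while (peek() && (peek().value === '*' || peek().value === '/')) {
      const operator = next().value
      node = { type: 'BinaryExpression', operator, left: node, right: parsePrimary() }
    }
    return node
  }

  // Highest precedence: a bare number
  function parsePrimary() {
    const token = next()
    if (token.type === 'number') {
      return { type: 'NumericLiteral', value: Number(token.value) }
    }
    throw new SyntaxError(`Unexpected token: ${token.value}`)
  }

  return parseExpression()
}

// Tokens for "1 + 2 * 3"
const ast = parse([
  { type: 'number', value: '1' },
  { type: 'operator', value: '+' },
  { type: 'number', value: '2' },
  { type: 'operator', value: '*' },
  { type: 'number', value: '3' },
])
// ast.operator is '+', and ast.right is the BinaryExpression for 2 * 3:
// the multiplication sits deeper in the tree, exactly as in the diagram above
```

Note that no precedence table appears anywhere; the grammar's call structure alone guarantees that 2 * 3 is grouped before the addition.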
The reason it's called "Abstract" is that syntactic decorations like parentheses and semicolons are implicitly contained in the tree structure itself, so they're not represented as separate nodes. In the above example, even if you write (2 * 3) with parentheses, the AST structure remains the same. The meaning of the parentheses (precedence) is already reflected in the tree structure.
Note: Trees that include all syntactic elements including parentheses and semicolons are called CST (Concrete Syntax Tree). Tools like prettier that need to preserve the original code format as much as possible sometimes use representations closer to CST.
JavaScript Parsers
Several parsers exist in the JavaScript ecosystem. Most follow the ESTree AST spec, so basic node structures are compatible even across different parsers.
| Parser | Language | Features |
|---|---|---|
| acorn | JS | Lightweight and fast. Default parser for webpack, eslint |
| @babel/parser | JS | Supports JSX, TypeScript, up to Stage 0 proposals. Provides ESTree compatibility mode |
| typescript | TS | Built-in parser for TypeScript compiler. Uses its own AST format (not ESTree) |
| SWC | Rust | Written in Rust. Dozens of times faster than Babel |
| oxc | Rust | Written in Rust. Parser from project aiming to replace ESLint |
Regardless of which parser you use, the basic "code → tokens → AST" pipeline structure is the same. However, they differ in supported grammar scope, performance, error recovery capabilities, etc.
Learn More
- If you want to learn about compilers, I recommend checking out The-super-tiny-compiler. It implements the simplest compiler example written in JavaScript.
- AST Explorer - Paste code to see AST immediately. You can also choose from multiple parsers.
- @babel/parser (formerly babylon)
Understanding AST Node Types
To work with AST, you need to understand the main node types. Based on the ESTree spec, JavaScript AST nodes are broadly divided into three categories.
Statement vs Expression
This is the most important distinction.
- Statement: performs an action but doesn't produce a value. Examples: if, for, return, variable declarations.
- Expression: produces a value. Examples: 1 + 2, foo(), a ? b : c.
// Statement: doesn't produce values (can't assign to variables)
if (true) { }
for (let i = 0; i < 10; i++) { }
// Expression: produces values (can assign to variables)
const x = 1 + 2
const y = condition ? 'a' : 'b'
const z = foo()
This distinction is important because it determines "what node types to look for" when traversing AST. For example, if you want to find all function calls, target CallExpression; if you want to find variable declarations, target VariableDeclaration (Statement).
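Since an AST is just nested plain objects, that kind of search fits in a dozen lines of JavaScript. Here's a hedged from-scratch sketch (not any library's API) that collects every node of a given type, run against the square-function AST from the beginning of the article:

```javascript
// Recursively collect every node whose type matches, from a plain AST object
function findNodesByType(node, type, found = []) {
  if (node === null || typeof node !== 'object') return found
  if (Array.isArray(node)) {
    node.forEach((child) => findNodesByType(child, type, found))
    return found
  }
  if (node.type === type) found.push(node)
  Object.keys(node).forEach((key) => findNodesByType(node[key], type, found))
  return found
}

// The core AST of the square function from earlier
const ast = {
  type: 'Program',
  body: [
    {
      type: 'FunctionDeclaration',
      id: { type: 'Identifier', name: 'square' },
      params: [{ type: 'Identifier', name: 'n' }],
      body: {
        type: 'BlockStatement',
        body: [
          {
            type: 'ReturnStatement',
            argument: {
              type: 'BinaryExpression',
              operator: '*',
              left: { type: 'Identifier', name: 'n' },
              right: { type: 'Identifier', name: 'n' },
            },
          },
        ],
      },
    },
  ],
}

findNodesByType(ast, 'Identifier').length // 4: square, the n parameter, and both n's in n * n
findNodesByType(ast, 'BinaryExpression').length // 1
```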
Major Node Types
Here are the commonly encountered node types organized with code examples:
// VariableDeclaration + VariableDeclarator
const x = 1
// { type: "VariableDeclaration", kind: "const",
// declarations: [{ type: "VariableDeclarator",
// id: Identifier("x"), init: NumericLiteral(1) }] }
// FunctionDeclaration
function foo(a, b) { return a + b }
// { type: "FunctionDeclaration", id: Identifier("foo"),
// params: [Identifier("a"), Identifier("b")],
// body: BlockStatement }
// ArrowFunctionExpression
const add = (a, b) => a + b
// { type: "ArrowFunctionExpression",
// params: [Identifier("a"), Identifier("b")],
// body: BinaryExpression("+") }
// CallExpression
console.log('hello')
// { type: "CallExpression",
// callee: MemberExpression(console, log),
// arguments: [StringLiteral("hello")] }
// MemberExpression
obj.prop
obj['prop']
// { type: "MemberExpression", object: Identifier("obj"),
// property: Identifier("prop"), computed: false | true }
// ConditionalExpression (ternary operator)
a ? b : c
// { type: "ConditionalExpression",
// test: Identifier("a"),
// consequent: Identifier("b"),
// alternate: Identifier("c") }
// IfStatement
if (condition) { doA() } else { doB() }
// { type: "IfStatement",
// test: Identifier("condition"),
// consequent: BlockStatement,
// alternate: BlockStatement }
Do you see the pattern? All nodes are distinguished by the type field, and each node type has predetermined properties. BinaryExpression has left, operator, right, and IfStatement has test, consequent, alternate. Understanding this structure makes working with AST-based tools much easier.
The complete ESTree spec can be found at estree/estree.
Use Case 1: Transpiling (Babel)
The most representative use case for AST is transpiling. Babel (https://babeljs.io/) is a JavaScript compiler that works in three main stages:
- Parsing: Convert code to AST
- Transforming: Traverse AST and transform it to desired form
- Generation: Output transformed AST back to code string
Parse & Generate
The most basic form is parsing and then generating code again.
import * as parser from '@babel/parser'
import generate from '@babel/generator'
const code = `const welcome = 'hello world'`
// 1. Code → AST
const ast = parser.parse(code)
// 2. AST → Code
const output = generate(ast)
console.log(output.code) // const welcome = 'hello world'
Looking at this alone might make you think "so what?" The key is transforming the AST between steps 1 and 2.
Traverse & Transform
Babel's real power lies in using @babel/traverse to traverse the AST and modify nodes. Let's look at a simple example. Code that changes all const to let:
import * as parser from '@babel/parser'
import _traverse from '@babel/traverse'
import _generate from '@babel/generator'
const traverse = _traverse.default
const generate = _generate.default
const code = `
const a = 1
const b = 2
`
const ast = parser.parse(code)
// Traverse AST and transform const → let
traverse(ast, {
VariableDeclaration(path) {
if (path.node.kind === 'const') {
path.node.kind = 'let'
}
},
})
const output = generate(ast)
console.log(output.code)
// let a = 1;
// let b = 2;
The keys of the object passed to traverse are exactly the AST node types. Every time a node of type VariableDeclaration is encountered, the callback is executed. This structure is called the visitor pattern, and almost all AST-based tools use this pattern.
Understanding the path Object
The path that the callback receives in the above example isn't just a simple node wrapper. It's an object that contains all position and relationship information within the AST tree.
traverse(ast, {
Identifier(path) {
path.node // The current AST node itself
path.parent // Parent node
path.parentPath // Parent's path object
path.scope // Current scope information
// Manipulation methods
path.replaceWith(newNode) // Replace current node with another node
path.remove() // Remove current node
path.insertBefore(newNode) // Insert new node before current node
path.insertAfter(newNode) // Insert new node after current node
// Search methods
path.findParent(p => p.isFunction()) // Find parent matching condition
path.getSibling(0) // Access sibling nodes
}
})
path.scope is also a powerful feature. You can track where variables are declared and where they're referenced.
traverse(ast, {
Identifier(path) {
const binding = path.scope.getBinding(path.node.name)
if (binding) {
console.log(binding.kind) // 'const', 'let', 'var', 'param', etc.
console.log(binding.referenced) // Whether it's referenced
console.log(binding.references) // Number of references
console.log(binding.referencePaths) // Reference locations
}
}
})
Because of these features, tasks like "finding unused variables" and "safely renaming variables" become possible.
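To see why that's useful, here is a deliberately naive from-scratch sketch of the unused-variable idea. This is not Babel's API and it cuts corners that path.scope handles correctly (nested scopes, shadowing, re-declaration); it just collects declared names and referenced names in one pass over a plain AST:

```javascript
// Simplifying assumption: one flat scope, no shadowing
function findUnusedVariables(ast) {
  const declared = new Set()
  const referenced = new Set()

  function walk(node) {
    if (node === null || typeof node !== 'object') return
    if (Array.isArray(node)) {
      node.forEach(walk)
      return
    }
    if (node.type === 'VariableDeclarator') {
      // The declarator's own id is a declaration, not a reference,
      // so we record it and only descend into the initializer
      declared.add(node.id.name)
      walk(node.init)
      return
    }
    if (node.type === 'Identifier') {
      referenced.add(node.name)
    }
    Object.keys(node).forEach((key) => {
      if (key !== 'type') walk(node[key])
    })
  }

  walk(ast)
  return [...declared].filter((name) => !referenced.has(name))
}

// AST for: const a = 1; const b = a
const ast = {
  type: 'Program',
  body: [
    {
      type: 'VariableDeclaration',
      kind: 'const',
      declarations: [
        {
          type: 'VariableDeclarator',
          id: { type: 'Identifier', name: 'a' },
          init: { type: 'NumericLiteral', value: 1 },
        },
      ],
    },
    {
      type: 'VariableDeclaration',
      kind: 'const',
      declarations: [
        {
          type: 'VariableDeclarator',
          id: { type: 'Identifier', name: 'b' },
          init: { type: 'Identifier', name: 'a' },
        },
      ],
    },
  ],
}

findUnusedVariables(ast) // ['b'] -- a is referenced by b's initializer, b is never read
```

The real value of path.scope is doing exactly this bookkeeping, but correctly, across arbitrarily nested functions and blocks.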
Babel Plugins
Let's look at one more practical example. A Babel plugin that removes all console.log:
// babel-plugin-remove-console.js
export default function () {
return {
visitor: {
CallExpression(path) {
const { callee } = path.node
if (
callee.type === 'MemberExpression' &&
callee.object.name === 'console' &&
callee.property.name === 'log'
) {
path.remove()
}
},
},
}
}
Among CallExpression nodes, it finds console.log calls and removes them with path.remove(). Actual plugins that remove console logs from production builds work this way.
For more details about babel, you can study at https://github.com/jamiebuilds/babel-handbook.
Use Case 2: Automated Code Refactoring (JSCodeShift)
The next use case we'll explore is JSCodeShift, which automatically refactors code. For example, let's say you want to perform this transformation:
// before
load().then(function (response) {
return response.data
})
// after
load().then((response) => response.data)
Since this isn't a simple find-and-replace, an ordinary text editor can't do it reliably. jscodeshift makes it possible.
jscodeshift is a toolkit for running codemods. The actual AST-based transformation happens in codemods. The basic idea is similar to the relationship between babel and its plugins.
Writing a codemod that performs the above transformation looks like this:
// function-to-arrow.js
export default function transformer(file, api) {
const j = api.jscodeshift
return j(file.source)
.find(j.FunctionExpression)
.replaceWith((path) => {
const { params, body } = path.node
// If body is just a return statement, make it a concise arrow function
if (
body.body.length === 1 &&
body.body[0].type === 'ReturnStatement'
) {
return j.arrowFunctionExpression(params, body.body[0].argument)
}
return j.arrowFunctionExpression(params, body)
})
.toSource()
}
npx jscodeshift -t function-to-arrow.js src/
Running this will transform all function expressions to arrow functions in all files under src/. Whether there are hundreds or thousands of files doesn't matter. This is where AST-based transformation shines in large-scale refactoring.
React also provides codemods for major version updates. react-codemod has codemods that automatically handle transformations like createClass → ES6 class, PropTypes separation, etc.
Use Case 3: Linting (ESLint)
ESLint also operates based on AST. It parses code into AST, then each rule uses the visitor pattern to traverse specific node types and find problems. The structure is almost identical to Babel plugins.
Let's create a simple custom rule. A rule that prohibits var usage:
// no-var.js
module.exports = {
meta: {
type: 'suggestion',
fixable: 'code',
},
create(context) {
return {
VariableDeclaration(node) {
if (node.kind === 'var') {
context.report({
node,
message: 'Use let or const instead of var.',
fix(fixer) {
return fixer.replaceTextRange(
[node.range[0], node.range[0] + 3],
'let',
)
},
})
}
},
}
},
}
Comparing with Babel plugin visitor structure, they're remarkably similar. The keys of the object returned by the create function are AST node types, and callbacks are executed every time a node of that type is encountered. The difference is that while Babel directly modifies the AST, ESLint reports problems with context.report(), and if auto-fixing is needed, it fixes at the text level through the fix function.
If you want to create custom ESLint rules yourself, also check out Creating My Own ESLint Rules.
Use Case 4: Code Formatting (Prettier)
Prettier also utilizes AST. It takes code, creates an AST, and outputs it again in a consistent style based on the AST. However, prettier has one more stage:
- Code → AST
- AST → IR (Intermediate Representation, called Doc)
- IR → Formatted code
Stage 2 is the key. While converting AST nodes to an intermediate representation called Doc, it includes formatting hints like "if this part fits on one line, keep it on one line; if not, split into multiple lines." Then an algorithm called printer traverses the Doc and determines optimal formatting considering overall line length.
Understanding is faster when you see what Doc actually looks like. The Doc for code like foo(arg1, arg2, arg3) conceptually has this structure:
group([
"foo(",
indent([
softline,
"arg1,",
line,
"arg2,",
line,
"arg3",
]),
softline,
")"
])
Here group means "put on one line if possible, but split into multiple lines if not." line becomes a space in single-line mode and a line break in multi-line mode. softline inserts nothing in single-line mode and a line break in multi-line mode.
Thanks to this structure, prettier can format the same code differently depending on the situation to fit printWidth.
// When it fits within printWidth → one line
foo(arg1, arg2, arg3)
// When it exceeds printWidth → multiple lines
foo(
arg1,
arg2,
arg3,
)
This decision isn't made by simply looking at string length, but by understanding the AST structure, which produces consistent results even in nested structures.
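The core of that decision fits in a few dozen lines. The following toy printer is an illustration of the concept, not prettier's actual implementation (real prettier also re-measures the remaining width on every line, among many other things): it tries to render a group flat, and if the flat form exceeds the width, it re-renders the group's contents in break mode:

```javascript
// Doc building blocks, mirroring the concepts described above
const line = { type: 'line' }
const softline = { type: 'softline' }
const group = (parts) => ({ type: 'group', parts })
const indent = (parts) => ({ type: 'indent', parts })

// Flat mode: line becomes a space, softline becomes nothing
function printFlat(doc) {
  if (typeof doc === 'string') return doc
  if (Array.isArray(doc)) return doc.map(printFlat).join('')
  if (doc.type === 'line') return ' '
  if (doc.type === 'softline') return ''
  return printFlat(doc.parts) // group and indent are transparent when flat
}

// Break mode: line and softline both become a newline plus indentation
function printBroken(doc, depth, width) {
  if (typeof doc === 'string') return doc
  if (Array.isArray(doc)) return doc.map((d) => printBroken(d, depth, width)).join('')
  if (doc.type === 'line' || doc.type === 'softline') return '\n' + '  '.repeat(depth)
  if (doc.type === 'indent') return printBroken(doc.parts, depth + 1, width)
  return printDoc(doc, depth, width) // a nested group gets its own fit check
}

function printDoc(doc, depth, width) {
  if (doc && doc.type === 'group') {
    const flat = printFlat(doc)
    if (flat.length <= width) return flat // fits: keep it on one line
    return printBroken(doc.parts, depth, width) // doesn't fit: break this group
  }
  return printBroken(doc, depth, width)
}

// The foo(arg1, arg2, arg3) Doc from above
const call = group([
  'foo(',
  indent([softline, 'arg1,', line, 'arg2,', line, 'arg3']),
  softline,
  ')',
])

printDoc(call, 0, 40) // 'foo(arg1, arg2, arg3)'
printDoc(call, 0, 10)
// foo(
//   arg1,
//   arg2,
//   arg3
// )
```

The same Doc produces both outputs; only the available width changes, which is exactly the behavior shown in the printWidth example above.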
If you want to learn more about prettier's algorithm, refer to Philip Wadler's paper A prettier printer.
Use Case 5: Code Visualization
AST also enables visual representation of code. js2flowchart is a library that converts JavaScript code to flowchart SVG.
The operating principle follows the same context we've examined so far:
- Code → AST
- AST → FlowTree (simplified tree with unnecessary nodes omitted)
- FlowTree → ShapesTree (visual type, position, and relationship information for each node)
- ShapesTree → SVG
Ultimately, the pattern of using AST as an intermediate representation to transform code into other forms is consistent.
Summary
The common pattern of the tools we've examined so far can be summarized like this:
Code (string)
  ↓ Lexical Analysis (tokenization)
Token list
  ↓ Syntax Analysis (parsing)
AST
  ↓ Transform / analyze / output
Result (new code, error report, SVG, etc.)
And tools that handle AST almost without exception use the visitor pattern. Babel, ESLint, and jscodeshift all have the same structure of passing an object with "interested node types as keys, callback functions as values."
// Babel plugin
{ visitor: { CallExpression(path) { ... } } }
// ESLint rule
{ create() { return { CallExpression(node) { ... } } } }
// jscodeshift
j(source).find(j.CallExpression).forEach(path => { ... })
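The whole pattern fits in a few lines of plain JavaScript. Here is a minimal from-scratch visitor for illustration, far simpler than @babel/traverse's path machinery but with the same shape: node types as keys, callbacks as values:

```javascript
// Recursively walk a plain AST object, dispatching on node.type
function traverse(node, visitor) {
  if (node === null || typeof node !== 'object') return
  if (Array.isArray(node)) {
    node.forEach((child) => traverse(child, visitor))
    return
  }
  // Visitor pattern: call the handler registered for this node's type, if any
  if (visitor[node.type]) visitor[node.type](node)
  Object.keys(node).forEach((key) => {
    if (key !== 'type') traverse(node[key], visitor)
  })
}

// Count function calls in a tiny hand-written AST for: foo(1)
const ast = {
  type: 'Program',
  body: [
    {
      type: 'ExpressionStatement',
      expression: {
        type: 'CallExpression',
        callee: { type: 'Identifier', name: 'foo' },
        arguments: [{ type: 'NumericLiteral', value: 1 }],
      },
    },
  ],
}

let calls = 0
traverse(ast, {
  CallExpression() {
    calls++
  },
})
calls // 1
```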
Ultimately, there's one core principle: When you treat code as structured data rather than strings, precise operations that are impossible with text replacement become possible. AST is the most universal method for creating that structured data.