(Paper Reading) ACT, Automatically Generating Compiler Backends from Tensor Accelerator ISA Descriptions


@misc{act-arxiv,
  title = {ACT: Automatically Generating Compiler Backends from Tensor Accelerator ISA Descriptions},
  author = {Jain, Devansh and Pardeshi, Akash and Frigo, Marco and Patel, Krut and Khulbe, Kaustubh and Arora, Jai and Mendis, Charith},
  year = {2025},
  eprint = {2510.09932},
  archiveprefix = {arXiv},
  primaryclass = {cs.PL},
  url = {https://arxiv.org/abs/2510.09932},
  doi = {10.48550/arXiv.2510.09932},
  month = oct
}

This paper introduces the ACT Backend Generator, which “automatically generates sound and complete compiler backends just from TAIDL specification”. It is the core component of the ACT (Accelerator Compiler Toolkit) Ecosystem, which derives complete compiler backends and essential software tooling from high-level ISA specifications of tensor accelerators.

This work is very interesting; it is exactly the work I want to do! Suppose we have a set of RISC-V extension instructions already described: an automatically generated compiler is the key to using them properly and realizing the intended acceleration. We have done the APS (ICCAD ’25) work, and a powerful compiler is definitely the missing piece now.

The Big Picture: What Problem is ACT Solving?

Imagine you are a hardware architect and you’ve just designed a new, super-fast chip (a “tensor accelerator”) specifically for running AI models. For anyone to use your chip, they need a compiler that can translate standard AI programs (like those written in TensorFlow or PyTorch) into the specific machine code your chip understands.

The Problem: Writing this compiler backend is incredibly difficult, time-consuming, and needs to be redone for every new accelerator design. Most new accelerators either don’t get a proper compiler, forcing people to write code by hand, or they get a very limited one that only supports a few operations. This severely limits their adoption.

ACT’s Solution: ACT is a “compiler-compiler”. You don’t write the compiler backend yourself. Instead, you provide ACT with a high-level description of your new accelerator’s ISA: what memory it has and what instructions it can execute. From this description alone, ACT automatically generates a complete, correct, and efficient compiler backend for you.


Formal Things

Let’s think about the formal things from the paper in an intuitive style.

1. How to Describe a Tensor Accelerator? (ISA Description)

ACT needs a “blueprint” of the accelerator. This blueprint is exactly TAIDL, the ISA description language presented at MICRO ’25.

This blueprint has two parts:

  • Data Model (D^H): This describes the memory hierarchy. Think of it as listing the available storage. For the paper’s example accelerator, H_QKV, this is:
    • d0: A big, main memory (Global Memory).
    • d1: A small, fast scratchpad memory (Main Scratchpad).
    • d2: Another even smaller, faster buffer (Intermediate Buffer).
  • Instruction Set (Θ^H): This is the list of commands the accelerator understands. The key insight of ACT is how it describes what these commands do. Instead of inventing a new language, it describes them using the same basic tensor operators (like reshape, dot, transpose) from XLA-HLO IR.

    • Example: The load_rm instruction. This instruction moves data from main memory (d0) to the scratchpad (d1). The paper doesn’t just treat it as a black box. It formally describes its behavior as y = bitcvt(reshape(x1)). This means the instruction takes a 1D stream of bytes (x1), reshapes it into a 2D matrix, and bit-casts the data type.
    • Parameterized Instructions: Notice that load_rm is parameterized. It can load a variable number of rows (n). This is a huge challenge that ACT is designed to handle. n is a “computational attribute” (α), while the memory addresses it reads from/writes to are “addressing attributes” (β).
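
To make this concrete, here is a minimal Python sketch of the idea (not TAIDL’s actual syntax): a data model as a table of memories, plus load_rm’s semantics written with plain tensor operations. The memory sizes and the fp16 element type are my own illustrative assumptions.

import numpy as np

# Data model D^H: named memories (sizes made up for illustration).
data_model = {
    "d0": {"kind": "global memory",       "bytes": 1 << 20},
    "d1": {"kind": "main scratchpad",     "bytes": 64 * 1024},
    "d2": {"kind": "intermediate buffer", "bytes": 16 * 1024},
}

def load_rm(x1, n):
    """Semantics of load_rm as y = bitcvt(reshape(x1)).

    x1 is a 1-D byte stream read from d0; n is the computational
    attribute (α), the number of rows. Where the bytes come from and
    where the rows land are addressing attributes (β), which deliberately
    play no part in the computational semantics.
    """
    assert x1.dtype == np.uint8 and x1.size % (2 * n) == 0
    rows = x1.reshape(n, x1.size // n)   # reshape: 1-D bytes -> 2-D matrix
    return rows.view(np.float16)         # bitcvt: reinterpret bytes as fp16

x = np.arange(256, dtype=np.uint8)
print(load_rm(x, 4).shape)               # (4, 32): 4 rows of 32 fp16 values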

Here, the interesting pieces are the “computational attributes” (α) and the “addressing attributes” (β). Together they form the “design space” of the tensor compilation problem considered in this paper.

But, are they truly general enough?

2. What Does It Mean for Compiled Code to Be “Correct”? (Semantic Equivalence, ≡R)

Although the paper treats “correctness” as the most critical question, I am more interested in the compiler than in the verification, so I won’t describe the details here and will only summarize the key ideas.

  • The Intuitive Idea: Imagine you have the original program (a computation graph G) and the final assembly code (ASM). To check if they are equivalent, you perform the following thought experiment (visualized in Figure 5):
    1. Take some input tensors (X).
    2. LHS (Left-Hand Side): Run the assembly code. This involves loading the inputs into the accelerator’s main memory, executing the instruction stream, and then reading the final result back from memory.
    3. RHS (Right-Hand Side): Run the original high-level graph G on the same inputs to get the “golden” output tensor.
    4. Comparison: Flatten both the final output from the LHS and the golden output from the RHS into a single, long stream of bytes. If these two byte streams are absolutely identical for any possible input X, then the assembly code is semantically equivalent to the original graph.
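
A toy, self-contained illustration of this check (my own sketch; the “assembly side” is just simulated with NumPy rather than a real instruction stream):

import numpy as np

def bflat(t):
    # Byte-flatten: serialize a tensor into one contiguous byte stream.
    return np.ascontiguousarray(t).tobytes()

def run_graph(x):
    # RHS: the "golden" high-level graph G, here bitcvt(reshape(x)).
    return x.reshape(4, 64).view(np.float16)

def run_asm(x):
    # LHS: a pretend compiled implementation that stages data through
    # a simulated scratchpad before converting.
    scratch = x.copy().reshape(4, 64)    # load into d1 (simulated)
    return scratch.view(np.float16)      # read back and reinterpret

x = np.arange(256, dtype=np.uint8)       # one sample input X
assert bflat(run_asm(x)) == bflat(run_graph(x))   # byte streams must match exactly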

Here, byte-flattening (bflat) looks kind of verbose…

3. What Makes a Good Compiler Backend Generator? (Soundness & Completeness)

  • Soundness: “If ACT generates a compiler backend, and that backend successfully compiles a program, the resulting assembly code is guaranteed to be correct.” It will never produce a wrong answer. This is the bare minimum for a compiler.
  • Completeness: “If it is possible to write correct assembly code for a given program on a given accelerator, the backend generated by ACT is guaranteed to find a solution.” It won’t fail and give up if a valid program exists. This is a very strong and desirable property.
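
In symbols (my condensed paraphrase, reusing the paper’s ≡R notation, not its exact statement):

  • Soundness: compile(G) = ASM ⟹ ASM ≡R G.
  • Completeness: (∃ ASM such that ASM ≡R G) ⟹ compile(G) does not fail.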

This is what the paper wants to achieve.


The Compilation Flow in Action

Now, let’s walk through how ACT actually compiles a program, using the paper’s running example: compiling the G_QKV kernel (from Fig. 2) for the H_QKV accelerator (from Fig. 3). The process is shown in Figure 8.

The compilation is a two-phase process:

  1. Phase 1: Instruction Selection. What is the sequence of accelerator instructions we should use?
  2. Phase 2: Memory Allocation. Where in the scratchpads should we store the intermediate data?

Phase 1: Instruction Selection (Finding the “What”, or solving α)

This phase’s goal is to convert the high-level tensor graph into a graph of accelerator instructions. It uses a powerful technique called Equality Saturation with an e-graph.

  • What is an E-Graph? An e-graph is a data structure that efficiently stores a massive collection of programs that are all semantically equivalent to each other.

  • Step 1: E-Graph Initializer (§5.1)
    • Action: The initial G_QKV computation graph is loaded into the e-graph.
    • Example: Figure 9(a) shows the initial e-graph. At this point, it just contains the original program’s operators (reshape, bitcvt, etc.).
  • Step 2: Rewrite Applier / E-Graph Explorer (§5.2)
    • Action: This is the core of instruction selection. ACT automatically generates rewrite rules from the ISA description.
      • An “IR-to-ISA” rule is created for each instruction. For example: bitcvt(reshape(x)) can be rewritten as load_rm(x). Another rule might be that dot(x1, x2) can be rewritten as gemm(x1, x2). (A toy sketch of applying one such rule appears right after this list.)
    • ACT repeatedly applies these rules to the e-graph, discovering new, equivalent ways to compute the same results using the accelerator’s instructions.
    • Example: In Figure 9(b), you can see the e-graph has been “explored”. New nodes for load_rm, transpose (which could map to load_cm), gemm, and mov have been added. The e-graph now contains both the original implementation and several possible implementations using the H_QKV instruction set.
  • Step 3: Graph Extractor (§5.3)
    • Action: After exploring the possibilities, ACT extracts one valid “plan” from the e-graph. This plan is a graph made up entirely of accelerator instructions. It’s called a pii graph (partially instantiated instruction graph) because we know what instructions to run (e.g., load_rm), but we don’t yet know the exact memory addresses (the ? in load_rm⟨64, ?⟩).
    • Example: The pii graph shown in Figure 10(a) is one such plan extracted from the explored e-graph of Figure 9(b). It’s a concrete sequence of operations for the H_QKV accelerator.
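
To ground the IR-to-ISA rewriting, here is a toy, hand-rolled version of the bitcvt(reshape(x)) → load_rm(x) rule applied to a plain expression tree. A real e-graph keeps all equivalent forms simultaneously instead of rewriting destructively; all names below are my own illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    name: str
    args: tuple

def apply_ir_to_isa(e):
    """One IR-to-ISA rewrite: bitcvt(reshape(x)) -> load_rm(x)."""
    if not isinstance(e, Op):
        return e                           # leaves (tensor names) stay as-is
    e = Op(e.name, tuple(apply_ir_to_isa(a) for a in e.args))
    inner = e.args[0] if e.args else None
    if e.name == "bitcvt" and isinstance(inner, Op) and inner.name == "reshape":
        return Op("load_rm", inner.args)   # the two IR ops fuse into one instruction
    return e

# A G_QKV-style fragment: dot(bitcvt(reshape(x1)), x2)
g = Op("dot", (Op("bitcvt", (Op("reshape", ("x1",)),)), "x2"))
print(apply_ir_to_isa(g))                  # -> dot(load_rm(x1), x2)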

Phase 2: Memory Allocation (Finding the “Where”, or solving β)

Now that we have a plan (the pii graph), we need to figure out where to place all the intermediate tensors in the scratchpads (d1, d2).

  • Step 4: Topological Ordering Generator (§6.1)
    • Action: The instructions in the pii graph must be executed in an order that respects data dependencies (you can’t use a result before it’s computed). There can be multiple valid orders. ACT chooses an order that tries to minimize the amount of memory needed at any given time, which makes the next step more likely to succeed.
    • Example: The circled numbers (① through ⑨) in Figure 10(a) show the chosen execution order.
  • Step 5: Constraint Satisfaction Problem Generator & Solver (§6.2)
    • Action: This is like solving a complex logic puzzle. ACT generates a set of rules (constraints) that a valid memory allocation must obey:
      1. Liveness: If two tensors need to exist in memory at the same time, they cannot be allocated to overlapping memory regions.
      2. ISA Constraints: Each instruction has rules about where its inputs can come from and where its output can go. For example, the gemm instruction might require its inputs to be in scratchpad d1 and its output to be in buffer d2.
      3. Dataflow: The output address of one instruction must match the input address for the next instruction that uses its data.
    • ACT feeds all these rules to a CP-SAT solver, a generic tool for solving these kinds of puzzles. The solver’s job is to find values for all the unknown address attributes (the ?s). (A minimal CP-SAT sketch appears right after this list.)
    • Example: The solver figures out the exact data slices for each operation, shown by the purple annotations in Figure 10(a). For instance, it determines the first load_rm should write to d1[0:64].
  • Step 6: Code Emitter (§6.3)
    • Action: The solver returns a complete, valid solution. The final step is to simply print out the instructions with their now-known attributes into an assembly file.
    • Example: The final, complete assembly code is shown in Figure 10(b). The plan from Phase 1 has been successfully turned into a concrete, executable program.
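
To make the Step-5 puzzle concrete, here is a minimal sketch using Google OR-Tools CP-SAT: place two simultaneously live 64-byte tensors in a 128-byte scratchpad d1 without overlap. The sizes and capacity are my own toy numbers; ACT’s real constraint systems are far richer (ISA placement rules, dataflow equalities across instructions).

from ortools.sat.python import cp_model

model = cp_model.CpModel()
D1_SIZE = 128
tensors = {"t_load": 64, "t_mov": 64}     # name -> size in bytes, both live at once

starts, intervals = {}, []
for name, size in tensors.items():
    start = model.NewIntVar(0, D1_SIZE - size, f"{name}_addr")
    end = model.NewIntVar(0, D1_SIZE, f"{name}_end")
    intervals.append(model.NewIntervalVar(start, size, end, f"{name}_iv"))
    starts[name] = start

model.AddNoOverlap(intervals)             # liveness: live tensors must not overlap

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    for name, var in starts.items():
        lo = solver.Value(var)
        print(f"{name} -> d1[{lo}:{lo + tensors[name]}]")

One feasible assignment mirrors the paper’s example, e.g. the first tensor at d1[0:64] and the second at d1[64:128].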

An Intuitive Guide to the Proofs

This section proves that the whole process is sound and complete.

Proving Soundness (“Our output is always correct”)

The proof relies on chaining together two key ideas (Lemma 1 and Lemma 3):

  1. Phase 1 is Sound: The rewrite rules used in the e-graph only add truly equivalent computations. So, the pii graph extracted at the end of Phase 1 is guaranteed to be semantically equivalent to the original program.
  2. Phase 2 is Sound: The memory allocation phase only fills in addresses; it doesn’t change the computation itself. The way it constructs the final assembly from the pii graph is proven to preserve the program’s meaning.

Since both phases are sound, the entire compilation from the original graph to the final assembly is sound. What goes in is what comes out, just compiled.

Proving Completeness (“If a solution exists, we will find it”)

This is more complex but follows a beautiful logical argument (combining Lemmas 2, 4, and 5):

  1. First, prove that any possible valid assembly program can be mapped back to a pii graph (Lemma 5). This means our intermediate “plan” format is expressive enough to capture any solution.
  2. Next, prove that our exploration in Phase 1 is guaranteed to find every possible pii graph, including the one that corresponds to the solution (Lemma 2). It uses a bounded enumeration strategy that ensures it will eventually find any valid plan, no matter how complex.
  3. Finally, prove that if a pii graph has a valid memory allocation, Phase 2 is guaranteed to find it (Lemma 4). This is because it considers all possible instruction orderings and the CP-SAT solver is complete for the constraints given.

The final argument: If a valid assembly program exists, then by (1) a corresponding pii graph must exist. By (2), our Phase 1 will eventually find this pii graph. And by (3), our Phase 2 will successfully find a memory allocation for it. Therefore, the compiler will successfully produce a valid output. ACT will not fail if a solution is possible.

But how long will achieving “completeness” take? There seem to be no guarantees on compilation time.

Thinking

  • The story is cool. We always want automatic compiler generation.
  • The problem is “what is the limit of the automatic generation” and “how good will the automatic one be”?
    • I think ACT is clever to focus on problems that are not overly complex: just Allo-QKV / Gemini / Intel AMX. The experiments already look rich, and within that range ACT is a good solution.
    • Can ACT handle more complex tasks? I think tiling, data layout, fusion, communication, etc. are also non-negligible. Tiling appears in the paper but is not explained there; I need to run the code to confirm.