APS/Aquas
Holistic MLIR-based ASIP hardware-software co-design framework
Aquas is a holistic MLIR-based framework for automated ASIP (Application-Specific Instruction-Set Processor) hardware-software co-design. On the hardware side, it enhances synthesis with a burst-capable DMA engine and HLS optimizations. On the software side, it introduces an e-graph-based retargetable compiler that automatically maps application code onto ISAXs (instruction set extensions).
┌────────────────────────────────────────────────────────────────────┐
│ Aquas Framework (MLIR) │
├────────────────────────────────────────────────────────────────────┤
│ │
│ CADL C (App) │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌─────────────────────────────────┐ │
│ │ Hardware Synthesizer │ │ Retargetable Compiler │ │
│ │ ┌────────────────┐ │ │ ┌───────────┐ ┌───────────┐ │ │
│ │ │ aquas dialect │ │ │ │ MLIR │◄─►│ e-graph │ │ │
│ │ │ + affine/scf │ │ │ └─────┬─────┘ └─────┬─────┘ │ │
│ │ └───────┬────────┘ │ │ │ │ │ │
│ │ │ optimize │ │ Internal External │ │
│ │ ▼ │ │ Rewrites Rewrites │ │
│ │ ┌────────────────┐ │ │ │ │ │ │
│ │ │ HECTOR (tor) │ │ │ └───────┬───────┘ │ │
│ │ │ + scheduling │ │ │ ▼ │ │
│ │ └───────┬────────┘ │ │ ┌─────────────────────────┐ │ │
│ │ │ │ │ │ Skeleton-Component │ │ │
│ │ ▼ │ │ │ Pattern Matching │ │ │
│ │ ┌────────────────┐ │ │ └───────────┬─────────────┘ │ │
│ │ │ RTL (CIRCT) │ │ │ ▼ │ │
│ │ └────────────────┘ │ │ LLVM IR → Binary │ │
│ └──────────────────────┘ └─────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Rocket/BOOM Core + RoCC Adapter │ │
│ │ ┌─────────┐ ┌─────────────────┐ ┌───────────────────────┐ │ │
│ │ │ L1I/D$ │ │ Burst DMA Engine│ │ Banked Scratchpad Mem │ │ │
│ │ └─────────┘ │ (TileLink-UH) │ │ (partition-aware) │ │ │
│ │ └─────────────────┘ └───────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
CADL with Optimization Directives
Aquas extends CADL with blockwise memory access and synthesis directives:
#[partition_array([0],[4],"C")]  // Cyclic partition into 4 banks
static mat: [i32; 16];
#[partition_array([0],[4],"C")]
static vec: [i32; 4];
rtype gemv(rs1: u5, rs2: u5, rd: u5) {
  let ia: u32 = _irf[rs1];
  let oa: u32 = _irf[rs2];
  mat[0+:] = _blockld[ia +:16];     // Burst load 16 elements
  vec[0+:] = _blockld[ia+64 +:4];   // Burst load 4 elements
  with i: u32 = (0, i+1) do {
    acc = 0;
    #[unroll(4)]                    // Fully unroll the inner loop
    with j: u32 = (0, j+1) do {
      acc += mat[i*4+j] * vec[j];
    } while (j + 1 < 4);
    res[i] = acc;
  } while (i + 1 < 4);
  _irf[rd] = 0;
}
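To make the intent of the directives concrete, here is a small Python reference model of the loop nest above, together with a check of which banks the unrolled inner loop touches. It is only a sketch: the helper names (`cyclic_bank`, `gemv_reference`, `banks_touched`) are hypothetical, 32-bit overflow is not modeled, and the bank mapping assumes that a cyclic partition places element `k` in bank `k mod 4`.

```python
def cyclic_bank(index: int, num_banks: int = 4) -> tuple[int, int]:
    """Map a flat element index to (bank, offset) under cyclic partitioning."""
    return index % num_banks, index // num_banks


def gemv_reference(mat: list[int], vec: list[int]) -> list[int]:
    """Software model of the loop nest: res[i] = sum_j mat[i*4 + j] * vec[j]."""
    res = []
    for i in range(4):
        acc = 0
        for j in range(4):  # this loop is what #[unroll(4)] flattens in hardware
            acc += mat[i * 4 + j] * vec[j]
        res.append(acc)
    return res


def banks_touched(i: int) -> list[int]:
    """Banks of `mat` read by the four unrolled multiplies of row i."""
    return [cyclic_bank(i * 4 + j)[0] for j in range(4)]


mat = list(range(16))
vec = [1, 2, 3, 4]
print(gemv_reference(mat, vec))               # [20, 60, 100, 140]
print([banks_touched(i) for i in range(4)])   # every row reads banks 0, 1, 2, 3
```

Under this mapping each row's four reads fall in four different banks, which is why a 4-way cyclic partition pairs naturally with a 4-way unroll: the unrolled multiplies need not compete for a single memory port.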
Fast Memory Access via DMA
Aquas synthesizes a burst-capable DMA engine to overcome memory bottlenecks:
| Access Method | Latency | Throughput | Use Case |
|---|---|---|---|
| RoCC port (single-shot) | 2-3 cycles/elem | Low | Small transfers |
| Burst DMA (TileLink-UH) | 15 cycles init | 1 elem/cycle sustained | Large blocks |
Implementation selection via ILP optimization:
min Σ t_bur(b)·x_bur,b + t_ss·x_ss
s.t. Σ b·x_bur,b + d_ss·x_ss ≥ D
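The selection problem above is small enough to solve exactly by hand or with a few lines of code. The sketch below reads the symbols as follows, which is an interpretation rather than something the text spells out: `x_bur,b` counts bursts of length `b`, `x_ss` counts single-shot transfers, `d_ss` is the number of elements per single-shot transfer, and `D` is the total number of elements to move. The cost figures (`15 + b` cycles per burst, 3 cycles per single-shot element) are illustrative values taken from the table, and the function name `select_transfers` is hypothetical. A dynamic program stands in for a general ILP solver.

```python
def select_transfers(D, burst_sizes, t_bur, t_ss, d_ss=1):
    """Minimise total cycles subject to covering at least D elements.

    best[d] holds (cost, plan) for the cheapest way to cover at least d
    elements, where plan maps an option name to how often it is used.
    """
    options = [("ss", d_ss, t_ss)]
    options += [(f"bur{b}", b, t_bur(b)) for b in burst_sizes]
    best = [(0, {})] + [None] * D
    for d in range(1, D + 1):
        candidates = []
        for name, size, cost in options:
            prev_cost, prev_plan = best[max(0, d - size)]
            plan = dict(prev_plan)
            plan[name] = plan.get(name, 0) + 1
            candidates.append((prev_cost + cost, plan))
        best[d] = min(candidates, key=lambda c: c[0])
    return best[D]


def burst_cost(b):
    # Assumed model: 15-cycle setup, then one element per cycle.
    return 15 + b


for D in (4, 16, 20):
    print(D, select_transfers(D, [4, 8, 16], burst_cost, t_ss=3))
```

With these assumed costs, 4 elements are cheapest as four single-shot transfers (12 cycles), 16 elements as one 16-element burst (31 cycles), and 20 elements as one burst plus four single-shot transfers (43 cycles).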
Partition-aware access: DMA distributes each 64-bit word across multiple banks in one cycle.
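As a rough illustration of partition-aware access, the following Python sketch scatters each 64-bit beat into a 4-bank, cyclically partitioned scratchpad of 32-bit elements. The lane order (low half first), the bank count, and the function names `scatter_beat` and `burst_load` are assumptions made for this example and do not describe the generated hardware.

```python
NUM_BANKS = 4
ELEM_BITS = 32
LANES = 64 // ELEM_BITS  # elements carried by one 64-bit beat


def scatter_beat(banks, word64, first_index):
    """Write the 32-bit lanes of one 64-bit beat into their banks."""
    mask = (1 << ELEM_BITS) - 1
    for lane in range(LANES):
        elem = (word64 >> (lane * ELEM_BITS)) & mask  # assumed: low lane first
        index = first_index + lane
        bank, offset = index % NUM_BANKS, index // NUM_BANKS
        banks[bank][offset] = elem


def burst_load(memory_words, num_elems):
    """Model a burst: one beat per step, each beat split across banks."""
    depth = -(-num_elems // NUM_BANKS)  # ceiling division
    banks = [[0] * depth for _ in range(NUM_BANKS)]
    for beat, word in enumerate(memory_words):
        scatter_beat(banks, word, beat * LANES)
    return banks


# Pack elements 0..15 into eight 64-bit words, two elements per word.
words = [(2 * k) | ((2 * k + 1) << 32) for k in range(8)]
for bank in burst_load(words, 16):
    print(bank)
```

Because the two elements of a beat have consecutive indices, they always map to different banks, so both writes can be issued in the same cycle without a port conflict.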
E-Graph-Based Retargetable Compiler
Bidirectional MLIR ↔ E-graph Translation
- MLIR → e-graph: Operations become e-nodes; blocks become tuple(...) of roots
- E-graph → MLIR: Witness extraction reconstructs SSA form
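The two directions can be sketched on a toy SSA block. This is not Aquas's translator: `TermTable`, `block_to_egraph`, and `to_ssa` are hypothetical names, operations are plain tuples instead of MLIR ops, and a "root" is assumed to be any operation whose value no other operation in the block consumes. Each e-class here holds a single e-node, so extraction is trivial.

```python
class TermTable:
    """Hash-consed e-nodes: structurally identical operations share one id."""

    def __init__(self):
        self.ids = {}    # e-node (op, child ids) -> id
        self.nodes = []  # id -> e-node

    def add(self, op, *children):
        node = (op, tuple(children))
        if node not in self.ids:
            self.ids[node] = len(self.nodes)
            self.nodes.append(node)
        return self.ids[node]


def block_to_egraph(block_args, ops):
    """Translate one block; `ops` is a list of (result, op, operands)."""
    table = TermTable()
    env = {name: table.add(f"arg {name}") for name in block_args}
    op_ids, used = [], set()
    for result, op, operands in ops:
        children = [env[name] for name in operands]
        used.update(children)
        node_id = table.add(op, *children)
        op_ids.append(node_id)
        if result is not None:
            env[result] = node_id
    # Assumption: a root is an operation whose value nothing else consumes.
    roots = [i for i in dict.fromkeys(op_ids) if i not in used]
    return table, table.add("tuple", *roots), env


def to_ssa(table):
    """Emit one SSA line per e-node; ids are already in def-before-use order."""
    lines = []
    for i, (op, children) in enumerate(table.nodes):
        args = ", ".join(f"%{c}" for c in children)
        lines.append(f"%{i} = {op}({args})")
    return lines


ops = [
    ("c2", "const 2", []),
    ("t0", "shl", ["x", "c2"]),
    ("t1", "shl", ["x", "c2"]),   # duplicate of t0, shares its e-node
    ("t2", "add", ["t0", "t1"]),
    (None, "store", ["t2", "p"]),
]
table, root, env = block_to_egraph(["x", "p"], ops)
print("\n".join(to_ssa(table)))
```

The duplicated shift collapses into one e-node on the way in, and the store, as the only unconsumed operation, becomes the single child of the block's `tuple` node.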
Hybrid Rewriting
| Rewrite Type | Mechanism | Purpose |
|---|---|---|
| Internal | Egglog fixpoint reasoning | Dataflow equivalences (e.g., x<<2 ⇝ x*4) |
| External | MLIR passes via e-graph | Control-flow transforms (tiling, unrolling) |
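The internal rewrite from the table can be illustrated with a hand-rolled e-graph in Python. This is a stand-in for illustration only and does not use Egglog or its API. The class `EGraph` and the rule function `shl2_to_mul4` are hypothetical, literals are modeled as leaf operators named `"2"` and `"4"`, and the rule is applied once rather than to a fixpoint. External rewrites (MLIR passes) are not modeled.

```python
class EGraph:
    def __init__(self):
        self.parent = []    # union-find over e-class ids
        self.hashcons = {}  # canonical e-node -> e-class id

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def add(self, op, *children):
        node = (op, tuple(self.find(c) for c in children))
        if node in self.hashcons:
            return self.find(self.hashcons[node])
        eid = len(self.parent)
        self.parent.append(eid)
        self.hashcons[node] = eid
        return eid

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
        return a

    def rebuild(self):
        """Restore congruence: merge nodes that became identical, until stable."""
        changed = True
        while changed:
            changed = False
            fresh = {}
            for (op, kids), eid in self.hashcons.items():
                node = (op, tuple(self.find(k) for k in kids))
                eid = self.find(eid)
                if node in fresh and self.find(fresh[node]) != eid:
                    self.union(fresh[node], eid)
                    changed = True
                fresh[node] = self.find(eid)
            self.hashcons = fresh


def shl2_to_mul4(eg):
    """Record that shl(x, 2) and mul(x, 4) are equivalent for every x."""
    two, four = eg.add("2"), eg.add("4")
    for (op, kids), eid in list(eg.hashcons.items()):
        if op == "shl" and eg.find(kids[1]) == eg.find(two):
            eg.union(eid, eg.add("mul", kids[0], four))
    eg.rebuild()


eg = EGraph()
x = eg.add("x")
a = eg.add("shl", x, eg.add("2"))
b = eg.add("mul", x, eg.add("4"))
na, nb = eg.add("neg", a), eg.add("neg", b)
shl2_to_mul4(eg)
print(eg.find(a) == eg.find(b), eg.find(na) == eg.find(nb))  # True True
```

The second comparison shows congruence at work: once the shift and the multiply share an e-class, the two negations built on top of them are merged as well, without a dedicated rule.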
Skeleton-Component Pattern Matching
ISAXs are decomposed into:
- Skeleton: Control structure (loop nesting, trip counts)
- Components: Dataflow patterns rooted at side-effect nodes
ISAX gemv:
Skeleton: for i { for j*2 { ... } }
Components: [yield(acc1,acc2), store(c_ptr)]
Matching engine: tag components via Egglog rules → skeleton matcher validates structure → emit ISAX node.
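A minimal sketch of the matching flow, under several assumptions: component tags are given as plain strings (in Aquas they would come from Egglog rules), `j*2` in the skeleton is read as an inner loop with step 2, and the trip counts are made-up values. The names `Loop`, `skeleton_matches`, `components_present`, and `match_isax` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    trip_count: int
    step: int = 1
    children: list = field(default_factory=list)    # nested Loop objects
    components: set = field(default_factory=set)    # tags found in this body


def skeleton_matches(pattern, candidate):
    """Compare control structure only: nesting, trip counts, steps."""
    if (pattern.trip_count, pattern.step) != (candidate.trip_count, candidate.step):
        return False
    if len(pattern.children) != len(candidate.children):
        return False
    return all(skeleton_matches(p, c)
               for p, c in zip(pattern.children, candidate.children))


def components_present(pattern, candidate):
    """Every component the ISAX requires must be tagged at the same level."""
    if not pattern.components <= candidate.components:
        return False
    return all(components_present(p, c)
               for p, c in zip(pattern.children, candidate.children))


def match_isax(name, pattern, candidate):
    """Return an ISAX marker when skeleton and components both match."""
    if skeleton_matches(pattern, candidate) and components_present(pattern, candidate):
        return ("isax", name)
    return None


gemv = Loop(4, components={"store(c_ptr)"},
            children=[Loop(4, step=2, components={"yield(acc1,acc2)"})])
app = Loop(4, components={"store(c_ptr)", "load(a_ptr)"},
           children=[Loop(4, step=2, components={"yield(acc1,acc2)"})])
print(match_isax("gemv", gemv, app))  # ('isax', 'gemv')
```

Checking the skeleton separately from the components keeps the structural test cheap, so dataflow tagging only has to answer which patterns occur in which loop body.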
Hardware Synthesis Flow
CADL ──► Pre-Opt MLIR ──► Optimize ──► Schedule ──► FIRRTL ──► Verilog
│ │ │
│ (affine │ (modulo └──► CIRCT
│ raises) │ sched)
│ │
└── aquas dialect ops: readrf, writerf, blockload, memstore
Dynamic pipeline elaboration: Each stage is a transaction with valid-ready handshakes. Loops decompose into entry/body/next transactions.
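The valid-ready discipline can be illustrated with a cycle-level Python model. It is a sketch of the handshake only: each stage is assumed to hold at most one item, a stage that drains in a cycle may accept a new item in that same cycle, and the decomposition of loops into entry/body/next transactions is not modeled. The function `simulate` and its parameters are hypothetical.

```python
def simulate(items, num_stages, sink_ready, max_cycles=1000):
    """Push items through a pipeline; return (cycle, item) pairs at the sink.

    A slot holding None has valid low. A transfer happens only when the
    producer holds an item (valid) and the consumer slot is free (ready).
    """
    stages = [None] * num_stages
    pending = list(items)
    received = []
    cycle = 0
    while (pending or any(s is not None for s in stages)) and cycle < max_cycles:
        # Sink handshake: the last stage fires only if the sink is ready.
        if stages[-1] is not None and sink_ready(cycle):
            received.append((cycle, stages[-1]))
            stages[-1] = None
        # Internal handshakes, back to front so freed slots refill this cycle.
        for k in range(num_stages - 1, 0, -1):
            if stages[k] is None and stages[k - 1] is not None:
                stages[k], stages[k - 1] = stages[k - 1], None
        # Source handshake: inject when the first stage is free.
        if stages[0] is None and pending:
            stages[0] = pending.pop(0)
        cycle += 1
    return received


print(simulate("abc", 2, lambda c: True))        # one item per cycle once full
print(simulate("abc", 2, lambda c: c % 2 == 0))  # back-pressure stalls, no loss
```

When the sink is ready only every other cycle, items arrive at half the rate but in the same order and without being dropped, which is the property the handshake is meant to guarantee.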
Links
- Project Website with tutorials and documentation
- GitHub
- Built on MLIR, CIRCT, and HECTOR