APS/Aquas

Holistic MLIR-based ASIP hardware-software co-design framework

Aquas is a holistic MLIR-based framework for automated hardware-software co-design of ASIPs (application-specific instruction-set processors). On the hardware side, it extends synthesis with a burst-capable DMA engine and HLS optimizations. On the software side, it introduces an e-graph-based retargetable compiler that automatically maps application code onto ISAXs (instruction set architecture extensions).

+------------------------------------------------------------------+
|                      Aquas Framework (MLIR)                      |
+------------------------------------------------------------------+
|                                                                  |
|  CADL                              C (App)                       |
|    |                                  |                          |
|    v                                  v                          |
|  +---------------------+    +--------------------------------+   |
|  | Hardware Synthesizer|    |   Retargetable Compiler        |   |
|  | +-----------------+ |    |  +----------+   +----------+   |   |
|  | | aquas dialect   | |    |  |   MLIR   |<->|  e-graph |   |   |
|  | | + affine/scf    | |    |  +-----+----+   +-----+----+   |   |
|  | +--------+--------+ |    |        |              |        |   |
|  |          | optimize |    |   Internal      External       |   |
|  |          v          |    |   Rewrites      Rewrites       |   |
|  | +-----------------+ |    |        |              |        |   |
|  | | HECTOR (tor)    | |    |        +------+-------+        |   |
|  | | + scheduling    | |    |               v                |   |
|  | +--------+--------+ |    |  +------------------------+    |   |
|  |          |          |    |  | Skeleton-Component     |    |   |
|  |          v          |    |  | Pattern Matching       |    |   |
|  | +-----------------+ |    |  +-----------+------------+    |   |
|  | | RTL (CIRCT)     | |    |              v                 |   |
|  | +-----------------+ |    |        LLVM IR -> Binary       |   |
|  +---------------------+    +--------------------------------+   |
|             |                             |                      |
|             v                             v                      |
|  +--------------------------------------------------------+     |
|  |             Rocket/BOOM Core + RoCC Adapter            |     |
|  |  +--------+  +----------------+  +------------------+  |     |
|  |  | L1I/D$ |  |Burst DMA Engine|  |Banked Scratchpad |  |     |
|  |  +--------+  | (TileLink-UH)  |  | (partition-aware)|  |     |
|  |              +----------------+  +------------------+  |     |
|  +--------------------------------------------------------+     |
+------------------------------------------------------------------+

CADL with Optimization Directives

Aquas extends CADL with blockwise memory access and synthesis directives:

#[partition_array([0],[4],"C")]      // Cyclic partition into 4 banks
static mat: [i32; 16];
#[partition_array([0],[4],"C")]
static vec: [i32; 4];

rtype gemv(rs1: u5, rs2: u5, rd: u5) {
    let ia: u32 = _irf[rs1];
    let oa: u32 = _irf[rs2];
    mat[0+:] = _blockld[ia +:16];    // Burst load 16 elements
    vec[0+:] = _blockld[ia+64 +:4];  // Burst load 4 elements
    with i: u32 = (0, i+1) do {
        acc = 0;
        #[unroll(4)]                  // Full unroll inner loop
        with j: u32 = (0, j+1) do {
            acc += mat[i*4+j] * vec[j];
        } while (j + 1 < 4);
        res[i] = acc;
    } while (i + 1 < 4);
    _irf[rd] = 0;
}
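
For reference, the arithmetic performed by the gemv listing can be modeled in a few lines of Python. This is only an illustrative sketch and not part of Aquas: the name `gemv_reference` is hypothetical, `mat` is assumed to be stored row-major, and integer overflow of the i32 values is ignored.

```python
def gemv_reference(mat, vec):
    """Software model of the 4x4 matrix-vector product in the CADL listing.

    mat: 16 integers, row-major (row i occupies mat[i*4 : i*4 + 4]).
    vec: 4 integers.
    Integer overflow is ignored here; the hardware operates on i32 values.
    """
    assert len(mat) == 16 and len(vec) == 4
    res = []
    for i in range(4):          # outer loop over rows
        acc = 0
        for j in range(4):      # inner loop, fully unrolled in hardware
            acc += mat[i * 4 + j] * vec[j]
        res.append(acc)
    return res


identity = [1, 0, 0, 0,
            0, 1, 0, 0,
            0, 0, 1, 0,
            0, 0, 0, 1]
print(gemv_reference(identity, [5, 6, 7, 8]))
```

A model like this can serve as a golden reference when checking the synthesized instruction in simulation.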

Fast Memory Access via DMA

Aquas synthesizes a burst-capable DMA engine to overcome memory bottlenecks:

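+-------------------------+-----------------+------------------------+-----------------+
| Access Method           | Latency         | Throughput             | Use Case        |
+-------------------------+-----------------+------------------------+-----------------+
| RoCC port (single-shot) | 2-3 cycles/elem | Low                    | Small transfers |
| Burst DMA (TileLink-UH) | 15 cycles init  | 1 elem/cycle sustained | Large blocks    |
+-------------------------+-----------------+------------------------+-----------------+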

Implementation selection via ILP optimization:

min  Σ t_bur(b)·x_bur,b + t_ss·x_ss
s.t. Σ b·x_bur,b + d_ss·x_ss ≥ D
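
One plausible reading of this formulation is that x_bur,b counts bursts of length b, x_ss counts single-shot accesses, t_bur(b) and t_ss are their latencies, d_ss is the data moved per single-shot access, and D is the total data to transfer. The sketch below solves a small instance under that reading. Everything specific in it is an assumption for illustration: the burst lengths, the cost model (15 + b cycles per burst, 3 cycles per single-shot element, loosely taken from the table above), and the use of dynamic programming in place of an ILP solver.

```python
def select_transfers(total, burst_sizes=(2, 4, 8, 16),
                     burst_init=15, ss_cycles=3, ss_elems=1):
    """Pick transfer counts that cover `total` elements at minimum cycle cost.

    Assumed cost model (illustrative only):
      burst of b elements -> burst_init + b cycles
      single-shot access  -> ss_cycles cycles for ss_elems elements
    Returns (cost, plan), where plan maps "ss" or a burst size to a count.
    """
    options = [("ss", ss_elems, ss_cycles)]
    options += [(b, b, burst_init + b) for b in burst_sizes]

    # best[d] = (cost, plan) for covering at least d elements.
    best = [(0, {})] + [None] * total
    for d in range(1, total + 1):
        for name, elems, cycles in options:
            prev_cost, prev_plan = best[max(0, d - elems)]
            cand = prev_cost + cycles
            if best[d] is None or cand < best[d][0]:
                plan = dict(prev_plan)
                plan[name] = plan.get(name, 0) + 1
                best[d] = (cand, plan)
    return best[total]


for d in (4, 8, 16):
    print(d, select_transfers(d))
```

Under these assumed costs, single-shot accesses win for 4 elements, while a single burst becomes cheaper from 8 elements upward, which matches the "small transfers" versus "large blocks" split in the table.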

Partition-aware access: DMA distributes each 64-bit word across multiple banks in one cycle.
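
To make the bank distribution concrete, the following sketch maps the elements of each incoming 64-bit word to scratchpad banks. It assumes cyclic partitioning into 4 banks and two 32-bit elements per word, mirroring the i32 arrays in the CADL listing; the function names are hypothetical and the actual hardware addressing may differ.

```python
def cyclic_bank(index, banks=4):
    """Map an element index to (bank, offset) under cyclic partitioning."""
    return index % banks, index // banks


def scatter_word(word_index, elems_per_word=2, banks=4):
    """List the (bank, offset) slots written by one incoming memory word.

    Assumes 32-bit elements packed two per 64-bit word; both values are
    illustrative.
    """
    first = word_index * elems_per_word
    return [cyclic_bank(first + k, banks) for k in range(elems_per_word)]


for w in range(3):
    slots = scatter_word(w)
    # Distinct banks mean the writes do not conflict in the same cycle.
    assert len({bank for bank, _ in slots}) == len(slots)
    print(w, slots)
```

Because consecutive elements land in different banks, both halves of a word can be written in the same cycle without a bank conflict.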

E-Graph-Based Retargetable Compiler

Bidirectional MLIR ↔ E-graph Translation

  • MLIR → e-graph: Operations become e-nodes; blocks become tuple(...) of roots
  • E-graph → MLIR: Witness extraction reconstructs SSA form
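
The MLIR → e-graph direction can be illustrated with a toy hash-consed graph. This is a minimal sketch, not the Aquas implementation: the `EGraph` class, the operation names, and the example block are all hypothetical, and the sketch omits e-class merging and witness tracking.

```python
class EGraph:
    """Tiny hash-consed e-graph: one e-class id per distinct e-node."""

    def __init__(self):
        self.hashcons = {}   # (op, child ids...) -> e-class id
        self.nodes = []      # e-class id -> e-node

    def add(self, op, *children):
        node = (op, *children)
        if node not in self.hashcons:
            self.hashcons[node] = len(self.nodes)
            self.nodes.append(node)
        return self.hashcons[node]

    def extract(self, eclass):
        """Rebuild a nested expression from an e-class id."""
        op, *children = self.nodes[eclass]
        return (op, *[self.extract(c) for c in children])


# Hypothetical block:  %0 = arg x ; %1 = shl %0, 2 ; %2 = add %1, %1 ; store %2
g = EGraph()
x = g.add("arg:x")
two = g.add("const:2")
shl = g.add("shl", x, two)
total = g.add("add", shl, shl)
store = g.add("store", total)
block = g.add("tuple", store)    # block = tuple of its root operations

assert g.add("shl", x, two) == shl   # identical ops share one e-node
print(g.extract(block))
```

Here the block is represented by a single `tuple` node over its side-effecting roots, and `extract` stands in for the reverse translation back to a tree of operations.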

Hybrid Rewriting

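+--------------+---------------------------+---------------------------------------------+
| Rewrite Type | Mechanism                 | Purpose                                     |
+--------------+---------------------------+---------------------------------------------+
| Internal     | Egglog fixpoint reasoning | Dataflow equivalences (e.g., x<<2 ⇝ x*4)    |
| External     | MLIR passes via e-graph   | Control-flow transforms (tiling, unrolling) |
+--------------+---------------------------+---------------------------------------------+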
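
The idea behind internal rewrites, applying equivalences until nothing new is learned, can be sketched without Egglog. The code below is a simplified stand-in under stated assumptions: terms are nested tuples, the two rules are hypothetical examples, rules are applied only at the root of a term, and a plain set of terms replaces a real e-graph.

```python
def rewrite_shl_to_mul(term):
    """x << k  ->  x * 2**k for a constant shift amount k."""
    if isinstance(term, tuple) and term[0] == "shl" and isinstance(term[2], int):
        return ("mul", term[1], 2 ** term[2])
    return None


def rewrite_mul_commute(term):
    """a * b  ->  b * a."""
    if isinstance(term, tuple) and term[0] == "mul":
        return ("mul", term[2], term[1])
    return None


def saturate(term, rules):
    """Grow the set of terms equal to `term` until no rule adds a new one."""
    equal = {term}
    changed = True
    while changed:
        changed = False
        for t in list(equal):
            for rule in rules:
                new = rule(t)
                if new is not None and new not in equal:
                    equal.add(new)
                    changed = True
    return equal


forms = saturate(("shl", "x", 2), [rewrite_shl_to_mul, rewrite_mul_commute])
assert ("mul", "x", 4) in forms and ("mul", 4, "x") in forms
print(sorted(forms, key=str))
```

Keeping every equivalent form, rather than committing to one, is what lets a later pattern matcher recognize an instruction regardless of how the source expressed the computation.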

Skeleton-Component Pattern Matching

ISAXs are decomposed into:

  • Skeleton: Control structure (loop nesting, trip counts)
  • Components: Dataflow patterns rooted at side-effect nodes

ISAX gemv:
  Skeleton: for i { for j*2 { ... } }
  Components: [yield(acc1,acc2), store(c_ptr)]

Matching engine: tag components via Egglog rules → skeleton matcher validates structure → emit ISAX node.
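
The three matching steps can be sketched as follows. This is a hedged illustration only: the `Isax` and `Region` types, the `match` function, and the trip counts `(4, 2)` are hypothetical, and component tagging is assumed to have happened already rather than being derived by Egglog rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isax:
    name: str
    skeleton: tuple        # loop nest as trip counts, outer to inner
    components: frozenset  # tags of required dataflow components


@dataclass(frozen=True)
class Region:
    loops: tuple           # trip counts of the candidate loop nest
    tags: frozenset        # component tags found inside the nest


def match(region, isax):
    """Return an ISAX node if the region matches, else None."""
    # Step 1: every required component must already be tagged in the region.
    if not isax.components <= region.tags:
        return None
    # Step 2: the skeleton matcher validates the control structure.
    if region.loops != isax.skeleton:
        return None
    # Step 3: emit a node standing for the custom instruction.
    return ("isax", isax.name)


gemv = Isax("gemv", skeleton=(4, 2),
            components=frozenset({"yield(acc1,acc2)", "store(c_ptr)"}))
region = Region(loops=(4, 2),
                tags=frozenset({"yield(acc1,acc2)", "store(c_ptr)", "load(a_ptr)"}))
print(match(region, gemv))
```

Separating the control skeleton from the dataflow components means that a mismatch in either one rejects the candidate, while extra unrelated tags in the region do not.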

Hardware Synthesis Flow

CADL --> Pre-Opt MLIR --> Optimize --> Schedule --> FIRRTL --> Verilog
              |              |            |
              |   (affine    |  (modulo   +--> CIRCT
              |    raises)   |   sched)
              |              |
              +-- aquas dialect ops: readrf, writerf, blockload, memstore

Dynamic pipeline elaboration: Each stage is a transaction with valid-ready handshakes. Loops decompose into entry/body/next transactions.
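
The valid-ready protocol mentioned above can be modeled at cycle level in a few lines. This is a generic sketch of the handshake, not the Aquas-generated RTL: the `simulate` function and the token names are illustrative assumptions.

```python
def simulate(items, consumer_ready):
    """Cycle-level model of one valid-ready channel.

    items: tokens the producer wants to send, in order.
    consumer_ready: per-cycle booleans for the consumer's ready signal.
    A token moves only in cycles where valid and ready are both high.
    Returns (received tokens, number of cycles simulated).
    """
    pending = list(items)
    received = []
    cycles = 0
    for ready in consumer_ready:
        if not pending:
            break
        cycles += 1
        valid = bool(pending)        # producer has data to offer
        if valid and ready:          # handshake fires
            received.append(pending.pop(0))
    return received, cycles


got, cycles = simulate(["entry", "body", "next"],
                       [True, False, True, True, True])
print(got, cycles)
```

The stalled second cycle delays delivery but does not drop or reorder tokens, which is the property that lets stages with variable latency be composed safely.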

Related Publications

2025

  1. Preprint
    Aquas: Enhancing Domain Specialization through Holistic Hardware-Software Co-Optimization based on MLIR
    Yuyang Zou, Youwei Xiao, Yansong Xu, and 6 more authors
    2025

  2. ICCAD
    Invited Paper: APS: Open-Source Hardware-Software Co-Design Framework for Agile Processor Specialization
    Youwei Xiao, Yuyang Zou, Yansong Xu, and 6 more authors
    In Proceedings of the 44th IEEE/ACM International Conference on Computer-Aided Design (ICCAD ’25), 2025

  3. ICCAD
    Clay: High-level ASIP Framework for Flexible Microarchitecture-Aware Instruction Customization
    Weijie Peng*, Youwei Xiao*, Yuyang Zou, and 2 more authors
    In Proceedings of the 44th IEEE/ACM International Conference on Computer-Aided Design (ICCAD ’25), 2025