Cayman
Automatic accelerator generation with control flow and data access optimization
Cayman is the first end-to-end framework that synthesizes high-performance custom accelerators with full control flow and data access support, automating both candidate selection and hardware synthesis.
┌────────────────────────────────────────────────────────────────┐
│ Cayman Framework │
├────────────────────────────────────────────────────────────────┤
│ │
│ Application ──► LLVM IR ──┬──► profile/analyze │
│ │ │
│ └──► build wPST │
│ │ │
│ ┌─────────────────────────────────▼─────────────────────────┐│
│ │ Candidate Selection (DP) ││
│ │ ┌─────────┐ ┌─────────────┐ ┌──────────────────────┐ ││
│ │ │ wPST │──►│ Accelerator │──►│ Pareto-optimal │ ││
│ │ │traversal│ │ Model │ │ Solutions │ ││
│ │ └─────────┘ │ • Ctrl-flow │ │ (speedup vs area) │ ││
│ │ │ • Interface │ └──────────────────────┘ ││
│ │ └─────────────┘ ││
│ └────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Accelerator Merging │ │
│ │ DFG₁ + DFG₂ ──► Reconfigurable Datapath + FSMs │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Hardware (Verilog) │
└────────────────────────────────────────────────────────────────┘
Limitations of Prior Work
| Approach | Candidate Selection | Control Flow | Data Access | Hardware Sharing |
|---|---|---|---|---|
| HLS | manual | optimized | specified | / |
| CFU synthesis | auto (DFG only) | / | scalar-only | restricted |
| Off-core accel | auto | sequential | slow | restricted |
| Cayman | auto | optimized | specialized | flexible |
Prior frameworks either exclude control flow/memory access, or only synthesize sequential control with slow interfaces.
Whole-Application Program Structure Tree (wPST)
Cayman extends the traditional PST to capture all SESE (single-entry-single-exit) regions as acceleration candidates:
root
/ \
func0 func1
/ \
loop loop outer
linear / | \
| entry head exit
linear |
loop dot-product
/ | \
head body cond
Two region types:
- bb regions: Basic blocks (dataflow graphs)
- ctrl-flow regions: Loops and conditionals
Data Access Interface Modeling
Cayman systematically models three processor-accelerator interfaces:
| Interface | Mechanism | Best For | Overhead |
|---|---|---|---|
| Coupled | ld/st units stall on access | Simple access | Low |
| Decoupled | AGUs compute addresses independently | Stream patterns in pipelined loops | Medium (AGUs + FIFOs) |
| Scratchpad | Dedicated buffer + DMA sync | High reuse (access count » footprint) | High (buffer + DMA) |
Impact example (dot product sum += x[i]*y[i]):
| Control Flow | Coupled | Decoupled | Scratchpad |
|---|---|---|---|
| Sequential | 6N cycles | 4N cycles | - |
| Pipelined | II=3 | II=1 | - |
| Unrolled(2) | 9(N/2) | - | 4(N/2) |
Candidate Selection Algorithm
Dynamic programming on wPST with heuristic pruning:
def DP(vertex v):
if prune(v, R): return
if v is bb:
F[v] = filter(pareto(accel(v, R)))
else:
F[v] = ∅
for u in v.children:
DP(u)
F[v] = filter(F[v] ⊗ F[u])
if v is ctrl-flow:
F[v] = filter(F[v] ∪ pareto(accel(v, R)))
- F[v]: Pareto-optimal solutions for subtree rooted at v
- ⊗: Combines sibling solutions
- filter: Removes solutions too close in area (keeps O(log A) solutions)
- Complexity: O(N log²A + E)
Accelerator Merging
Cayman enables reusable accelerators across kernels with diverse control flows:
┌─────────────────────────────────────────────────────┐
│ Merged Accelerator │
│ ┌────────────────────────────────────────────────┐ │
│ │ Reconfigurable Datapath (from DFG₁ ∪ DFG₂) │ │
│ │ LD ──► MUL ──► ADD ──► ST │ │
│ │ ↑ ↑ │ │
│ │ MUX MUX ◄── configBits │ │
│ └────────────────────────────────────────────────┘ │
│ ↑ ctrlSignals │
│ ┌────┴────┐ │
│ │ Ctrl │◄── select │
│ └────┬────┘ │
│ ┌──┴──┐ │
│ FSM₁ FSM₂ (standalone FSMs for each kernel) │
│ linear dot-prod │
└─────────────────────────────────────────────────────┘
Key insight: Basic blocks with FP ops and data access dominate area; control FSMs are cheap. Merge datapaths, keep separate FSMs.
Performance Estimation
Speedup calculation without full synthesis:
Speedup = T_all / (T_all - T_cand + Cycle_cand / F)
Where:
- T_all: Total application duration (profiled)
- T_cand: Duration of accelerated kernels (profiled)
- Cycle_cand: Estimated accelerator cycles (from scheduling)
- F: Target frequency
Implementation
Built on LLVM 18:
- wPST: RegionInfoAnalysis pass
- Profiling: Custom instrumentation pass
- Analysis: MemoryDependenceAnalysis + ScalarEvolution
- Synthesis: OpenROAD targeting Nangate45 PDK