Hierarchical SSM · Three Timescales · O(1) Memory

Context grows.
Complexity
stays flat.

Harmonic is a three-level hierarchical state space model for language modeling. It processes fast, medium, and slow temporal patterns in parallel — outperforming Transformers at long context while using constant memory at inference.

O(1)
Inference memory — constant regardless of context length
11.4%
Advantage over Transformer at seq=32K (bpt, enwiki8)
3
Hierarchy levels — τ = [4, 32, 128] timescales
64K+
Context tokens — Transformer and Mamba OOM at 64K. Harmonic trains.
Architecture

Three timescales.
One forward pass.

Each level operates at a different temporal resolution — capturing local syntax, phrase structure, and long-range discourse simultaneously. Inter-level signals pass compressed states upward and refined predictions downward.

LEVEL 01 — τ ≈ 4
Fast: Local Syntax
Processes every token with a short decay timescale. Captures character-level patterns, punctuation, local n-gram statistics. Output is mean-pooled (stride 4) and passed up to Level 02.
LEVEL 02 — τ ≈ 32
Medium: Phrase Structure
Operates on 4× downsampled input. Maintains coherence across clauses and sentences. The decay gate at τ≈32 covers about 128 input tokens (32 steps × 4), roughly one sentence.
LEVEL 03 — τ ≈ 128
Slow: Long-Range Dependencies
Operates on 16× compressed input, maintaining a compact world model of the document. Its 128-step timescale covers about 2,048 input tokens (128 steps × 16). Inter-level error signals, the difference between what each level predicted and what the level below produced, feed hierarchical refinement without requiring attention.
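As a rough check on the three levels above, the sketch below converts each timescale into original-token units using the stride-4 pooling between levels. The `exp(-1/τ)` retention factor is an assumption for illustration only; the page does not say how τ maps onto the decay gate, and the function name is hypothetical.

```python
import math

# Values taken from the text: per-level timescales and stride-4 pooling between levels.
TAUS = [4, 32, 128]
POOL_STRIDE = 4


def level_summary(taus=TAUS, stride=POOL_STRIDE):
    """Return downsampling factor, token span and an assumed decay for each level."""
    rows = []
    for depth, tau in enumerate(taus):
        downsample = stride ** depth      # 1x, 4x, 16x relative to the input
        token_span = tau * downsample     # timescale expressed in input tokens
        decay = math.exp(-1.0 / tau)      # ASSUMPTION: per-step retention exp(-1/tau)
        rows.append({
            "level": depth + 1,
            "tau": tau,
            "downsample": downsample,
            "token_span": token_span,
            "decay": decay,
        })
    return rows


for row in level_summary():
    print(row)
```

Under these assumptions the slow level's timescale corresponds to 128 × 16 input tokens, which is where the long-range reach comes from even though each level only keeps a short state.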
Experimental Results

Better at every
context length

Equal token budget (65.5M tokens), equal parameter count. The long-context advantage is not an artifact of the short training budget: at 5× the headline budget, Harmonic still leads at seq=8K.

bpt vs sequence length · enwiki8 · equal 65.5M token budget · lower is better · OOM = out of memory on H100 80GB
Seq       Transformer   Mamba   Harmonic   H–TF gap
1,024     6.662         6.616   6.571      +1.4%
2,048     6.657         6.532   6.426      +3.5%
4,096     7.045         6.740   6.687      +5.1%
8,192     6.787         6.422   6.333      +6.7%
16,384    6.873         6.286   6.196      +9.9%
32,768    7.259         6.549   6.433      +11.4%
65,536    OOM           OOM     6.169
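The gap column appears to be the Harmonic improvement expressed relative to the Transformer's bpt. That definition is inferred from the numbers rather than stated on the page; the short check below recomputes it from the Transformer and Harmonic columns under that assumption.

```python
# (sequence length, Transformer bpt, Harmonic bpt) copied from the table.
ROWS = [
    (1024, 6.662, 6.571),
    (2048, 6.657, 6.426),
    (4096, 7.045, 6.687),
    (8192, 6.787, 6.333),
    (16384, 6.873, 6.196),
    (32768, 7.259, 6.433),
]


def relative_gap_pct(transformer_bpt, harmonic_bpt):
    """ASSUMED definition: improvement as a percentage of the Transformer's bpt."""
    return 100.0 * (transformer_bpt - harmonic_bpt) / transformer_bpt


gaps = {seq: round(relative_gap_pct(tf, h), 1) for seq, tf, h in ROWS}

for seq, gap in gaps.items():
    print(f"seq={seq:>6}  gap=+{gap}%")
```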

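Five independent seeds, seq=8192. The mean ± std ranges for Harmonic and Mamba both lie below the Transformer's without overlap. Harmonic and Mamba overlap each other, with means 0.060 bpt apart. The ranking Harmonic < Mamba < Transformer is consistent across all seeds.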

mean ± std bpt · seq=8192 · 5 random seeds · enwiki8
Harmonic
6.515
± 0.163 bpt
Mamba
6.575
± 0.155 bpt
Transformer
7.009
± 0.159 bpt

At 20K steps (5× the headline budget), Harmonic still wins at seq=8K. At this longer budget the Transformer wins at short context, which is expected: full attention is well suited to short sequences.

training loss (bpt) vs steps · seq=8192 · enwiki8

The hierarchy is critical. Flat timescales (all τ equal) cost +0.50 bpt. Removing inter-level prediction-error signals has negligible effect (≤0.022 bpt) — the timescale structure itself drives performance.

ablation bpt at seq=8192 · lower is better
Stateful Inference

O(1) memory.
Unlimited documents.

Harmonic carries its raw SSM state across chunk boundaries — enabling truly unbounded context with a fixed memory footprint. Stateful fine-tuning consistently improves performance.

Harmonic Stateful · seq=1024
6.582 bpt
↓ 4.6% vs no-carry baseline
Carrying raw SSM state between chunks lets the model see beyond its training context window.
Harmonic Stateful · seq=8192
6.745 bpt
↓ 1.4% vs no-carry baseline
At 8K context, Harmonic's internal hierarchy already captures most long-range dependencies, so stateful carry adds a smaller gain on top.
vs Mamba Stateful · seq=1024
6.582 vs 6.646
Harmonic wins by 0.064 bpt
Both architectures benefit from state carry. Harmonic stateful beats Mamba stateful at seq=1024.
Fixed inference state
512 floats/level
Any document length, constant footprint
Zero KV-cache overhead. Stream arbitrarily long documents on any device.
bpt comparison: no-carry vs inference-only vs trained stateful · seq=1024 and seq=8192
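The idea of carrying state across chunk boundaries can be illustrated with a toy scalar recurrence. This is a minimal sketch, not Harmonic's actual layer: the update rule, the fixed decay, and the function names are all assumptions. It shows that streaming in chunks with a carried state reproduces the full-sequence result exactly, while the only thing kept between chunks is the fixed-size state.

```python
import math


def run_chunk(xs, state, decay):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over one chunk.

    Returns the per-step outputs and the final state to carry forward.
    """
    outputs = []
    h = state
    for x in xs:
        h = decay * h + (1.0 - decay) * x
        outputs.append(h)
    return outputs, h


def stream(xs, chunk_len, decay, carry=True):
    """Process a long sequence chunk by chunk, optionally carrying the state."""
    state = 0.0
    outputs = []
    for start in range(0, len(xs), chunk_len):
        if not carry:
            state = 0.0  # no-carry baseline: every chunk starts from scratch
        chunk_out, state = run_chunk(xs[start:start + chunk_len], state, decay)
        outputs.extend(chunk_out)
    return outputs


signal = [math.sin(0.05 * t) for t in range(64)]
decay = math.exp(-1.0 / 32)  # ASSUMPTION: fixed decay for a timescale of 32 steps

full, _ = run_chunk(signal, 0.0, decay)
carried = stream(signal, chunk_len=16, decay=decay, carry=True)
reset = stream(signal, chunk_len=16, decay=decay, carry=False)

print(max(abs(a - b) for a, b in zip(full, carried)))  # carried state matches the full run
print(max(abs(a - b) for a, b in zip(full, reset)))    # resetting loses earlier context
```

Memory between chunks is a single number here regardless of how many chunks have been seen; in a real model it would be the fixed per-level state vector instead.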
Mechanism

Hierarchical states,
not attention maps

01
Parallel Scan
Each level runs a data-dependent SSM scan in parallel via a Triton kernel (85× speedup on H100). The decay gate A(t) is input-dependent — fast patterns are forgotten quickly, slow patterns are retained.
02
Compress & Signal
Level 1 output is mean-pooled (stride 4) and passed to Level 2. The inter-level signal is the difference between what Level 2 predicted and what Level 1 produced — only the unexpected content propagates upward.
03
Weighted Combine
All three levels are upsampled back to sequence length and combined via learned weights. The final representation blends fast local and slow global signals in a single learned mixture.
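Steps 02 and 03 can be sketched on plain lists of scalars. Everything beyond the stride-4 mean pooling is an assumption: the sign convention of the error signal, nearest-neighbour upsampling, the hard-coded mixing weights (learned in the real model), and all function names are illustrative only.

```python
def mean_pool(xs, stride=4):
    """Average non-overlapping windows of `stride` steps (length must divide evenly)."""
    return [sum(xs[i:i + stride]) / stride for i in range(0, len(xs), stride)]


def upsample(xs, factor):
    """ASSUMED nearest-neighbour upsampling: repeat each coarse step `factor` times."""
    return [x for x in xs for _ in range(factor)]


def error_signal(pooled_lower, predicted_by_upper):
    """Residual between what the lower level produced and what the upper level predicted."""
    return [p - q for p, q in zip(pooled_lower, predicted_by_upper)]


def combine_levels(levels, weights):
    """Weighted sum of equal-length level outputs."""
    length = len(levels[0])
    return [sum(w * lvl[t] for w, lvl in zip(weights, levels)) for t in range(length)]


# Toy Level 1 output over 16 tokens.
level1 = [float(t % 5) for t in range(16)]

# Step 02: compress, then compare with a hypothetical Level 2 prediction.
to_level2 = mean_pool(level1)
level2_prediction = [2.0, 2.0, 2.0, 2.0]
surprise = error_signal(to_level2, level2_prediction)

# Stand-ins for the outputs of Levels 2 and 3 (real levels would run their own scans).
level2 = to_level2
level3 = mean_pool(level2)

# Step 03: bring every level back to sequence length and mix.
mixed = combine_levels(
    [level1, upsample(level2, 4), upsample(level3, 16)],
    weights=[0.5, 0.3, 0.2],
)
print(surprise)
print(mixed)
```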
training throughput (tokens/sec) vs sequence length · H100 · Harmonic O(n) vs Transformer O(n²)
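The linear-time recurrence behind step 01 can be parallelised because composing steps of the form h -> a·h + b is associative. The sketch below is a pure-Python illustration of that idea, not the Triton kernel: it uses a simple doubling (Hillis-Steele) scan, which needs about log2(n) rounds but does O(n log n) total work, and the gate values are made up.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference recurrence h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a, b):
    """Inclusive scan by doubling; the updates within each round are independent."""
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # Each element now holds the composed map for steps 0..i; applied to h = 0 it yields b.
    return [b_t for _, b_t in elems]


# Made-up input-dependent decay gates a_t and inputs b_t.
a = [0.9, 0.5, 0.8, 0.99, 0.1, 0.7, 0.95]
b = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0]

print(sequential_scan(a, b))
print(doubling_scan(a, b))
```

Because each round only reads the previous round's values, every position can be updated at once on parallel hardware. A work-efficient variant would keep total work at O(n).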
Community Feedback

Antonio Orvieto
Researcher, MPI-IS

Martin Jaggi
Professor, EPFL
Sounds very interesting! I would suggest you try to get this peer reviewed directly for e.g. a workshop submission.
— Antonio Orvieto, MPI-IS · May 2026
Paper & Code

Read the
preprint

Full architecture details, training procedure, ablation studies, and reproducible experiments. arXiv submission pending moderation.