Hierarchical SSM · Three Timescales · O(1) Memory

Context grows.
Complexity
stays flat.

Harmonic is a three-level hierarchical state space model for language modeling. It processes fast, medium, and slow temporal patterns in parallel — outperforming Transformers at long context while using constant memory at inference.

Architecture

Three timescales.
One forward pass.

Each level operates at a different temporal resolution — capturing local syntax, phrase structure, and long-range discourse simultaneously. Inter-level signals pass compressed states upward and refined predictions downward.

LEVEL 01 — τ ≈ 4

Fast: Local Syntax

Processes every token with a short decay timescale. Captures character-level patterns, punctuation, and local n-gram statistics. Output is mean-pooled (stride 4) and passed up to Level 02.

LEVEL 02 — τ ≈ 32

Medium: Phrase Structure

Operates on 4× downsampled input and maintains coherence across clauses and sentences. At that downsampling, the decay gate at τ≈32 spans about 128 input tokens (32 × 4), roughly one sentence at typical token rates.

LEVEL 03 — τ ≈ 128

Slow: Long-Range Dependencies

Operates on 16× compressed input, maintaining a compact world model of the document. At 16× compression its 128-step receptive field covers about 2,048 tokens (128 × 16). Inter-level error signals, the difference between what each level predicted and what the level below produced, drive hierarchical refinement without requiring attention.
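As a rough sanity check on the three timescales, the sketch below converts each level's τ into input tokens. It assumes one step at a level covers as many tokens as that level's downsampling factor (1×, 4×, 16×) and that the gate decays like exp(−1/τ) per step. The `Level` class and its methods are illustrative names, not part of Harmonic's code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    tau: float    # decay timescale, in steps at this level
    stride: int   # input tokens covered by one step at this level (assumed)

    def per_step_decay(self) -> float:
        # Assumed exponential gate: state shrinks by exp(-1/tau) each step.
        return math.exp(-1.0 / self.tau)

    def span_in_tokens(self) -> float:
        # One timescale expressed in input tokens.
        return self.tau * self.stride


LEVELS = [
    Level("fast", tau=4, stride=1),
    Level("medium", tau=32, stride=4),
    Level("slow", tau=128, stride=16),
]

for lvl in LEVELS:
    print(f"{lvl.name:>6}: decay/step={lvl.per_step_decay():.4f}, "
          f"span={lvl.span_in_tokens():.0f} tokens")
```

Under these assumptions the spans come out to 4, 128, and 2,048 input tokens for the fast, medium, and slow levels.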

Experimental Results

Better at every
context length

Equal token budget (65.5M tokens), equal parameter count. The advantage is not an artifact of the training budget: at 5× the headline budget the crossover still holds.

bpt vs sequence length · enwiki8 · equal 65.5M token budget · lower is better · OOM = out of memory on H100 80GB

Seq	Transformer	Mamba	Harmonic	Gap (Harmonic vs Transformer)
1 024	6.662	6.616	6.571	+1.4%
2 048	6.657	6.532	6.426	+3.5%
4 096	7.045	6.740	6.687	+5.1%
8 192	6.787	6.422	6.333	+6.7%
16 384	6.873	6.286	6.196	+9.9%
32 768	7.259	6.549	6.433	+11.4%
65 536	OOM	OOM	6.169	—
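
The gap column is consistent with the relative reduction (Transformer − Harmonic) / Transformer. That definition is inferred from the numbers rather than stated on the page. A short check using the table values:

```python
# bpt values copied from the table: seq -> (Transformer, Harmonic)
RESULTS = {
    1024: (6.662, 6.571),
    2048: (6.657, 6.426),
    4096: (7.045, 6.687),
    8192: (6.787, 6.333),
    16384: (6.873, 6.196),
    32768: (7.259, 6.433),
}


def relative_gap_percent(transformer_bpt: float, harmonic_bpt: float) -> float:
    """Percent reduction in bpt of Harmonic relative to the Transformer."""
    return 100.0 * (transformer_bpt - harmonic_bpt) / transformer_bpt


for seq, (tf_bpt, h_bpt) in RESULTS.items():
    print(f"{seq:>6}: +{relative_gap_percent(tf_bpt, h_bpt):.1f}%")
```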

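Five independent seeds, seq=8192. The ranking Harmonic < Mamba < Transformer is consistent across all seeds. The Transformer's mean ± std interval does not overlap with either SSM's; the Harmonic and Mamba means differ by 0.060 bpt, which is within one standard deviation.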

mean ± std bpt · seq=8192 · 5 random seeds · enwiki8

Model	Mean bpt	Std bpt
Harmonic	6.515	± 0.163
Mamba	6.575	± 0.155
Transformer	7.009	± 0.159

At 20K steps (5× the headline budget), Harmonic still wins at seq=8K. In this longer run the Transformer wins at short context, which is expected: attention is optimal for short sequences.

training loss (bpt) vs steps · seq=8192 · enwiki8

The hierarchy is critical. Flat timescales (all τ equal) cost +0.50 bpt. Removing inter-level prediction-error signals has negligible effect (≤0.022 bpt) — the timescale structure itself drives performance.

ablation bpt at seq=8192 · lower is better

Stateful Inference

O(1) memory.
Unlimited documents.

Harmonic carries its raw SSM state across chunk boundaries — enabling truly unbounded context with a fixed memory footprint. Stateful fine-tuning consistently improves performance.
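
To illustrate why carrying state gives a fixed footprint, here is a minimal sketch that uses a scalar exponential-decay recurrence as a stand-in for one SSM channel. The recurrence, the `StreamingState` class, and the chunk size are assumptions for illustration, not Harmonic's actual kernel, and the real gate is input-dependent rather than fixed.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StreamingState:
    decay: float     # fixed gate in (0, 1); a simplification of the real gate
    h: float = 0.0   # the only memory kept between chunks

    def process_chunk(self, chunk: Iterable[float]) -> List[float]:
        outputs = []
        for x in chunk:
            # h_t = decay * h_{t-1} + (1 - decay) * x_t
            self.h = self.decay * self.h + (1.0 - self.decay) * x
            outputs.append(self.h)
        return outputs


def run_chunked(xs: List[float], chunk_size: int, decay: float) -> List[float]:
    """Process xs chunk by chunk, carrying only the state across boundaries."""
    state = StreamingState(decay)
    out: List[float] = []
    for start in range(0, len(xs), chunk_size):
        out.extend(state.process_chunk(xs[start:start + chunk_size]))
    return out


signal = [float(i % 7) for i in range(40)]
whole = run_chunked(signal, chunk_size=len(signal), decay=0.9)
chunked = run_chunked(signal, chunk_size=8, decay=0.9)
print("chunked matches single pass:", whole == chunked)
```

The chunked run reproduces the single-pass run because the carried `h` summarizes everything seen so far. Memory between chunks stays at one float per channel regardless of document length.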

Harmonic Stateful · seq=1024

6.582 bpt

↓ 4.6% vs no-carry baseline

Carrying raw SSM state between chunks lets the model see beyond its training context window.

Harmonic Stateful · seq=8192

6.745 bpt

↓ 1.4% vs no-carry baseline

At 8K context, Harmonic's internal hierarchy already captures most long-range dependencies, so stateful carry adds a smaller gain on top.

vs Mamba Stateful · seq=1024

6.582 vs 6.646

Harmonic wins by 0.064 bpt

Both architectures benefit from state carry. Harmonic stateful beats Mamba stateful at seq=1024, the short-context setting reported here.

Fixed inference state

512 floats/level

Any document length, constant footprint

Zero KV-cache overhead. Stream arbitrarily long documents on any device.

bpt comparison: no-carry vs inference-only vs trained stateful · seq=1024 and seq=8192

Mechanism

Hierarchical states,
not attention maps

Parallel Scan

Each level runs a data-dependent SSM scan in parallel via a Triton kernel (85× speedup on H100). The decay gate A(t) is input-dependent — fast patterns are forgotten quickly, slow patterns are retained.
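
The reason a recurrence with an input-dependent gate can still run in parallel is that each step is an affine map, and composing affine maps is associative. The sketch below assumes a first-order recurrence h_t = a_t · h_{t−1} + b_t, with a_t standing in for the decay gate A(t) and b_t for the input contribution. It is plain Python for illustration and does not reproduce the Triton kernel; `combine` and `doubling_scan` are illustrative names.

```python
import math
from typing import List, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map h -> a * h + b


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two steps: apply `first`, then `second`. This is associative."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a: List[float], b: List[float]) -> List[float]:
    """Reference recurrence h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a: List[float], b: List[float]) -> List[float]:
    """Inclusive scan in ceil(log2(n)) passes (Hillis-Steele style)."""
    elems = list(zip(a, b))
    n, step = len(elems), 1
    while step < n:
        # Each position i >= step merges with the element `step` to its left.
        # On a GPU, all positions within one pass could run concurrently.
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
    # With h = 0 before the first step, the offset term of each map is h_t.
    return [offset for _, offset in elems]


gates = [0.9, 0.5, 0.8, 0.7, 0.95]
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = sequential_scan(gates, inputs)
par = doubling_scan(gates, inputs)
print(all(math.isclose(r, p, rel_tol=1e-12) for r, p in zip(ref, par)))
```

The two functions agree up to floating-point rounding. The doubling version trades extra arithmetic for a pass count that grows logarithmically with sequence length.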

Compress & Signal

Level 1 output is mean-pooled (stride 4) and passed to Level 2. The inter-level signal is the difference between what Level 2 predicted and what Level 1 produced — only the unexpected content propagates upward.
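
A minimal sketch of the pooling and error signal follows, using scalars in place of feature vectors. The sign convention (produced minus predicted), the handling of a ragged tail, and the example values are assumptions made for illustration.

```python
from typing import List


def mean_pool(xs: List[float], stride: int = 4) -> List[float]:
    """Average non-overlapping windows of `stride` values (drops a ragged tail)."""
    usable = len(xs) - len(xs) % stride
    return [sum(xs[i:i + stride]) / stride for i in range(0, usable, stride)]


def prediction_error(produced: List[float], predicted: List[float]) -> List[float]:
    """Element-wise produced - predicted: zero wherever the prediction was right."""
    if len(produced) != len(predicted):
        raise ValueError("produced and predicted must have the same length")
    return [p - q for p, q in zip(produced, predicted)]


level1_out = [1.0, 1.0, 1.0, 1.0, 2.0, 4.0, 6.0, 8.0]
pooled = mean_pool(level1_out)                 # [1.0, 5.0]
level2_guess = [1.0, 3.0]                      # hypothetical Level 2 prediction
print(prediction_error(pooled, level2_guess))  # [0.0, 2.0]
```

The first window was predicted exactly, so it contributes nothing upward; only the second window's surprise of 2.0 is passed on.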

Weighted Combine

All three levels are upsampled back to sequence length and combined via learned weights. The final representation blends fast local and slow global signals in a single learned mixture.
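
The combine step can be sketched as follows. The page does not specify the upsampling method or how the weights are normalized, so this example assumes nearest-neighbour repetition and softmax-normalized scalar weights; `upsample_repeat` and `combine_levels` are illustrative names.

```python
import math
from typing import List


def upsample_repeat(xs: List[float], factor: int) -> List[float]:
    """Nearest-neighbour upsampling: repeat each value `factor` times."""
    return [x for x in xs for _ in range(factor)]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_levels(fast: List[float], medium: List[float], slow: List[float],
                   logits: List[float]) -> List[float]:
    """Blend three levels at full sequence length with softmax weights."""
    med_up = upsample_repeat(medium, 4)    # Level 2 runs at 4x downsampling
    slow_up = upsample_repeat(slow, 16)    # Level 3 runs at 16x compression
    if not (len(fast) == len(med_up) == len(slow_up)):
        raise ValueError("level lengths do not line up after upsampling")
    w_f, w_m, w_s = softmax(logits)
    return [w_f * f + w_m * m + w_s * s
            for f, m, s in zip(fast, med_up, slow_up)]


fast = [float(i) for i in range(16)]
medium = [10.0, 20.0, 30.0, 40.0]
slow = [100.0]
mixed = combine_levels(fast, medium, slow, logits=[0.0, 0.0, 0.0])
print(len(mixed), round(mixed[0], 3), round(mixed[-1], 3))
```

With equal logits each level contributes one third, so every output position is the plain average of the three upsampled signals. In training, the logits would be learned parameters.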

training throughput (tokens/sec) vs sequence length · H100 · Harmonic O(n) vs Transformer O(n²)

Read the preprint
