
Training on consumer GPUs

[Figure: consumer GPU training and long context (conceptual illustration)]

2026-03-29 · ~7 min read · benchmarks · gpu

By Badaramoni Avinash

There is a context length wall in transformer training. It is not theoretical — it is a hard out-of-memory crash. On every consumer GPU we tested, standard self-attention runs out of VRAM somewhere between 16K and 32K tokens. The O(N²) attention matrix simply does not fit.

Wave Field does not have this problem. We trained at 256K context on an RTX 5090 — a $2,000 card. The same context length that causes an instant OOM on a standard transformer ran smoothly, with throughput holding steady from 32K through 256K.


The memory wall

Standard self-attention computes an N × N matrix for every head, at every layer, for every batch element. At 32K context with 12 heads, that single matrix is 32,768 × 32,768 × 12 × 4 bytes ≈ 48 GB. An RTX 5090 has 32 GB. An RTX 3090 has 24 GB. There is nowhere to put it.
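The arithmetic can be checked directly; a two-line Python version of the calculation above:

```python
def attention_matrix_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 4) -> int:
    """Memory for one layer's N x N attention score matrix across all heads (float32)."""
    return seq_len * seq_len * num_heads * bytes_per_elem

gib = attention_matrix_bytes(32_768, 12) / 2**30
print(f"{gib:.0f} GiB")  # -> 48 GiB, more than any consumer card's VRAM
```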

Wave Field computes attention through FFT convolution on a 1D field. The memory requirement scales as O(N), not O(N²). At 256K context, the field fits comfortably in 32 GB with room for the rest of the model, the optimizer states, and the activations.
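To see why memory stays O(N): convolving a length-N field with a length-N kernel through the FFT only ever allocates length-N (and N/2 + 1) frequency buffers, never an N × N matrix. A minimal NumPy sketch of the general technique; `field` and `kernel` are illustrative names, not Wave Field's actual API:

```python
import numpy as np

def fft_conv(field: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution in O(N log N) time and O(N) memory:
    only length-N buffers are allocated, never an N x N matrix."""
    n = field.shape[-1]
    return np.fft.irfft(np.fft.rfft(field) * np.fft.rfft(kernel), n=n)

rng = np.random.default_rng(0)
n = 8
field = rng.standard_normal(n)
kernel = rng.standard_normal(n)
out = fft_conv(field, kernel)

# agrees with the direct O(N^2) circular convolution
direct = np.array([sum(field[j] * kernel[(i - j) % n] for j in range(n))
                   for i in range(n)])
assert np.allclose(out, direct)
```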


Measured results

All benchmarks run the same 130M-parameter model architecture in float32 precision. Throughput is measured in thousands of tokens per second. "OOM" means the run crashed with a CUDA out-of-memory error before completing a single forward pass.

GPU (price · VRAM)           Context   Standard      Wave Field
RTX 5090 ($2,000 · 32 GB)    2K        71K tok/s     152K tok/s
                             8K        26K tok/s     176K tok/s
                             32K       OOM           161K tok/s
                             256K      OOM           157K tok/s
RTX 3090 ($1,500 · 24 GB)    2K        32K tok/s     64K tok/s
                             32K       OOM           66K tok/s
                             256K      OOM           66K tok/s
H100 ($30,000 · 80 GB)       2K        86K tok/s     138K tok/s
                             32K       OOM           183K tok/s
                             256K      OOM           179K tok/s
                             512K      OOM           179K tok/s

Same 130M-parameter model, float32 precision, measured throughput
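For reference, a tokens-per-second figure of this kind is typically total tokens processed over wall-clock time. A generic harness sketch, not the actual benchmark code used here; `step_fn` stands in for one forward/backward pass:

```python
import time

def measure_throughput(step_fn, batch_size: int, seq_len: int, steps: int = 10) -> float:
    """Tokens per second over `steps` training steps.
    (On a real GPU run, call torch.cuda.synchronize() before reading the clock,
    so queued kernels are not mistaken for completed work.)"""
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * seq_len * steps / elapsed

# toy stand-in for a training step
rate = measure_throughput(lambda: time.sleep(0.001), batch_size=1, seq_len=2048, steps=5)
```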


What the numbers show

Three things stand out.

Standard attention collapses early. At 32K context, every GPU we tested, consumer and datacenter alike, fails to complete a single forward pass with standard self-attention. The quadratic memory scaling makes anything beyond short context impossible without specialized techniques such as FlashAttention or gradient checkpointing, which add engineering complexity (and, in the case of checkpointing, trade throughput for memory).

Wave Field throughput is flat. On the RTX 5090, throughput at 256K context (157K tok/s) is nearly identical to throughput at 32K context (161K tok/s). Under O(N log N) scaling, doubling the context length barely changes the cost per token. This is not gradual degradation; it is fundamentally different scaling behavior.
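The flat curve has a simple arithmetic explanation: total cost of O(N log N) means per-token cost of O(log N). A quick check of how per-token cost changes when context doubles from 32K (pure arithmetic, not measured data):

```python
import math

def per_token_growth(total_cost, n: int) -> float:
    """Ratio of cost-per-token at context 2N vs context N, for a total-cost model."""
    return (total_cost(2 * n) / (2 * n)) / (total_cost(n) / n)

n = 32_768
quadratic = per_token_growth(lambda m: m * m, n)               # O(N^2) total
log_linear = per_token_growth(lambda m: m * math.log2(m), n)   # O(N log N) total

print(quadratic)                 # -> 2.0  (per-token cost doubles)
print(round(log_linear, 3))      # -> 1.067 (about 7% more per token)
```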

Consumer GPUs compete with datacenter hardware. The RTX 5090 running Wave Field at 256K context (157K tok/s) comes within 12% of the H100 at the same context length (179K tok/s). The H100 costs 15× more.


At 32K context: a direct comparison

Comparing the two architectures on an RTX 5090 around the 32K memory wall. Standard attention's last completed run is at 8K; 32K is the first context where it OOMs while Wave Field keeps going:

Throughput advantage: 6.8× faster (Wave Field 176K vs standard 26K tok/s at 8K, the last context standard attention completes)
Memory advantage: 5.3× less VRAM used

Standard attention manages 26K tok/s at 8K, then OOMs at 32K. Wave Field runs at 161K tok/s at 32K without difficulty. The throughput gap only widens at longer contexts, because one architecture scales quadratically and the other log-linearly.


What this changes

Training long-context models has been a datacenter-only activity. You need H100s or A100s, you need them in quantity, and you need the infrastructure to connect them. A single researcher or a small team cannot afford the hardware, the power, or the cloud compute bills.

Wave Field changes the economics. A single RTX 5090 — the kind of GPU that fits in a desktop workstation — can train at 256K context. A cluster of 8 of them, costing $16,000 total, provides training capacity that would require $240,000 worth of H100s under standard attention.
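Framed as throughput per dollar at 256K context, using the table's numbers and the list prices quoted above (a back-of-envelope comparison that ignores power, cooling, and interconnect):

```python
def tok_s_per_dollar(tok_s: float, price: float) -> float:
    """Sustained training throughput per dollar of hardware."""
    return tok_s / price

rtx5090 = tok_s_per_dollar(157_000, 2_000)   # -> 78.5 tok/s per dollar
h100 = tok_s_per_dollar(179_000, 30_000)     # -> ~6.0 tok/s per dollar

print(round(rtx5090 / h100, 1))  # -> 13.2: the 5090 delivers ~13x more throughput per dollar
```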

The architecture does not make datacenter hardware unnecessary. It makes consumer hardware sufficient.