
variable heads self-organize

Figure: Head specialization (conceptual illustration of variable attention heads and learned frequencies)

2026-03-30 · note

By Badaramoni Avinash

Wave heads learn different frequencies: some behave more like long-range routing, others more like local structure. Whether variable head widths help is an implementation detail, and figures and tables are not published here.
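To make "variable head widths" concrete, here is a minimal sketch, assuming the phrase means that each head gets an unequal slice of the model dimension. That reading is an assumption, and so are the name `variable_width_attention`, the widths `[2, 6, 8]`, and the random weights; none of them come from the experiments. It is plain dot-product attention, not the wave-head mechanism itself, and it uses only the standard library so it runs anywhere.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def variable_width_attention(x, wq, wk, wv, head_widths):
    """Self-attention in which head h owns head_widths[h] channels (hypothetical helper)."""
    if sum(head_widths) != len(wq[0]):
        raise ValueError("head widths must sum to the projection width")
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    out = [[] for _ in x]
    start = 0
    for width in head_widths:
        end = start + width
        # Slice out this head's channels from the shared projections.
        qh = [row[start:end] for row in q]
        kh = [row[start:end] for row in k]
        vh = [row[start:end] for row in v]
        # scores[i][j] is the dot product of query i and key j.
        scores = matmul(qh, [list(col) for col in zip(*kh)])
        # Scale by this head's own width, then normalise each row.
        weights = [softmax([s / math.sqrt(width) for s in row]) for row in scores]
        head_out = matmul(weights, vh)
        # Concatenate head outputs along the channel axis.
        for i, row in enumerate(head_out):
            out[i].extend(row)
        start = end
    return out


def random_matrix(rng, rows, cols, scale=1.0):
    """Gaussian matrix used only as placeholder input and weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model = 6, 16
head_widths = [2, 6, 8]  # illustrative split; sums to d_model
x = random_matrix(rng, seq_len, d_model)
wq, wk, wv = (random_matrix(rng, d_model, d_model, 0.1) for _ in range(3))
out = variable_width_attention(x, wq, wk, wv, head_widths)
print(len(out), len(out[0]))  # 6 16
```

The only difference from the usual equal split is the slicing loop and the per-head scaling by `sqrt(width)`. The output keeps the full model width regardless of how the channels are divided among heads.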

See GitHub for experiments.