RePo: how transformers re-position tokens by meaning
The problem

Figure: the sentence "The capital of France is Paris." with one position per word-order slot:

The·1  capital·2  of·3  France·4  is·5  Paris·6  .

France (pos 4) and Paris (pos 6) are 2 positions apart — the model must "travel" across unrelated tokens to connect them.
The core problem
Standard models assign position by word order (slot), not by meaning.
Related tokens stay far apart in the attention math.
Standard RoPE

Figure: each word gets a fixed integer position — distance is always ordinal:

The·1  capital·2  of·3  France·4  is·5  Paris·6

France→Paris gap = 2 — fixed, ordinal, cannot compress.

Positions are static: 1, 2, 3, 4, 5, 6 — always integers, always ordinal. Semantically close words still appear far apart in the math.
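In standard RoPE the position fed into the rotation is just the token's index, so the France–Paris distance is pinned at 2 no matter what the words mean. A minimal sketch of that ordinal assignment:

```python
# Standard RoPE: position = word-order slot, always an integer.
sentence = ["The", "capital", "of", "France", "is", "Paris"]
pos = {tok: i + 1 for i, tok in enumerate(sentence)}  # {"The": 1, ..., "Paris": 6}

gap = pos["Paris"] - pos["France"]
print(gap)  # 2 -- fixed and ordinal; no amount of context can shrink it
```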
The MLP GPS

Inside one transformer layer, RePo adds a GPS side-quest before attention. Figure, left to right:

token ("Paris") → LayerNorm (stabilize) → hidden state h (extracted here) → MLP GPS factory (4096 → 64) → p (64-d "position essence") → used to rotate Q and K (RoPE) inside attention's Q·K / V computation.

h knows context (by layer 12, "Paris" knows it's a capital). p encodes that context as a decimal position.
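A minimal sketch of the GPS step, assuming only what the figure gives (a LayerNorm on h, then a 4096 → 64 projection); the function name, weight initialization, and single-layer form are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_POS = 4096, 64  # dims from the figure: 4096 -> 64

def layer_norm(h, eps=1e-5):
    """Stabilize h before extraction (per-token normalization)."""
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

W = rng.normal(0, 0.02, size=(D_MODEL, D_POS))  # hypothetical learned projection

def position_gps(h):
    """Map hidden states h (seq, 4096) to position essences p (seq, 64)."""
    return layer_norm(h) @ W

h = rng.normal(size=(6, D_MODEL))  # one row per token of "The capital of France is Paris"
p = position_gps(h)
print(p.shape)  # (6, 64): one 64-d position essence per token
```

Because p is computed from the contextual h rather than the token index, a later layer's "Paris" can land near whatever its context says it belongs with.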
Head assignment

Same p, different head weight → a different decimal z per head. Figure: each head projects the shared p_Paris (64-d position essence) through its own weights:

Head A (geography specialist): z = w_A · p = 2.1 — near France (2.0)
Head B (grammar specialist): z = w_B · p = 5.9 — near the period (6.0)
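Each head's scalar position is just a dot product between the shared essence and that head's own learned vector. A sketch (the weights here are random, so the z values won't match the 2.1 / 5.9 in the figure):

```python
import numpy as np

rng = np.random.default_rng(1)
p_paris = rng.normal(size=64)  # shared 64-d position essence for "Paris"
w_A = rng.normal(size=64)      # Head A's learned weights (geography specialist)
w_B = rng.normal(size=64)      # Head B's learned weights (grammar specialist)

z_A = float(w_A @ p_paris)     # Head A's decimal position for "Paris"
z_B = float(w_B @ p_paris)     # Head B's decimal position for "Paris"
# Same p, different w -> a different decimal z per head.
print(z_A != z_B)
```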
The new map

Head A (geography lens) — semantic clustering in action:

Standard: The·1  capital·2  of·3  France·4  is·5  Paris·6 — France→Paris gap = 2
RePo:     France·2.0  Paris·2.1  The·5.5  is·5.6  capital·5.7 — France→Paris gap = 0.1

Geography cluster — snapped together. Filler words — pushed aside.
Words don't move in memory — z changes the RoPE angle so attention sees them as adjacent
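A sketch of that mechanism: standard RoPE rotation, evaluated at a fractional position z (the helper name is mine). RoPE's relative-position property means the attention score between q at z = 2.1 and k at z = 2.0 depends only on the 0.1 gap, exactly as if the tokens were adjacent:

```python
import numpy as np

def rope_rotate(x, z, base=10000.0):
    """Rotate x by RoPE angles for position z -- z may be a decimal like 2.1."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # standard RoPE frequency ladder
    ang = z * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(2)
q, k = rng.normal(size=64), rng.normal(size=64)

# Score with "France" at 2.0 and "Paris" at 2.1 ...
score = rope_rotate(q, 2.1) @ rope_rotate(k, 2.0)
# ... equals the score at positions 0.1 and 0.0: only the gap matters.
score_gap = rope_rotate(q, 0.1) @ rope_rotate(k, 0.0)
print(np.isclose(score, score_gap))  # True
```

So RePo never edits the KV cache — it only swaps the integer index for z in this rotation, and attention sees the new geography.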