I was doing some AI training at work and had a really great question from one of the attendees. They were asking about why AIs and especially AI coding tools have a reputation for getting "stupid" as their context window fills up vs. right after you clear the context. I was super curious about this topic so of course I asked Claude Code.
Attention dilution. Transformers use self-attention, where every token attends to every other token. As the context grows, the model's attention gets spread thinner. Important instructions or context from earlier in the conversation get "drowned out" by the sheer volume of later tokens. The model doesn't forget — it just pays less attention to things that are far away or surrounded by noise.
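To make "spread thinner" concrete, here's a toy sketch in Python (the scores are invented and this is not any real model's attention code): one "instruction" token that the query matches strongly still loses attention share as loosely related filler tokens pile in around it.

```python
# Toy illustration of attention dilution. The instruction token's score never
# changes, but its share of the softmax shrinks as more filler tokens compete.
import numpy as np

rng = np.random.default_rng(0)

def instruction_weight(n_filler_tokens: int) -> float:
    instruction_score = 4.0                           # query matches the instruction strongly
    filler_scores = rng.normal(size=n_filler_tokens)  # weak, noisy matches everywhere else
    scores = np.concatenate(([instruction_score], filler_scores))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over all tokens
    return weights[0]                                 # attention mass left for the instruction

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7} filler tokens -> instruction gets {instruction_weight(n):.4f} of the attention")
```

Nothing about the instruction changes between the first line of output and the last; the only difference is how much else is in the room competing for the same fixed budget of attention.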
The "lost in the middle" problem. Research (notably from Stanford/Berkeley) has shown that LLMs disproportionately attend to tokens near the beginning and end of the context window, while information in the middle gets neglected. So if your key spec or constraint was stated 30K tokens ago and buried under conversation, the model is measurably worse at retrieving and applying it. This seems like a potential bug that could be fixed with tuning of how they implement attention.
Accumulation of conflicting signals. In a long coding session, you might refactor an approach, change your mind about an API, or correct the model. All those old, superseded instructions are still sitting in the context. The model doesn't have a clean mechanism to "invalidate" earlier context — it just has a big bag of tokens, some of which contradict each other. This creates confusion that compounds over time.
Positional encoding degradation. Most models use positional encodings (like RoPE) that were trained primarily on shorter sequences, then extended to longer ones. Performance at the edges of the supported context length is empirically worse — the model's sense of "where things are" in the sequence gets fuzzier.
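For the curious, here's a stripped-down sketch of the RoPE idea in Python (illustrative only; real implementations pair dimensions differently and add length-extension tricks such as NTK or YaRN scaling). Each pair of dimensions gets rotated by an angle proportional to the token's position, so positions far beyond what training emphasized correspond to rotations the model has rarely had to interpret.

```python
# Minimal RoPE sketch: rotate pairs of dimensions by position-dependent angles.
# base=10000 follows the common convention; everything else is simplified.
import numpy as np

def rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary position embedding to one even-dimensional vector."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)    # fast-spinning to slow-spinning pairs
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

q = np.ones(8)
print(rope(q, 100))       # a position well inside typical training lengths
print(rope(q, 100_000))   # a position the model may have seen far less often
```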
Softmax flattening. The softmax in attention produces a probability distribution over all the keys. With more keys to attend to, the distribution flattens out, making it harder for the model to sharply focus on the most relevant pieces. It's like trying to pick out one voice in a room that keeps getting more crowded.
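Here's a tiny illustration of that flattening in Python, this time with nothing special in the context at all (purely random scores, so it's a toy, but the trend is the point): as the number of keys grows, the entropy of the attention distribution climbs toward the uniform-distribution maximum and the single largest weight keeps shrinking.

```python
# Toy demo: with more keys, softmax attention flattens out. Entropy approaches
# log(n_keys), which is what a perfectly uniform (unfocused) distribution gives.
import numpy as np

rng = np.random.default_rng(0)

def softmax(scores: np.ndarray) -> np.ndarray:
    e = np.exp(scores - scores.max())
    return e / e.sum()

for n_keys in (100, 1_000, 10_000, 100_000):
    weights = softmax(rng.normal(size=n_keys))
    entropy = -np.sum(weights * np.log(weights))
    print(f"{n_keys:>7} keys: max weight {weights.max():.5f}, "
          f"entropy {entropy:.2f} (uniform would be {np.log(n_keys):.2f})")
```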
This is exactly why spec-driven development and context window management matter so much. Strategies like keeping a concise, authoritative spec at the top of context, periodically starting fresh sessions, and being disciplined about what goes into the context window aren't just nice-to-haves — they're compensating for a fundamental architectural limitation. The best AI-assisted developers treat the context window as a scarce resource, not a bottomless notepad.
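As a sketch of what that discipline can look like in code (the helper names and the rough 4-characters-per-token estimate are invented for illustration, not how any particular tool works): pin the spec at the top of every request, give the rest of the conversation a hard token budget, and let the oldest turns fall off first.

```python
# Illustrative context-assembly policy: the spec always survives, everything
# else competes for a fixed budget, newest turns win.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str       # "system", "user", or "assistant"
    content: str

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)    # crude heuristic, fine for budgeting

def build_context(spec: str, history: list[Turn], budget_tokens: int) -> list[Turn]:
    context = [Turn("system", spec)]             # the authoritative spec stays pinned
    remaining = budget_tokens - estimate_tokens(spec)
    kept: list[Turn] = []
    for turn in reversed(history):               # walk from newest to oldest
        cost = estimate_tokens(turn.content)
        if cost > remaining:
            break                                # older turns simply don't make the cut
        kept.append(turn)
        remaining -= cost
    return context + list(reversed(kept))
```

The exact policy matters less than the fact that something is deciding what earns a place in the window instead of letting it fill up by default.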
It also raises some other questions. The advanced models with 1M-token context windows can be very helpful when tackling big problems, but do they suffer from these same issues even when they are only 20% full?
The degradation is a function of absolute token count, not percentage of the window used. A 1M context model at 20% full still has 200K tokens of context, which is a massive amount of text. The attention mechanism doesn't care that you have 800K tokens of headroom left — it's still trying to attend across 200K tokens, and all the same problems apply:
Attention is still diluted across 200K tokens
The "lost in the middle" effect is still happening across that 200K span
Softmax is still flattening over all those keys
Any contradictory or stale content in those 200K tokens is still creating noise
Think of it like a warehouse. Having a bigger warehouse doesn't make it easier to find a specific item — it arguably makes it harder. A 1M context window at 20% full is like a warehouse that's "only" the size of a football field instead of five football fields. You still can't find anything.
The real value of large context windows isn't sustained, ongoing conversation — it's ingestion. They're great for things like "read this entire codebase and answer a question about it" or "analyze these 50 files together." The model gets one big look at a lot of information and produces a focused output. That's very different from a long iterative coding session where context accumulates turn by turn with corrections, dead ends, and superseded approaches mixed in.
Large context windows expand what you can load, but they don't change the fundamental discipline around what you should keep in context during iterative work. A 1M window gives you a bigger workbench, but a messy workbench is still a messy workbench regardless of its size.