Some reading on recursive reasoning transformers

2025-04-04

Recurrent transformers for reasoning

My proposed latent reasoning architecture

I have a hunch that “reasoning” LLMs don’t actually need natural-language intermediate reasoning steps. I’m imagining a recurrent latent reasoning chain where the output of each step is the input to the next. This would not be efficient for parallel training, but my understanding is that these models are trained via RL anyway, which, as I understand it, requires autoregressive generation regardless.
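To make the idea concrete, here is a minimal sketch of the latent reasoning loop I have in mind. All of the names and choices here (the single shared block, the fixed number of steps, decoding only at the end) are my own assumptions for illustration, not taken from any of the papers below.

```python
# Hypothetical sketch: one transformer block applied recurrently in latent space.
# The latent output of step t becomes the input of step t + 1; tokens are only
# decoded after the final step, so no natural-language intermediate steps exist.
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=32000, n_steps=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.n_steps = n_steps
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        z = self.embed(token_ids)        # initial latent: the embedded prompt
        for _ in range(self.n_steps):    # recurrent "reasoning" in latent space
            z = self.block(z)            # output of step t is input to step t + 1
        return self.unembed(z)           # decode only after the final step

logits = LatentReasoner()(torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, 16, 32000)
```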

I’ve heard of techniques that let LMs use a “think” token or a “backspace” token. I’m also aware of a paper that used “…” tokens in place of an explicit chain of thought. Here is a (lightly) annotated list of related papers.

Looped Transformers

Looped Transformers for Length Generalization is a theoretical paper that constructs length-generalizing transformers using the “looped transformer” architecture. Looped Transformers proposed the original architecture; it is also a theoretical paper, and it compares the looped architecture to aspects of Turing machines.
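My rough mental model of the looped architecture, sketched below: a single block of shared weights is applied T times, and T can be increased with input length at test time, which is where the length generalization would come from. The input-injection detail (adding the embedded input back in at every iteration) is my assumption about the setup, not a verified reading of the paper.

```python
# Hedged sketch of a looped transformer: same weights every iteration, with the
# number of loops chosen per input. Input injection is assumed, not confirmed.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )

    def forward(self, x_embed, n_loops):
        h = torch.zeros_like(x_embed)
        for _ in range(n_loops):
            h = self.block(h + x_embed)  # shared weights, input injected each loop
        return h

block = LoopedBlock()
x = torch.randn(1, 12, 256)
short = block(x, n_loops=4)    # few loops for a short/easy input
long_ = block(x, n_loops=32)   # more loops for a longer/harder input at test time
```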

Universal Transformers

The “original” recurrent transformer paper: the same transformer block is applied repeatedly until a “stopping condition” is met.
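The stopping condition in the paper is an adaptive-computation-time style halting mechanism; the sketch below is a simplified caricature of that idea (per-position halting probabilities accumulated until a threshold), not a faithful reimplementation.

```python
# Simplified sketch of the Universal Transformer recurrence with ACT-style halting.
# Details (thresholding, no state mixing) are my simplifications, not the paper's.
import torch
import torch.nn as nn

class UniversalTransformerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, max_steps=12, halt_threshold=0.99):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True
        )
        self.halt = nn.Linear(d_model, 1)  # per-position halting probability
        self.max_steps = max_steps
        self.halt_threshold = halt_threshold

    def forward(self, x):
        halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        cum_halt = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_steps):
            x = self.block(x)                                  # same block every step
            cum_halt = cum_halt + torch.sigmoid(self.halt(x)).squeeze(-1)
            halted = halted | (cum_halt > self.halt_threshold)
            if halted.all():                                   # stop once all positions halt
                break
        return x
```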

Deep equilibrium models

I really like this model idea. It is something I have thought about in the past: repeatedly applying a transformation until it reaches a stable state, and using that as the stopping condition. The implicit differentiation is a cool trick that makes training this possible. I wonder how it could work in a transformer architecture where the hidden-state output from the previous step is used as the input to the next.
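A minimal sketch of the structure, under some simplifying assumptions: iterate a shared transformation f to an approximate fixed point with gradients turned off, then take one extra differentiable step through f. The actual DEQ paper backpropagates through the fixed point exactly via the implicit function theorem (solving a linear system in the backward pass); the one-step re-attachment below is only there to make the fixed-point iteration and stopping condition concrete.

```python
# DEQ-flavored sketch: solve z* = f(z*, x) by iteration, stop when stable.
# The exact implicit-differentiation backward pass from the paper is not shown.
import torch
import torch.nn as nn

class DEQSketch(nn.Module):
    def __init__(self, d_model=128, max_iters=50, tol=1e-4):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Tanh())
        self.max_iters, self.tol = max_iters, tol

    def step(self, z, x):
        # one application of the shared transformation f(z, x)
        return self.f(torch.cat([z, x], dim=-1))

    def forward(self, x):
        z = torch.zeros_like(x)
        with torch.no_grad():                        # find the fixed point z* = f(z*, x)
            for _ in range(self.max_iters):
                z_next = self.step(z, x)
                stable = (z_next - z).norm() < self.tol
                z = z_next
                if stable:                           # "stable state" stopping condition
                    break
        return self.step(z, x)                       # one differentiable step at z*

out = DEQSketch()(torch.randn(4, 128))
```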

Converting a trained transformer to an RNN

I did a class project on this, using polynomial fitting to try to get a better initialization for a linear RNN that replaces the softmax attention layer in a pretrained transformer. It didn’t work well, but I have some ideas for what to try next.
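For context on what “replace the softmax attention layer with a linear RNN” means, here is the standard causal linear-attention recurrence written as an RNN over the sequence. The feature map phi here is an arbitrary positive map chosen for illustration; the polynomial-fitting initialization from my project is not shown.

```python
# Causal linear attention as an RNN: a running matrix state replaces the softmax
# over all past keys. This is the generic formulation, not my project's method.
import torch
import torch.nn.functional as F

def linear_attention_rnn(q, k, v, phi=lambda t: F.elu(t) + 1):
    # q, k, v: (seq_len, d); returns (seq_len, d)
    d = q.shape[-1]
    S = torch.zeros(d, d)   # running sum of phi(k_t) v_t^T
    z = torch.zeros(d)      # running sum of phi(k_t) for normalization
    outs = []
    for t in range(q.shape[0]):
        S = S + torch.outer(phi(k[t]), v[t])
        z = z + phi(k[t])
        num = phi(q[t]) @ S
        den = phi(q[t]) @ z + 1e-6
        outs.append(num / den)
    return torch.stack(outs)

q, k, v = (torch.randn(10, 64) for _ in range(3))
print(linear_attention_rnn(q, k, v).shape)  # (10, 64)
```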