DeepSeek-OCR: Contexts Optical Compression — Reading Group Reflections

I just finished reading “DeepSeek-OCR: Contexts Optical Compression” with my reading group, and I wanted to turn my notes into a simple, clear write-up. Everything here is directly from the bullet points I wrote down after the discussion.

Links for reference:
DeepSeek-OCR on Hugging Face
Paper on arXiv

My Own Thoughts

I noted right away that the paper isn’t really about OCR in the usual sense. The real contribution feels like context compression at the model’s input: text is rendered as an image, and a vision encoder compresses it into far fewer tokens before they are passed along to the decoder. That direction stood out to me more than the OCR framing.
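To make that framing concrete for myself, here is a quick back-of-envelope sketch. The characters-per-token ratio and the vision-token budget are placeholder numbers I picked for illustration, not figures taken from the paper.

```python
# Back-of-envelope sketch of the compression idea (illustrative numbers only):
# compare how many tokens a page of text costs as plain text tokens versus as
# a small budget of vision tokens produced by the encoder.

def text_token_estimate(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough text-token count, assuming ~4 characters per token."""
    return round(num_chars / chars_per_token)

def compression_ratio(num_chars: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_token_estimate(num_chars) / vision_tokens

# A dense page of ~3,000 characters (placeholder), encoded into 100 vision
# tokens (placeholder budget for the encoder's output).
print(text_token_estimate(3000))      # ~750 text tokens
print(compression_ratio(3000, 100))   # ~7.5x fewer tokens reach the decoder
```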

I also wished the authors had included examples of what the image tokens actually look like. That would have made it easier to understand, in a concrete way, what’s being compressed and how much structure is preserved.

Another thing I’m excited about is seeing this type of model trained on more non-synthetic data. The approach seems promising, and it would be good to see how it performs once it has access to a broader range of real-world examples.

Finally, this model design feels like a possible path toward much larger context windows, maybe even 10× or 100× larger than what we have now.

What the Group Discussed

One point that came up was whether edit distance is a fair metric for this kind of task. Even a small change like losing a comma can completely alter meaning, yet the edit distance between those two outputs is tiny. That made us question how well the metric reflects the mistakes that actually matter.
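To see how small that penalty really is, here is a minimal edit-distance sketch I put together after the discussion; the example sentences are mine, not from the paper.

```python
# Minimal Levenshtein (edit) distance, to show why a tiny score can hide a
# meaning-changing error. Plain dynamic programming, no external libraries.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

reference = "Let's eat, grandma."
hypothesis = "Let's eat grandma."  # one dropped comma, very different meaning

dist = edit_distance(reference, hypothesis)
print(dist)                   # 1
print(dist / len(reference))  # ~0.05 normalized -- looks nearly perfect
```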

Someone also mentioned that DeepSeek-OCR had been tested against Azure OCR. The general agreement was that it isn’t an especially fair comparison. Azure’s offering is polished and production-ready, while DeepSeek’s work is much earlier-stage and focused more on the architecture.

We also discussed how training with this architecture at a larger scale could shape what hardware you need for foundation models. Because compression reduces the amount of information passed into the decoder, it could mean less GPU RAM is required during training.

There was also a conversation about what scraping the internet might look like for this kind of system. The idea was something like using Playwright to capture full web pages, then running some simple text analysis on the raw content before ingesting anything.
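As a sketch of what that pipeline could look like, here is a rough version using Playwright’s Python API. The quality check at the end is just a placeholder heuristic, not anything proposed in the paper or settled on by the group.

```python
# Rough sketch of the scraping idea: capture a full-page screenshot with
# Playwright, pull the visible text, and run a crude quality check before
# deciding whether to keep the page for ingestion.
from playwright.sync_api import sync_playwright

def capture_page(url: str, out_path: str) -> str:
    """Save a full-page screenshot and return the page's visible text."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)
        text = page.inner_text("body")
        browser.close()
    return text

def looks_ingestible(text: str, min_words: int = 200) -> bool:
    """Placeholder text analysis: enough words and mostly alphabetic content."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) > 0.6

text = capture_page("https://example.com", "page.png")
if looks_ingestible(text):
    print("keep: screenshot and text pass the basic checks")
else:
    print("skip: page looks too thin or too noisy to ingest")
```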

A final note from the group was that I still need to read the MoE decoder paper from DeepSeek, because that part of the architecture hasn’t fully clicked for me yet.

This Architecture at Scale

There was interest in how training with this architecture at full scale might play out. Because the encoder compresses the input so much before it reaches the rest of the model, the whole pipeline could become more efficient, and that would change how people think about hardware requirements. Compression lightens the load on GPU memory, which has implications for how large foundation models can be trained.
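Here is the rough arithmetic behind that intuition. The model shapes and the 10× compression factor below are placeholder assumptions I chose for illustration, not DeepSeek’s actual configuration.

```python
# Rough arithmetic for why compressed inputs lighten GPU memory. The shapes
# below are placeholders (roughly 7B-scale); the point is only that
# decoder-side KV-cache memory scales with the number of tokens fed in.

def kv_cache_gib(seq_len: int, layers: int = 32, heads: int = 32,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence: keys + values across all layers."""
    total_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30

uncompressed = 100_000            # text tokens for a long document (placeholder)
compressed = uncompressed // 10   # assuming roughly 10x optical compression

print(f"{kv_cache_gib(uncompressed):.1f} GiB")  # ~48.8 GiB per sequence
print(f"{kv_cache_gib(compressed):.1f} GiB")    # ~4.9 GiB per sequence
```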

What I Don’t Know

I still don’t understand the MoE decoder portion of the architecture, and I’m looking forward to reading that paper.

I also don’t yet know exactly how well this method holds up once it begins operating on large amounts of non-synthetic data, or what the image tokens look like in practice. Those details matter for understanding where the limits are.