AI intel digest
How Transformers Finally Ate Vision – Isaac Robinson, Roboflow
The video explains the shift in computer vision from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) as the dominant architecture.
Executive summary
The video explains the shift in computer vision from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) as the dominant architecture. The speaker, Isaac Robinson, argues that ViTs, despite lacking inherent inductive biases for images and having higher computational complexity (n^4 scaling with resolution), ultimately won because of massive, ViT-specific pretraining techniques (such as MAE and DINO) and the ability to leverage speedups (such as Flash Attention) developed for the booming Large Language Model (LLM) industry. This evolution is traced through models like Swin, ConvNeXt, and Hiera, which attempted to reintroduce inductive biases but were eventually outpaced by the simplicity and scalability of the original ViT when combined with advanced pretraining. The talk highlights the tradeoff between architectural inductive bias and learned bias from pretraining, concluding that the latter, enabled by scale and infrastructure, proved superior. Finally, it addresses the deployment challenges of large foundation models and introduces RF-DETR as a solution for creating flexible, hardware-efficient models without losing the benefits of ViT pretraining.
Sources mentioned
- ViT (Vision Transformer): The original transformer-based vision model, using 16x16 patches and learned positional encodings.
- Swin Transformer: Uses shifted window attention to reduce compute to n^2 and add a locality inductive bias.
- ConvNeXt: A convolutional network that incorporated transformer design principles (patchify, layer norm, hierarchical structure) and beat ViT/Swin on ImageNet.
- Hiera (Meta): Stripped inductive biases from a transformer model one at a time and used MAE pretraining to recover performance; showed speedups until Flash Attention was applied.
- MAE (Masked Autoencoder): ViT-specific pretraining technique (similar in spirit to BERT) in which patches are masked and reconstructed.
- DINOv2 / DINOv3: Self-supervised pretraining techniques for ViTs that produce rich, semantically meaningful feature maps.
- SAM / Mobile SAM / SAM 2 / SAM 3: The Segment Anything Model series; SAM 3 uses a massively pre-trained ViT backbone (800M parameters).
- Flash Attention: An optimization from the LLM world that eliminated the speed advantage of architectures like Hiera over ViT.
- RF100VL: A dataset introduced by Roboflow to measure foundation-model transfer to downstream object detection tasks.
- RF-DETR: Roboflow's model, which uses neural architecture search on a foundation model to create a family of deployment-flexible, high-performance models.
- JEPA / V-JEPA: Mentioned as alternative foundation models; the speaker notes JEPA does not outperform the others on image tasks, and V-JEPA has not yet shown meaningful downstream video transfer.
Verdict
This video is worth watching for anyone tracking the evolution of computer vision architectures and the practical deployment of foundation models. The unique signal is the clear, evidence-based narrative of *why* Transformers ate vision: not just because they are "general purpose," but because of a specific convergence of ViT-exclusive pretraining techniques (MAE, DINO), the neutralization of their computational disadvantages via LLM-derived optimizations (Flash Attention), and the resulting path dependency that made CNNs obsolete. The speaker's insider perspective from Roboflow adds practical weight to the deployment critique, particularly the demonstration that raw benchmark wins (like SAM 3) are meaningless without deployment flexibility, and how RF-DETR addresses this. It avoids the typical hype and provides a structural understanding of the current vision landscape.
Count: 11 facts, 0 assumptions, 0 demonstrations. Signal Density: 85/100. The content is highly focused on architectural evolution, specific model performance, and the interplay between pretraining and inductive bias, with minimal fluff or speculation.
Signal points
1. ViTs won vision not because of architectural superiority, but because of massive, ViT-specific pretraining (MAE, DINO) and borrowed LLM infrastructure (Flash Attention).
2. The computational disadvantage of ViTs (n^4 scaling) was neutralized by Flash Attention, making simpler architectures competitive again.
3. MAE and DINO are ViT-exclusive pretraining techniques; CNNs structurally cannot use MAE, giving ViTs an unassailable learning advantage at scale.
4. The evolution of vision backbones (ViT -> Swin -> ConvNeXt -> Hiera -> ViT) demonstrates that pretrainable simplicity beats engineered complexity.
5. SAM 3 (800M params, 300ms on a T4) exemplifies the deployment crisis: foundation models are too large and slow for edge devices.
6. RF-DETR achieves a 40x speedup over SAM 3 fine-tuning at the same accuracy by applying neural architecture search to a foundation model, solving the deployment-flexibility problem.
7. The combination of massive pretraining, LLM-derived speedups, and deployment-aware architecture search is the "final nail in the coffin" for classical CNNs in vision.
Key ideas
The core tension in vision architecture is between inherent inductive bias (CNNs) and learned bias from pretraining (ViTs).
Why: The speaker contrasts CNNs' excellent built-in inductive biases (like translation invariance) with ViTs' lack thereof, then shows how massive pretraining (MAE, DINO) allows ViTs to learn these biases and more (see the sketch after this list).
Implication: Future architecture design may prioritize scalability and pretraining compatibility over hand-engineered inductive biases.
Pretraining can recover and even surpass inductive biases that are hard-coded into architectures.
Why: Meta's Hiera stripped out inductive biases one by one, used MAE pretraining, and the model learned the biases back, achieving speedups until Flash Attention equalized performance.
Implication: The role of pretraining is not just to learn features, but to learn structural biases, making simpler, more scalable architectures viable.
The success of ViTs is heavily dependent on borrowing infrastructure and optimizations from the LLM world.
Why: The speaker notes that Flash Attention, developed for LLMs, eliminated the speed advantage of more complex vision architectures like Hiera, cementing the ViT's dominance.
Implication: Advances in one domain (LLMs) can act as force multipliers in another (vision), creating a flywheel effect.
Deployment flexibility is as critical as benchmark performance for practical AI systems.
Why: The speaker emphasizes that massive foundation models like SAM 3 are often unusable on edge devices, creating a need for methods like RF-DETR that can adapt these models to hardware constraints.
Implication: The next frontier in vision AI is not just building bigger models, but building systems that can dynamically scale to diverse deployment environments.
The evolution of vision backbones follows a pattern of increasing complexity followed by a return to simple, scalable solutions.
Why: The speaker traces a path from ViT to Swin (adding locality) to ConvNeXt (returning to convolutions with transformer lessons) to Hiera (stripping biases) and back to the simple, pretrainable ViT.
Implication: In AI, the "simple thing that scales well" often wins after a period of architectural experimentation.
ViT-specific pretraining techniques (MAE, DINO) are a key moat that convolutional networks cannot easily cross.
Why: The speaker explicitly states that MAE cannot be applied to CNNs due to their structural properties, giving ViTs an exclusive advantage in self-supervised learning at scale.
Implication: This creates a path dependency where the ecosystem invests more in ViT infrastructure, further solidifying its dominance.
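To make the contrast in the first idea concrete, here is a minimal PyTorch-style sketch written for this digest (it is not code from the talk; the toy image, patch size, and tensor shapes are assumptions). A convolution produces the same activation pattern for the same content wherever it appears, while a ViT-style patch embedding plus learned positional encodings gives identical patch content a position-dependent representation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy 64x64 image: an 8x8 bright square on a black background,
# placed either in the upper left or the bottom right.
def image_with_square(row, col, size=64, sq=8):
    img = torch.zeros(1, 1, size, size)
    img[:, :, row:row + sq, col:col + sq] = 1.0
    return img

upper_left = image_with_square(0, 0)
bottom_right = image_with_square(48, 48)

# CNN: the response to the square is the same set of activations, just
# shifted -- the translation bias the talk credits to convolutional nets.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    a = conv(upper_left)
    b = conv(bottom_right)
print(torch.allclose(a[..., :9, :9], b[..., 48:57, 48:57], atol=1e-6))  # True: same pattern, shifted

# ViT-style: 16x16 patch embedding plus a learned positional encoding.
# The square sits at the same offset inside its patch in both images, so the
# patch content is identical -- but the positional encodings differ, so the
# token representations differ. Any position bias has to be learned instead.
patch_embed = nn.Conv2d(1, 32, kernel_size=16, stride=16, bias=False)
pos_embed = nn.Parameter(torch.randn(1, 16, 32))  # one encoding per token (4x4 grid)
with torch.no_grad():
    tok_a = patch_embed(upper_left).flatten(2).transpose(1, 2) + pos_embed
    tok_b = patch_embed(bottom_right).flatten(2).transpose(1, 2) + pos_embed
print(torch.allclose(tok_a[0, 0], tok_b[0, 15]))  # False: same content, different position
```

If the two positional encodings were identical the two tokens would match exactly; the point of the idea above is that a ViT has to learn this kind of invariance from data, which is what massive pretraining supplies.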
Key facts
Vision Transformers (ViTs) split images into patches (originally 16x16) and use learned positional encodings.
Evidence (confidence: HIGH): "We take our image, we split it into uh patches. 16 by 16 was the original. And we add a learned positional encoding, and then we throw that into a transformer, and that's it."
ViTs have n^4 compute scaling with resolution (see the worked sketch after this list).
Evidence (confidence: HIGH): "We end up actually with n to the fourth power with the resolution uh compute scaling."
Swin Transformer reduces compute to n^2 by using shifted window attention, adding a locality inductive bias.
Evidence (confidence: HIGH): "This actually gets us down to n-squared if your window size is independent of your resolution, and it adds a locality inductive bias following the convolutional net."
ConvNeXt, a convolutional network incorporating transformer learnings (patchify, layer norm, hierarchical structure), beat ViT and Swin on ImageNet.
Evidence (confidence: HIGH): "Turns out that beats VIT and Swin when you apply it on the uh standard ImageNet reference."
Hiera showed a speedup over ViT for the same accuracy, but this advantage disappeared when Flash Attention was used.
Evidence (confidence: HIGH): "Hera explicitly showed a speedup for the same accuracy versus VIT, but they have a note in their paper where they where they say, 'Okay, we see the speed up and we're not going to measure with flash attention.' So, then you add back in flash attention, suddenly it doesn't really matter."
MAE (Masked Autoencoder) is a ViT-specific pretraining technique in which patches are dropped and the model learns to reconstruct them from context (see the sketch after this list).
Evidence (confidence: HIGH): "Here we're using uh uh MAE, masked autoencoder... you take your image, you take your patches, you drop a bunch of the patches, and you ask the the model to reconstruct what would have been in the patches just based on the context."
MAE cannot be applied to convolutional networks due to their patch invariance.
Evidence (confidence: HIGH): "But, you can't actually apply MAE to a convolutional network. How do you drop out a patch when you're doing this convolution that's invariant across patches?"
DINOv2/DINOv3 pretraining produces rich feature maps, with self-supervised learning approaching supervised learning performance on linear probes (a minimal linear-probe sketch appears at the end of this digest).
Evidence (confidence: HIGH): "The self-supervised learning objective is is catching up with the best that we have from supervised learning, and this is via linear probe."
SAM 3 uses a massively pre-trained ViT backbone and has 800 million parameters, taking 300ms to run on a T4 GPU.
Evidence (confidence: HIGH): "SAM 3 is this very very powerful thing, but is also 800 million parameters. It takes 300 milliseconds to run on a T4 GPU."
RF-DETR achieves a 40x speedup for the same accuracy compared to fine-tuning SAM 3, and a 15x speedup with meaningful improvement.
Evidence (confidence: HIGH): "We see about a 40x speed up for the same accuracy versus fine-tuning SAM 3, uh and for merely a 15x speed up, we get a a meaningful improvement."
RF-DETR uses neural architecture search to modify a foundation model and generate a family of high-performance models.
Evidence (confidence: HIGH): "We just modify the foundation model using neural architecture search such that we generate an entire family of uh high-performance models in in one go."
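Two of the facts above are easier to see with numbers: the n^4-versus-n^2 attention scaling and the MAE masking step. The sketch below was written for this digest rather than taken from the talk; the patch size comes from the talk, the 7x7 window is an assumed Swin-style value, and the cost figures are crude token-interaction counts, not measured FLOPs.

```python
import torch

PATCH = 16    # patch size from the talk ("16 by 16 was the original")
WINDOW = 7    # assumed fixed window size, independent of resolution (Swin-style)

def attention_cost(n):
    """Rough attention-cost proxies (pairwise token interactions) for an n x n image."""
    tokens = (n // PATCH) ** 2                    # patchify: (n/16)^2 tokens
    global_cost = tokens ** 2                     # full self-attention is quadratic in
                                                  # tokens, hence ~ (n/16)^4 = O(n^4)
    windows = tokens // WINDOW ** 2               # windowed attention: constant cost per
    windowed_cost = windows * (WINDOW ** 2) ** 2  # window, window count grows like n^2
    return global_cost, windowed_cost

for n in (224, 448, 896):
    g, w = attention_cost(n)
    print(f"{n}px: global ~{g:,}  windowed ~{w:,}")
# Doubling the resolution multiplies the global cost by ~16 (the n^4 behaviour)
# but the windowed cost by ~4 (n^2) -- the Swin trade-off described above.

# MAE-style pretraining: drop most patch tokens and reconstruct them from the
# visible ones. The masking step, applied to a batch of patch-token tensors:
def mae_mask(tokens, mask_ratio=0.75):
    """tokens: (batch, num_patches, dim) -> visible tokens plus masked indices."""
    b, num, dim = tokens.shape
    keep = int(num * (1 - mask_ratio))
    order = torch.rand(b, num).argsort(dim=1)     # independent random shuffle per image
    keep_idx, masked_idx = order[:, :keep], order[:, keep:]
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, masked_idx                    # a decoder would reconstruct the masked patches

tokens = torch.randn(2, (224 // PATCH) ** 2, 768)  # 196 patch tokens per image
visible, masked = mae_mask(tokens)
print(visible.shape, masked.shape)                 # torch.Size([2, 49, 768]) torch.Size([2, 147])
```

The masking function operates on discrete patch tokens, which is the structural reason the talk gives for MAE not mapping onto convolutional networks.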
Quotes
“Vision used to belong to CNNs... The answer runs through pretraining, scaling, borrowed infrastructure from the LLM world, and the long arc back to the simple architecture that scales best.” (Video description / speaker summary)
“A person in an image is a person regardless of whether they're in the upper left or the bottom right.” (On CNN inductive bias)
“The thing that is in the upper left can have a totally different activation pattern if it's in the bottom right.” (On the ViT's lack of inductive bias)
“It is because of massive VIT-specific pretraining, and then we get to borrow a lot of speedups and infrastructure from the fact that LLMs are blowing up.” (On why ViTs won)
“We're going to strip out the biases one at a time... and we're going to use pretraining to learn the bias instead.” (On the Hiera/MAE approach)
“You can't actually apply MAE to a convolutional network. How do you drop out a patch when you're doing this convolution that's invariant across patches?” (On the ViT-specific pretraining advantage)
“No deployment flexibility means that we have these one-size-fits-all models... SAM 3 is this very very powerful thing, but is also 800 million parameters. It takes 300 milliseconds to run on a T4 GPU.” (On deployment challenges)
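The DINO fact above reports self-supervised features "catching up" with supervised learning "via linear probe". For readers unfamiliar with the term, here is a minimal sketch of what a linear probe is; the shapes and data are toy assumptions made for this digest, not material from the talk. The pretrained backbone stays frozen and only a single linear classifier is trained on its features.

```python
import torch
import torch.nn as nn

features = torch.randn(512, 768)          # frozen backbone features (e.g. ViT CLS tokens)
labels = torch.randint(0, 10, (512,))     # class labels for the probe dataset

probe = nn.Linear(768, 10)                # the only trainable component
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):                      # the backbone is never updated; only the
    opt.zero_grad()                       # linear layer is fit to its features
    loss = loss_fn(probe(features), labels)
    loss.backward()
    opt.step()

accuracy = (probe(features).argmax(1) == labels).float().mean()
print(f"probe accuracy on these features: {accuracy:.2f}")
```

How close such a probe gets to full supervised training is the comparison the talk cites for DINOv2/DINOv3.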