🌍 TL;DR: FastVGGT observes strong similarity in VGGT's attention maps and leverages it to achieve training-free acceleration.
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which exploits token merging within the visual geometry model to achieve training-free acceleration. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1,000 input images, FastVGGT achieves a 4× speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
With longer input sequences, VGGT's inference speed is mainly limited by Global Attention.
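To see why, a quick back-of-envelope estimate helps: global attention flattens all frames into a single token sequence, so its cost grows quadratically with the number of images. A minimal sketch follows; the 1,041 tokens per image matches the tokenization described below, while the 1024-d embedding width and the per-layer FLOP count are rough illustrative assumptions:

```python
# Back-of-envelope scaling of dense global attention. TOKENS_PER_IMAGE matches
# the tokenization described below; dim=1024 is an assumption for illustration.
TOKENS_PER_IMAGE = 1041  # 1 camera + 4 register + 1036 patch tokens

def global_attention_flops(num_images: int, dim: int = 1024) -> float:
    """Rough FLOPs of one dense self-attention layer over all frames at once."""
    n = num_images * TOKENS_PER_IMAGE  # global attention concatenates every frame's tokens
    return 2.0 * n * n * dim           # QK^T plus attention-weighted V; projections ignored

for frames in (100, 300, 500, 1000):
    print(f"{frames:4d} images -> {global_attention_flops(frames) / 1e12:10.1f} TFLOPs/layer")
```

Going from 100 to 1,000 images multiplies the per-layer cost by roughly 100×, which is why Global Attention dominates the runtime at long sequence lengths.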
We visualize VGGT's Global Attention maps on the ScanNet dataset. Each image is represented by 1,041 tokens (one camera token, four register tokens, and 1,036 patch tokens from a 28 × 37 grid). Dense self-attention produces a full attention distribution for every query token, and visualizing these maps across tokens and blocks reveals that many of them are highly similar.
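This redundancy can be quantified directly. Below is a minimal sketch of such a measurement, assuming `attn` is one block's attention map captured with a forward hook; the hook wiring and the `(heads, N, N)` shape are assumptions, not part of VGGT's public API:

```python
import torch
import torch.nn.functional as F

def mean_row_similarity(attn: torch.Tensor) -> float:
    """Mean pairwise cosine similarity between per-token attention maps."""
    rows = F.normalize(attn.mean(dim=0), dim=-1)   # head-averaged, L2-normalized rows
    sim = rows @ rows.T                            # pairwise cosine similarity
    n = sim.size(0)
    return ((sim.sum() - sim.diagonal().sum()) / (n * (n - 1))).item()  # exclude self-similarity

attn = torch.softmax(torch.randn(16, 256, 256), dim=-1)  # dummy stand-in for a captured map
print(f"mean pairwise similarity of attention rows: {mean_row_similarity(attn):.3f}")
```

A value close to 1 means many tokens attend in nearly the same way, i.e., much of the dense attention computation is redundant.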
We mitigate redundant attention in VGGT by adopting training-free token merging. While token merging is well established in 2D vision, its application to 3D architectures remains underexplored. Since visual geometry depends on cross-image correspondences, we propose three tailored yet simple and effective token merging strategies; an illustrative sketch follows.
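As a rough illustration of the idea, the sketch below implements ToMe-style bipartite matching with a 3D-aware partition: tokens of the first (reference) frame and all camera/register tokens are protected from merging, a small random subset of every other frame's tokens is kept as merge targets, and the remaining tokens are merged into their most similar target. The merge ratio, the target fraction, and the unweighted averaging are illustrative assumptions, not FastVGGT's exact scheme:

```python
import torch

def merge_tokens(x: torch.Tensor, frame_ids: torch.Tensor,
                 is_special: torch.Tensor, ratio: float = 0.5,
                 dst_frac: float = 0.1):
    """ToMe-style token merging with a 3D-aware partition (illustrative only).

    x          : (N, C) token features for the whole multi-view sequence
    frame_ids  : (N,)   source-image index of each token
    is_special : (N,)   True for camera/register tokens (never merged)
    ratio      : fraction of mergeable tokens to merge away (assumed value)
    dst_frac   : per-frame fraction kept as merge targets (assumed value)
    """
    # Protect the reference frame (it anchors the global coordinate system),
    # all special tokens, and a random subset of every other frame's tokens.
    protected = (frame_ids == 0) | is_special | (torch.rand(x.size(0)) < dst_frac)
    dst_idx = protected.nonzero(as_tuple=True)[0]
    src_idx = (~protected).nonzero(as_tuple=True)[0]

    # Match each mergeable (src) token to its most similar protected (dst) token.
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn[src_idx] @ xn[dst_idx].T                # (S, D) cosine similarities
    best_sim, best_dst = sim.max(dim=-1)

    # Merge the r most redundant src tokens; keep the rest unmerged.
    r = int(ratio * src_idx.numel())
    order = best_sim.argsort(descending=True)
    merged, kept_src = order[:r], src_idx[order[r:]]

    out = x.clone()
    for s in merged:                                  # loop for clarity; batch in practice
        d = dst_idx[best_dst[s]]
        out[d] = (out[d] + x[src_idx[s]]) / 2         # unweighted average for brevity
    keep = torch.cat([dst_idx, kept_src]).sort().values
    return out[keep], keep                            # surviving tokens + their indices
```

Global Attention then runs on the reduced sequence; merged tokens can be copied back ("unmerged") to their original positions before the prediction heads, a step omitted here for brevity.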
Chamfer Distance (CD, lower is better) and end-to-end inference time for 1000 / 500 / 300 / 100 input images (OOM = out of memory):

| Method | CD ↓ (1000) | Time ↓ (1000) | CD ↓ (500) | Time ↓ (500) | CD ↓ (300) | Time ↓ (300) | CD ↓ (100) | Time ↓ (100) |
|---|---|---|---|---|---|---|---|---|
| π³ | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| StreamVGGT | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| Fast3R | 0.684 | 397.8s | 0.701 | 97.3s | 0.711 | 34.9s | 0.723 | 4.8s |
| CUT3R | 0.786 | 34.8s | 0.774 | 18.8s | 0.775 | 11.1s | 0.767 | 3.6s |
| VGGT* | 0.471 | 724.6s | 0.420 | 177.5s | 0.416 | 131.4s | 0.423 | 9.1s |
| FastVGGT | 0.425 | 180.7s | 0.411 | 55.2s | 0.416 | 23.8s | 0.426 | 5.4s |
Notably, with 1000 input images FastVGGT is roughly 4× faster than VGGT (180.7s vs. 724.6s) while achieving a lower Chamfer Distance (0.425 vs. 0.471), indicating that token merging also mitigates error accumulation on very long sequences.
Special thanks to Jianyuan Wang for his valuable discussions and suggestions on this work.
Thanks to these great repositories: VGGT, Dust3r, Fast3R, CUT3R, MV-DUSt3R+, StreamVGGT, VGGT-Long and many other inspiring works in the community.