⚡️ FastVGGT: Training-Free Acceleration of
Visual Geometry Transformer

You Shen1,     Zhipeng Zhang2,     Yansong Qu1,     Liujuan Cao1

🌍 TL;DR: FastVGGT observes strong similarity among VGGT's attention maps and leverages token merging for training-free acceleration.

[Figure: FastVGGT overview]

Abstract

Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in its attention maps. Motivated by these findings, we explore the potential of token merging in feed-forward visual geometry models. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which exploits token merging within the visual geometry model to achieve training-free acceleration. We devise a token partitioning strategy tailored to 3D architectures and tasks, eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1,000 input images, FastVGGT achieves a 4× speedup over VGGT while mitigating error accumulation in long-sequence scenarios.


🤏 Bottleneck

With longer input sequences, VGGT's inference speed is mainly limited by Global Attention.
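To see why this term dominates, consider the quadratic growth of dense global self-attention. The snippet below is back-of-the-envelope arithmetic only; it assumes the per-image token count reported in the Observation section and ignores heads, feature dimensions, and the number of blocks.

```python
# Back-of-the-envelope scaling of dense global self-attention.
# Assumption: every image contributes 1,041 tokens (see Observation below);
# real FLOPs also depend on heads, feature dims, and the number of blocks.
TOKENS_PER_IMAGE = 1041  # 1 camera + 4 register + 1036 patch tokens (28 x 37)

def attention_pairs(num_images: int) -> int:
    """Query-key pairs in one global attention layer over all frames."""
    n = num_images * TOKENS_PER_IMAGE
    return n * n  # dense self-attention is quadratic in total tokens

for s in (100, 300, 500, 1000):
    print(f"{s:4d} images -> {attention_pairs(s):.2e} query-key pairs")
# 10x more images => ~100x more attention work
```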

[Figure: VGGT architecture]

💎 Observation

We visualize VGGT's Global Attention maps on the ScanNet dataset. Each image is represented by 1,041 tokens (one camera token, four register tokens, and 1,036 patch tokens from a 28 × 37 grid). The dense self-attention mechanism generates an attention map for every token, and visualizations across tokens and blocks reveal that many of these maps are highly similar.
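One simple way to quantify this redundancy is the pairwise cosine similarity between rows of an attention matrix. The probe below is a hypothetical sketch: the attention tensor's shape and how it would be extracted from VGGT are assumptions, not the paper's measurement protocol.

```python
import torch
import torch.nn.functional as F

def attention_map_similarity(attn: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between per-query attention maps.

    attn: (num_queries, num_keys), e.g. one head of one Global Attention
    block after softmax. Entry [i, j] near 1.0 means queries i and j
    attend to (almost) the same keys.
    """
    rows = F.normalize(attn, dim=-1)
    return rows @ rows.T

# Toy check with a simulated "collapsed" pair of tokens.
attn = torch.softmax(torch.randn(8, 1041), dim=-1)
attn[1] = attn[0]  # token 1 collapses onto token 0's map
sim = attention_map_similarity(attn)
print(sim[0, 1].item())  # 1.0 for the duplicated pair
```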

[Figure: Global Attention maps visualized across tokens and blocks]

Method

We mitigate redundant attention in VGGT by adopting training-free token merging. While token merging is well established in 2D vision, its application to 3D architectures remains underexplored. Since visual geometry depends on cross-image correspondences, we propose three tailored yet simple and effective token merging strategies (a minimal sketch follows the list):

  1. Tokens from the initial frame, which serves as the global reference for the entire scene, are designated as high-priority dst tokens and are exempt from being merged to ensure reconstruction stability.
  2. To maintain global consistency and preserve fine-grained details, we identify and retain the most salient tokens across all frames, allowing them to bypass the merging process entirely and participate directly in the attention computation.
  3. Drawing inspiration from ToMeSD, we implement region-based random sampling within each subsequent frame. This ensures a spatially balanced selection of src and dst tokens, preventing critical information loss in localized regions during consolidation.
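To make the partitioning concrete, here is a minimal PyTorch sketch of the three rules. It is not the released FastVGGT code: the saliency proxy (token norm), the region size, the keep fraction, and the averaging step are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, frame_ids: torch.Tensor,
                 region: int = 4, salient_frac: float = 0.05):
    """ToMe-style merge implementing the three rules above (illustrative only).

    x:         (N, C) tokens from all frames entering a global attention block
    frame_ids: (N,)   frame index of each token
    Returns the merged tokens and a boolean mask of kept positions.
    """
    N, _ = x.shape
    device = x.device

    # Rule 1: the initial frame is the global reference -- never merged.
    protected = frame_ids == 0
    # Rule 2: retain the most salient tokens everywhere. Token norm is a
    # stand-in saliency score here, not necessarily the paper's criterion.
    saliency = x.norm(dim=-1)
    protected[saliency.topk(int(salient_frac * N)).indices] = True

    # Rule 3: region-based random sampling (a la ToMeSD) over the rest:
    # one random dst token per contiguous region; the others become src.
    cand = (~protected).nonzero(as_tuple=True)[0]
    dst_idx = torch.stack([r[torch.randint(len(r), (1,), device=device)][0]
                           for r in cand.split(region)])
    src_mask = torch.zeros(N, dtype=torch.bool, device=device)
    src_mask[cand] = True
    src_mask[dst_idx] = False
    src_idx = src_mask.nonzero(as_tuple=True)[0]

    # Bipartite soft matching: each src token merges into its most
    # similar dst token; merged features are averaged.
    sim = F.normalize(x[src_idx], dim=-1) @ F.normalize(x[dst_idx], dim=-1).T
    match = dst_idx[sim.argmax(dim=-1)]
    merged = x.clone()
    merged.index_add_(0, match, x[src_idx])
    counts = torch.ones(N, device=device)
    counts.index_add_(0, match, torch.ones(len(src_idx), device=device))
    merged = merged / counts.unsqueeze(-1)

    keep = ~src_mask  # protected + dst tokens survive; src tokens are dropped
    return merged[keep], keep
```

A full pipeline would also record the src-to-dst mapping so tokens can be unmerged before the dense prediction heads, analogous to what ToMeSD does for diffusion models.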
[Figure: FastVGGT token merging architecture]

Quantitative Comparison

FastVGGT delivers substantial acceleration over baseline VGGT across all settings while preserving reconstruction accuracy; at 1,000 images, 724.6 s vs. 180.7 s is a ≈4× speedup.
Method       |  1000 images    |  500 images     |  300 images     |  100 images
             |  CD ↓   Time ↓  |  CD ↓   Time ↓  |  CD ↓   Time ↓  |  CD ↓   Time ↓
π³           |  OOM            |  OOM            |  OOM            |  OOM
StreamVGGT   |  OOM            |  OOM            |  OOM            |  OOM
Fast3R       |  0.684  397.8s  |  0.701  97.3s   |  0.711  34.9s   |  0.723  4.8s
CUT3R        |  0.786  34.8s   |  0.774  18.8s   |  0.775  11.1s   |  0.767  3.6s
VGGT*        |  0.471  724.6s  |  0.420  177.5s  |  0.416  131.4s  |  0.423  9.1s
FastVGGT     |  0.425  180.7s  |  0.411  55.2s   |  0.416  23.8s   |  0.426  5.4s

Column groups give the number of input images; CD = Chamfer Distance (lower is better); Time is inference time in seconds; OOM = out of memory.

Qualitative Visualization

Notably, when processing very long sequences (e.g., 1000 images), FastVGGT not only maintains reconstruction fidelity but also significantly mitigates error accumulation.

[Figure: qualitative comparison of FastVGGT and VGGT reconstructions]

Acknowledgements

Special thanks to Jianyuan Wang for his valuable discussions and suggestions on this work.

Thanks to these great repositories: VGGT, Dust3r, Fast3R, CUT3R, MV-DUSt3R+, StreamVGGT, VGGT-Long and many other inspiring works in the community.