🌍 TL;DR: FastVGGT observes strong similarity in VGGT's attention maps and leverages it to achieve training-free acceleration.
Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, scaling these models to long-sequence inputs remains a significant challenge due to inference-time inefficiency. In this work, we present a detailed analysis of VGGT, a state-of-the-art feed-forward visual geometry model, and identify its primary bottleneck. Visualization further reveals a token collapse phenomenon in the attention maps. Motivated by these findings, we explore the potential of token merging in the feed-forward visual geometry model. Owing to the unique architectural and task-specific properties of 3D models, directly applying existing merging techniques proves challenging. To this end, we propose FastVGGT, which exploits token merging within the visual geometry model to achieve training-free acceleration. We devise a unique token partitioning strategy tailored to 3D architectures and tasks, effectively eliminating redundant computation while preserving VGGT's powerful reconstruction capacity. Extensive experiments on multiple 3D geometry benchmarks validate the effectiveness of our approach. Notably, with 1,000 input images, FastVGGT achieves a 4× speedup over VGGT while mitigating error accumulation in long-sequence scenarios.
With longer input sequences, VGGT's inference speed is mainly limited by Global Attention.
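To see why, a quick back-of-envelope estimate helps: global attention flattens all frames into a single token sequence, so its cost grows quadratically with the number of images. A minimal sketch follows; the 1,041 tokens per image matches the tokenization described below, while the 1024-d embedding width and the per-layer FLOP count are rough illustrative assumptions:

```python
# Back-of-envelope scaling of dense global attention. TOKENS_PER_IMAGE matches
# the tokenization described below; dim=1024 is an assumption for illustration.
TOKENS_PER_IMAGE = 1041  # 1 camera + 4 register + 1036 patch tokens

def global_attention_flops(num_images: int, dim: int = 1024) -> float:
    """Rough FLOPs of one dense self-attention layer over all frames at once."""
    n = num_images * TOKENS_PER_IMAGE  # global attention concatenates every frame's tokens
    return 2.0 * n * n * dim           # QK^T plus attention-weighted V; projections ignored

for frames in (100, 300, 500, 1000):
    print(f"{frames:4d} images -> {global_attention_flops(frames) / 1e12:10.1f} TFLOPs/layer")
```

Going from 100 to 1,000 images multiplies the per-layer cost by roughly 100×, which is why Global Attention dominates the runtime at long sequence lengths.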
We visualize VGGT's Global Attention maps on the ScanNet dataset. Each image is represented by 1,041 tokens (one camera token, four register tokens, and 1,036 patch tokens from a 28 × 37 grid). Dense self-attention produces a full attention distribution for every query token, and visualizing these maps across tokens and blocks reveals that many of them are highly similar.
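This redundancy can be quantified directly. Below is a minimal sketch of such a measurement, assuming `attn` is one block's attention map captured with a forward hook; the hook wiring and the `(heads, N, N)` shape are assumptions, not part of VGGT's public API:

```python
import torch
import torch.nn.functional as F

def mean_row_similarity(attn: torch.Tensor) -> float:
    """Mean pairwise cosine similarity between per-token attention maps."""
    rows = F.normalize(attn.mean(dim=0), dim=-1)   # head-averaged, L2-normalized rows
    sim = rows @ rows.T                            # pairwise cosine similarity
    n = sim.size(0)
    return ((sim.sum() - sim.diagonal().sum()) / (n * (n - 1))).item()  # exclude self-similarity

attn = torch.softmax(torch.randn(16, 256, 256), dim=-1)  # dummy stand-in for a captured map
print(f"mean pairwise similarity of attention rows: {mean_row_similarity(attn):.3f}")
```

A value close to 1 means many tokens attend in nearly the same way, i.e., much of the dense attention computation is redundant.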
We mitigate redundant attention in VGGT by adopting training-free token merging. While token merging is well established in 2D vision, its application to 3D architectures remains underexplored. Since visual geometry depends on cross-image correspondences, we propose three tailored yet simple and effective token merging strategies; an illustrative sketch follows.
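As a rough illustration of the idea, the sketch below implements ToMe-style bipartite matching with a 3D-aware partition: tokens of the first (reference) frame and all camera/register tokens are protected from merging, a small random subset of every other frame's tokens is kept as merge targets, and the remaining tokens are merged into their most similar target. The merge ratio, the target fraction, and the unweighted averaging are illustrative assumptions, not FastVGGT's exact scheme:

```python
import torch

def merge_tokens(x: torch.Tensor, frame_ids: torch.Tensor,
                 is_special: torch.Tensor, ratio: float = 0.5,
                 dst_frac: float = 0.1):
    """ToMe-style token merging with a 3D-aware partition (illustrative only).

    x          : (N, C) token features for the whole multi-view sequence
    frame_ids  : (N,)   source-image index of each token
    is_special : (N,)   True for camera/register tokens (never merged)
    ratio      : fraction of mergeable tokens to merge away (assumed value)
    dst_frac   : per-frame fraction kept as merge targets (assumed value)
    """
    # Protect the reference frame (it anchors the global coordinate system),
    # all special tokens, and a random subset of every other frame's tokens.
    protected = (frame_ids == 0) | is_special | (torch.rand(x.size(0)) < dst_frac)
    dst_idx = protected.nonzero(as_tuple=True)[0]
    src_idx = (~protected).nonzero(as_tuple=True)[0]

    # Match each mergeable (src) token to its most similar protected (dst) token.
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn[src_idx] @ xn[dst_idx].T                # (S, D) cosine similarities
    best_sim, best_dst = sim.max(dim=-1)

    # Merge the r most redundant src tokens; keep the rest unmerged.
    r = int(ratio * src_idx.numel())
    order = best_sim.argsort(descending=True)
    merged, kept_src = order[:r], src_idx[order[r:]]

    out = x.clone()
    for s in merged:                                  # loop for clarity; batch in practice
        d = dst_idx[best_dst[s]]
        out[d] = (out[d] + x[src_idx[s]]) / 2         # unweighted average for brevity
    keep = torch.cat([dst_idx, kept_src]).sort().values
    return out[keep], keep                            # surviving tokens + their indices
```

Global Attention then runs on the reduced sequence; merged tokens can be copied back ("unmerged") to their original positions before the prediction heads, a step omitted here for brevity.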
Chamfer Distance (CD, lower is better) and end-to-end inference time for 1000 / 500 / 300 / 100 input images (OOM = out of memory):

| Method | CD ↓ (1000) | Time ↓ (1000) | CD ↓ (500) | Time ↓ (500) | CD ↓ (300) | Time ↓ (300) | CD ↓ (100) | Time ↓ (100) |
|---|---|---|---|---|---|---|---|---|
| π³ | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| StreamVGGT | OOM | OOM | OOM | OOM | OOM | OOM | OOM | OOM |
| Fast3R | 0.684 | 397.8s | 0.701 | 97.3s | 0.711 | 34.9s | 0.723 | 4.8s |
| CUT3R | 0.786 | 34.8s | 0.774 | 18.8s | 0.775 | 11.1s | 0.767 | 3.6s |
| VGGT* | 0.471 | 724.6s | 0.420 | 177.5s | 0.416 | 131.4s | 0.423 | 9.1s |
| FastVGGT | 0.425 | 180.7s | 0.411 | 55.2s | 0.416 | 23.8s | 0.426 | 5.4s |
Notably, with 1000 input images FastVGGT is roughly 4× faster than VGGT (180.7s vs. 724.6s) while achieving a lower Chamfer Distance (0.425 vs. 0.471), indicating that token merging also mitigates error accumulation on very long sequences.
Special thanks to Jianyuan Wang for his valuable discussions and suggestions on this work.
Thanks to these great repositories: VGGT, Dust3r, Fast3R, CUT3R, MV-DUSt3R+, StreamVGGT, VGGT-Long and many other inspiring works in the community.