Seeing Through Clutter:
Structured 3D Scene Reconstruction via Iterative Object Removal

3DV 2026
1University of California, San Diego 2Adobe Research

Given an input image, a VLM repeatedly identifies the most prominent foreground object, reconstructs and poses it in 3D, removes it from the image, and then iterates this process until no objects remain.

Abstract

We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, we remove foreground objects one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on the 3D-Front and ADE20K datasets.

Motivation

Real-world scenes are rarely tidy. Objects overlap and occlude one another, creating clutter that poses fundamental challenges for scene understanding. The intermediate tasks that most 3D reconstruction pipelines rely on — semantic segmentation and depth estimation — degrade significantly under these conditions, producing fragmented masks and unreliable geometry.

The key insight is simple: a cluttered scene is hard to reconstruct all at once, but a clean one is not. By iteratively removing the most prominent foreground object and revisiting the scene, we transform a difficult global problem into a sequence of tractable local ones.
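The peel-one-object-at-a-time idea above can be sketched as a simple loop. This is a minimal illustrative sketch, not the authors' actual implementation: `pick_object`, `segment`, `fit_3d`, and `inpaint` are hypothetical stand-ins for the VLM orchestrator, the detector/segmenter, the single-object 3D fitter, and the object-removal model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PlacedObject:
    label: str    # object name chosen by the VLM
    mask: object  # 2D segmentation mask of the object
    mesh: object  # fitted 3D asset, posed in the scene

def reconstruct_scene(
    image,
    pick_object: Callable,  # VLM: most prominent foreground object, or None
    segment: Callable,      # (image, label) -> mask
    fit_3d: Callable,       # (image, mask) -> posed 3D asset
    inpaint: Callable,      # (image, mask) -> image with the object removed
    max_objects: int = 50,
):
    """Iteratively peel off foreground objects until none remain."""
    scene = []
    for _ in range(max_objects):
        label = pick_object(image)  # VLM orchestrator step
        if label is None:           # no foreground objects left
            break
        mask = segment(image, label)  # cleaner mask thanks to prior removals
        scene.append(PlacedObject(label, mask, fit_3d(image, mask)))
        image = inpaint(image, mask)  # remove the object, then revisit the scene
    return scene
```

Each iteration sees a progressively less cluttered image, which is why segmentation and 3D fitting of later objects become easier rather than harder.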

Motivation: cluttered vs. decluttered reconstruction

(A) In a cluttered scene, occluded objects like the table yield noisy, fragmented 3D reconstructions. (B) After removing foreground objects, the same table is fully visible, enabling a clean and accurate 3D fit.

Results

Segmentation: MIT Scene Parsing

Iterative object removal directly improves segmentation quality on the ADE20K / MIT Scene Parsing benchmark. By inpainting each removed object before segmenting the next, subsequent detections benefit from a cleaner context — particularly for partially occluded objects where the baseline (without inpainting) produces fragmented, noisy masks.

MIT Scene Parsing segmentation results

3D Reconstruction: 3D-Front

We compare against MIDI* and Gen3DSR on the 3D-Front dataset. Our method produces more complete and better-posed scene reconstructions across a range of room layouts and clutter levels.

In-the-Wild Results

We evaluate on diverse real-world images spanning outdoor street scenes, medical settings, and cluttered indoor environments. Our method generalizes well without any scene-specific fine-tuning.

Text-to-Scene Results

Our pipeline extends naturally to text-conditioned inputs. Given a text prompt, we first generate an image and then apply our iterative reconstruction pipeline to produce a structured 3D scene — no 3D supervision or additional training required.

BibTeX

@inproceedings{aguinakang2026stc,
  title={Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal},
  author={Rio Aguina-Kang and Kevin James Blackburn-Matzen and Thibault Groueix and Vladimir Kim and Matheus Gadelha},
  booktitle={Thirteenth International Conference on 3D Vision},
  year={2026}
}