Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As it is intractable for data-driven learning to comprehensively encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike the well-studied instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer for AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise and shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.
We propose a new task of finding 3D scene analogies. Given two scenes containing regions with similar scene contexts, 3D scene analogies are defined as dense mappings between the regions.
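Stated compactly (the notation below is ours, not taken from the paper), a 3D scene analogy between two regions with similar contexts is a smooth, dense map between them:

```latex
% S_A, S_B: the two scenes; R_A \subset S_A, R_B \subset S_B: regions with
% similar scene contexts. A 3D scene analogy is a smooth dense map
\[
    f : R_A \longrightarrow R_B
\]
% that sends every point x \in R_A to a point f(x) \in R_B whose spatial
% relationships to surrounding objects and open space are analogous.
```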
Scene contexts, which are defined by relations between objects and open spaces, are often hard to explicitly model in detail. We instead have machines search for analogies, which can implicitly capture the structural commonalities in 3D scenes.
Scene analogies are useful for tasks requiring fine-grained 3D transfer. One can transfer motion trajectories while preserving contextual information, or transfer object placements, where objects from one scene are mapped to another using the dense maps.
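As an illustrative sketch (not the released code), once a scene analogy is available as a callable dense map, both kinds of transfer reduce to applying the map point-wise; `scene_map`, `transfer_trajectory`, and `transfer_placement` below are hypothetical names.

```python
import numpy as np

def transfer_trajectory(scene_map, waypoints):
    """Map each trajectory waypoint from the source region into the reference region."""
    return scene_map(np.asarray(waypoints, dtype=np.float64))  # (N, 3) -> (N, 3)

def transfer_placement(scene_map, object_points):
    """Map an object's sampled points; a placement pose can then be re-fitted to them."""
    mapped = scene_map(np.asarray(object_points, dtype=np.float64))
    return mapped, mapped.mean(axis=0)  # mapped points and their centroid as a placement cue
```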
Given a pair of scenes expressed as sparse keypoints, our neural contextual scene maps method finds a 3D scene analogy connecting the region of interest to the corresponding region in the reference scene.
Descriptor fields are defined at arbitrary query points in 3D space by aggregating the distances and semantic information of scene points within a designated radius.
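A minimal sketch of such a descriptor, assuming the scene is given as labeled keypoints (3D positions plus semantic class ids); the actual aggregation used in the paper may differ, and here distances are simply binned per semantic class within the radius.

```python
import numpy as np

def descriptor(query, points, labels, num_classes, radius=1.0, num_bins=8):
    """Per-class histogram of distances from a 3D query point to nearby scene points."""
    dists = np.linalg.norm(points - query, axis=1)      # distance to every scene point
    mask = dists < radius                                # keep points inside the radius
    bins = np.minimum((dists[mask] / radius * num_bins).astype(int), num_bins - 1)
    desc = np.zeros((num_classes, num_bins))
    for cls, b in zip(labels[mask], bins):
        desc[cls, b] += 1.0                              # accumulate per (class, distance bin)
    return desc / max(mask.sum(), 1)                     # normalize by neighborhood size
```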
Given the descriptor fields, we estimate scene maps through a coarse-to-fine process: we first estimate coarse affine maps and then refine them with local displacements.
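The sketch below illustrates the two stages under simplifying assumptions: the coarse stage fits a least-squares affine map to corresponding points, and the fine stage searches a small grid of offsets for the displacement whose reference-scene descriptor best matches the source descriptor (the grid search and L2 descriptor distance are our stand-ins, not necessarily the paper's choices).

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map (A, t) with dst ≈ src @ A.T + t, for (N, 3) point sets."""
    src_h = np.hstack([src, np.ones((len(src), 1))])    # homogeneous source coordinates
    sol, *_ = np.linalg.lstsq(src_h, dst, rcond=None)   # (4, 3) solution
    return sol[:3].T, sol[3]

def refine_point(p_mapped, desc_src, field_ref, step=0.05, search=2):
    """Pick the local offset whose reference-field descriptor best matches desc_src."""
    offsets = step * np.array([[i, j, k]
                               for i in range(-search, search + 1)
                               for j in range(-search, search + 1)
                               for k in range(-search, search + 1)])
    costs = [np.linalg.norm(field_ref(p_mapped + o) - desc_src) for o in offsets]
    return p_mapped + offsets[int(np.argmin(costs))]    # refined (fine-stage) position
```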
Our method finds detail-preserving maps both for open spaces and for points near object surfaces.
Our method also handles noisy 3D scans, for both open-space and near-surface points. It further handles sim2real cases, matching regions between synthetic scenes and noisy real-world 3D scans.
3D scene analogies can be used for long trajectory transfer by first mapping a sparse set of waypoints and applying classical path planning to interpolate between the mapped waypoints.
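A sketch of this pipeline, assuming `scene_map` is a dense (N, 3) → (N, 3) map from the estimated scene analogy; dense linear interpolation stands in for a real collision-aware planner here.

```python
import numpy as np

def plan_segment(a, b, step=0.05):
    """Placeholder for classical path planning between two mapped waypoints."""
    n = max(int(np.linalg.norm(b - a) / step), 1)
    return np.linspace(a, b, n, endpoint=False)          # straight-line stand-in

def transfer_long_trajectory(scene_map, sparse_waypoints):
    """Map sparse waypoints through the scene analogy, then fill in the segments."""
    mapped = scene_map(np.asarray(sparse_waypoints, dtype=np.float64))
    segments = [plan_segment(mapped[i], mapped[i + 1]) for i in range(len(mapped) - 1)]
    return np.vstack(segments + [mapped[-1:]])           # full interpolated trajectory
```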
@InProceedings{Kim_2025_ICCV,
    author    = {Kim, Junho and Bae, Gwangtak and Lee, Eun Sun and Kim, Young Min},
    title     = {Learning 3D Scene Analogies with Neural Contextual Scene Maps},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
}