Text-to-Scene with Large Reasoning Models

Indoor Scenes

A cozy living room of size 5 by 5 units. There is a plant on a small table in front of the L-shaped sofa.
A game room of size 5 by 7 units. There are arcade machines lined up along one of the walls. In the middle is a foosball table and a ping pong table.
A standard bedroom of size 4 by 5. It is well lit.
A nursery room for two babies. The room is of size 4 by 4 units.
An entertainment room with a pool table in the middle. The room is 5 by 5 units.

Outdoor Scenes

Bridge Composition

You need to build a bridge. You have 24 bricks at your disposal and need to cross a river of width at least 8 units.
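As a toy illustration of the quantitative constraints in this prompt, the sketch below checks whether a proposed brick deck spans the river. The brick length and the specific layout are illustrative assumptions; only the 24-brick budget and the 8-unit river width come from the task.

```python
# Toy feasibility check for the bridge task. The brick length is an
# assumption; the 24-brick budget and 8-unit river width are from the prompt.
BRICK_LENGTH = 2.0   # assumed span of one brick, in scene units
BUDGET = 24          # bricks available
RIVER_WIDTH = 8.0    # minimum gap to cross

def can_span(num_deck_bricks: int) -> bool:
    """True if laying the deck bricks end to end covers the river width."""
    return num_deck_bricks * BRICK_LENGTH >= RIVER_WIDTH

# Under these assumptions, a 4-brick deck (4 * 2.0 = 8.0 units) just spans
# the river, leaving 20 bricks of the budget for piers and supports.
deck_bricks = 4
assert deck_bricks <= BUDGET and can_span(deck_bricks)
```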

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to adhere weakly to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D retrieves objects using captions that cover physical, functional, and contextual attributes. It then places the selected objects according to implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release our codebase to support further research on object retrieval and placement with LRMs.

Method

Pipeline Diagram
Figure: Our model pipeline from text to scene generation.
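The pipeline's three stages (caption-based asset retrieval, constraint-based placement, and collision-aware refinement) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the asset catalogue, the word-overlap retrieval score, and the pairwise push-apart refinement are all assumptions standing in for the LRM-driven components.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    caption: str                # physical, functional, and contextual attributes
    size: tuple[float, float]   # footprint (width, depth) in scene units

@dataclass
class Placement:
    asset: Asset
    x: float
    y: float

def retrieve(query: str, catalogue: list[Asset]) -> Asset:
    """Toy retrieval: pick the asset whose caption shares the most words
    with the query (the paper uses LRM-generated captions instead)."""
    words = set(query.lower().split())
    return max(catalogue,
               key=lambda a: len(words & set(a.caption.lower().split())))

def overlaps(a: Placement, b: Placement) -> bool:
    """Axis-aligned footprint overlap test between two placed assets."""
    return (abs(a.x - b.x) < (a.asset.size[0] + b.asset.size[0]) / 2
            and abs(a.y - b.y) < (a.asset.size[1] + b.asset.size[1]) / 2)

def refine(placements: list[Placement],
           step: float = 0.5, max_iters: int = 100) -> None:
    """Collision-aware refinement: repeatedly push overlapping pairs
    apart along x until no footprints intersect (or iterations run out)."""
    for _ in range(max_iters):
        collided = False
        for i, a in enumerate(placements):
            for b in placements[i + 1:]:
                if overlaps(a, b):
                    a.x -= step
                    b.x += step
                    collided = True
        if not collided:
            break
```

A usage example under the same assumptions: retrieve a plant from a two-asset catalogue, place it on top of a sofa's footprint, and let `refine` resolve the collision.

```python
catalogue = [Asset("sofa", "L-shaped sofa seating", (2.0, 1.0)),
             Asset("plant", "small potted plant decor", (0.5, 0.5))]
plant = retrieve("a plant on a small table", catalogue)
scene = [Placement(catalogue[0], 2.0, 2.0), Placement(plant, 2.0, 2.0)]
refine(scene)
assert not overlaps(scene[0], scene[1])
```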