Text-to-Scene with Large Reasoning Models

Indoor Scenes

A cozy living room of size 5 by 5 units. There is a plant on a small table in front of the L-shaped sofa.
A game room of size 5 by 7 units. There are arcade machines lined up along one of the walls. In the middle is a foosball table and a ping pong table.
A standard bedroom of size 4 by 5. It is well lit.
A nursery room for two babies. The room is of size 4 by 4 units.
An entertainment room with a pool table in the middle. The room is 5 by 5 units.

Outdoor Scenes

Bridge Composition

You need to build a bridge. You have 24 bricks at your disposal and need to cross a river of width at least 8 units.
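As a toy illustration of the quantitative constraints in this prompt, the sketch below checks whether a proposed brick deck spans the river. The brick length and the specific layout are illustrative assumptions; only the 24-brick budget and the 8-unit river width come from the task.

```python
# Toy feasibility check for the bridge task. The brick length is an
# assumption; the 24-brick budget and 8-unit river width are from the prompt.
BRICK_LENGTH = 2.0   # assumed span of one brick, in scene units
BUDGET = 24          # bricks available
RIVER_WIDTH = 8.0    # minimum gap to cross

def can_span(num_deck_bricks: int) -> bool:
    """True if laying the deck bricks end to end covers the river width."""
    return num_deck_bricks * BRICK_LENGTH >= RIVER_WIDTH

# Under these assumptions, a 4-brick deck (4 * 2.0 = 8.0 units) just spans
# the river, leaving 20 bricks of the budget for piers and supports.
deck_bricks = 4
assert deck_bricks <= BUDGET and can_span(deck_bricks)
```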

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to adhere weakly to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D retrieves objects using captions that cover physical, functional, and contextual attributes. It then places the selected objects according to implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release our codebase to support further research on object retrieval and placement with LRMs.

Method

Pipeline Diagram
Figure: Our model pipeline from text to scene generation.
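The pipeline's three stages (caption-based asset retrieval, constraint-based placement, and collision-aware refinement) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the asset catalogue, the word-overlap retrieval score, and the pairwise push-apart refinement are all assumptions standing in for the LRM-driven components.

```python
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    caption: str                # physical, functional, and contextual attributes
    size: tuple[float, float]   # footprint (width, depth) in scene units

@dataclass
class Placement:
    asset: Asset
    x: float
    y: float

def retrieve(query: str, catalogue: list[Asset]) -> Asset:
    """Toy retrieval: pick the asset whose caption shares the most words
    with the query (the paper uses LRM-generated captions instead)."""
    words = set(query.lower().split())
    return max(catalogue,
               key=lambda a: len(words & set(a.caption.lower().split())))

def overlaps(a: Placement, b: Placement) -> bool:
    """Axis-aligned footprint overlap test between two placed assets."""
    return (abs(a.x - b.x) < (a.asset.size[0] + b.asset.size[0]) / 2
            and abs(a.y - b.y) < (a.asset.size[1] + b.asset.size[1]) / 2)

def refine(placements: list[Placement],
           step: float = 0.5, max_iters: int = 100) -> None:
    """Collision-aware refinement: repeatedly push overlapping pairs
    apart along x until no footprints intersect (or iterations run out)."""
    for _ in range(max_iters):
        collided = False
        for i, a in enumerate(placements):
            for b in placements[i + 1:]:
                if overlaps(a, b):
                    a.x -= step
                    b.x += step
                    collided = True
        if not collided:
            break
```

A usage example under the same assumptions: retrieve a plant from a two-asset catalogue, place it on top of a sofa's footprint, and let `refine` resolve the collision.

```python
catalogue = [Asset("sofa", "L-shaped sofa seating", (2.0, 1.0)),
             Asset("plant", "small potted plant decor", (0.5, 0.5))]
plant = retrieve("a plant on a small table", catalogue)
scene = [Placement(catalogue[0], 2.0, 2.0), Placement(plant, 2.0, 2.0)]
refine(scene)
assert not overlaps(scene[0], scene[1])
```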