Modern text-to-image systems built on diffusion or flow-matching models typically produce images in a single, unified generation process—refining noise step by step, but still treating the scene as one holistic output. New research suggests that stronger results emerge when image creation is broken into explicit stages of planning, generation, verification, and correction.
What’s new: Lei Zhang and collaborators from Meta, the University of California San Diego, Worcester Polytechnic Institute, and Northwestern University propose a fine-tuning approach that teaches image generators to construct visuals incrementally. Instead of producing an image in one pass, the model learns to iteratively plan an element, generate it, evaluate whether it matches the prompt, and revise or extend the scene before moving on to the next component.
Key insight: A persistent weakness in text-to-image systems is spatial and compositional reasoning. Models often struggle with relationships such as “above,” “behind,” or “in front of,” and with attribute consistency, such as rendering the correct number of limbs, fingers, or objects.
The proposed method reframes image generation as a controlled loop, where the model builds a scene piece by piece. Given a prompt like “a bear hovering above a silver spoon,” the process unfolds in stages:
Plan: The model outlines the next modification and predicts the intermediate scene (e.g., first introduce the bear, then position a spoon beneath it).
Sketch: A partial image is generated reflecting the current stage of construction.
Inspect: The system evaluates whether the instruction, intermediate description, and image align with the original prompt.
Refine: If inconsistencies appear, the model issues corrective instructions (for example, repositioning the spoon under the bear rather than in front of it) and regenerates the updated scene.
By training on examples structured in this format, the model learns not only to generate images, but to progressively assemble and self-correct them.
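The plan, sketch, inspect, refine cycle described above can be sketched in Python as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_scene`, `toy_plan`, `toy_sketch`, `toy_inspect`, and `toy_refine` are hypothetical, an "image" is stood in for by a list of strings, and the `max_steps` and `max_fixes` limits are invented for the example. In the actual system a single fine-tuned model plays all four roles.

```python
from typing import Callable, List, Optional

# Toy stand-in: an "image" is a list of placed-object descriptions (assumption).
Image = List[str]


def build_scene(
    prompt: str,
    plan: Callable[[str, Image], Optional[str]],
    sketch: Callable[[Image, str], Image],
    inspect: Callable[[str, str, Image], Optional[str]],
    refine: Callable[[Image, str], Image],
    max_steps: int = 8,   # hypothetical safety limit on construction steps
    max_fixes: int = 2,   # hypothetical limit on corrections per step
) -> Image:
    """Assemble a scene step by step, correcting each step before moving on."""
    image: Image = []
    for _ in range(max_steps):
        instruction = plan(prompt, image)        # Plan: decide the next modification
        if instruction is None:                  # nothing left to add
            break
        image = sketch(image, instruction)       # Sketch: produce the partial image
        for _ in range(max_fixes):
            critique = inspect(prompt, instruction, image)  # Inspect: check alignment
            if critique is None:
                break
            image = refine(image, critique)      # Refine: apply the correction
    return image


# Toy stages for the prompt "a bear hovering above a silver spoon".
def toy_plan(prompt: str, image: Image) -> Optional[str]:
    if "bear" not in image:
        return "add a bear"
    if not any("spoon" in item for item in image):
        return "add a silver spoon below the bear"
    return None


def toy_sketch(image: Image, instruction: str) -> Image:
    if "spoon" in instruction:
        # Deliberate mistake so the inspect/refine path is exercised.
        return image + ["spoon in front of bear"]
    return image + ["bear"]


def toy_inspect(prompt: str, instruction: str, image: Image) -> Optional[str]:
    if "spoon in front of bear" in image:
        return "move the spoon below the bear"
    return None


def toy_refine(image: Image, critique: str) -> Image:
    return ["spoon below bear" if item == "spoon in front of bear" else item
            for item in image]


final = build_scene(
    "a bear hovering above a silver spoon",
    toy_plan, toy_sketch, toy_inspect, toy_refine,
)
print(final)
```

The toy sketch stage misplaces the spoon on purpose, so the run shows one inspection failure followed by one correction before the loop ends.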
How it works: The authors build on BAGEL-7B, a pretrained multimodal model capable of processing both images and text and outputting updated images alongside textual descriptions of modifications. They fine-tune it so that image generation becomes a cyclical process of planning, sketching, inspecting, and refining.
Fine-tuning for planning and sketching: To train the first stages, the researchers constructed a dataset of 32,000 examples, each containing several intermediate images and a final output. They used GPT-4o to convert prompts into structured scene graphs, representing objects (such as “cat” or “bear”), attributes (“furry”), and relationships (“cat is furry”). Subsets of these graphs were then expanded into stepwise prompts describing incremental additions to a scene.
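To make the scene-graph idea concrete, here is a small Python sketch of one possible representation and a rule-based expansion into stepwise prompts. The `SceneGraph` class, its field layout, and the `stepwise_prompts` function are hypothetical; the article says GPT-4o performed both the graph extraction and the expansion, so this deterministic version only illustrates the shape of the data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container for objects, attributes, and relationships."""
    objects: List[str]
    attributes: List[Tuple[str, str]] = field(default_factory=list)      # (object, attribute)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)


def stepwise_prompts(graph: SceneGraph) -> List[str]:
    """Add one object per step, mentioning only relations to objects already placed."""
    steps: List[str] = []
    placed: List[str] = []
    for obj in graph.objects:
        attrs = [attr for owner, attr in graph.attributes if owner == obj]
        phrase = " ".join(attrs + [obj])
        rels = [
            f"{pred} the {other}"
            for subj, pred, other in graph.relations
            if subj == obj and other in placed
        ]
        suffix = " " + " and ".join(rels) if rels else ""
        steps.append(f"add a {phrase}{suffix}")
        placed.append(obj)
    return steps


graph = SceneGraph(
    objects=["bear", "spoon"],
    attributes=[("spoon", "silver")],
    relations=[("spoon", "below", "bear")],
)
steps = stepwise_prompts(graph)
print(steps)
```

Each step only refers to objects that already exist in the partial scene, which is what lets every intermediate image be rendered and checked on its own.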
Each step was rendered using FLUX.1 Kontext, producing a sequence of images that gradually built toward the final composition. Only examples judged consistent with the prompt by GPT-4o were retained. The model was then trained to predict both the next textual instruction and the corresponding image updates via flow-matching steps.
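For readers unfamiliar with flow matching, the sketch below shows the generic linear-path formulation of a single training example: a point on the straight line between a noise sample and a target, paired with the velocity a model would be trained to predict. This is an assumption about the general technique, not a confirmed description of the authors' exact objective, and the function names are hypothetical. Plain Python lists stand in for image latents.

```python
from typing import List, Tuple


def flow_matching_example(
    noise: List[float], target: List[float], t: float
) -> Tuple[List[float], List[float]]:
    """Return the interpolated sample x_t and the regression target (velocity).

    Assumes the common linear path x_t = (1 - t) * noise + t * target,
    whose velocity is target - noise at every t.
    """
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, target)]
    velocity = [x - n for n, x in zip(noise, target)]
    return x_t, velocity


def mse(pred: List[float], truth: List[float]) -> float:
    """Mean squared error between a predicted and a true velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, truth)) / len(truth)


x_t, velocity = flow_matching_example(noise=[0.0, 2.0], target=[1.0, 0.0], t=0.25)
loss_if_perfect = mse(velocity, velocity)      # a perfect prediction gives zero loss
loss_if_zero = mse([0.0, 0.0], velocity)       # predicting "no motion" is penalized
print(x_t, velocity, loss_if_perfect, loss_if_zero)
```

At generation time, a model trained this way is integrated over a number of steps from noise toward an image, which is what the flow-matching step counts in the results refer to.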
Fine-tuning for inspection: To train the evaluation stage, the partially trained model was used to generate intermediate outputs. GPT-4o then assessed whether intermediate descriptions and images conflicted with the original prompt. It produced both critiques and corrective instructions, resulting in a dataset of roughly 7,000 consistent examples and 8,300 inconsistent ones. Learning to replicate these judgments enabled the model to distinguish between valid and flawed intermediate compositions.
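One way to picture a record in this inspection dataset is sketched below. The `InspectionExample` class, its field names, and the text format returned by `target_text` are all hypothetical, since the article does not specify the schema. The final lines compute the class balance from the approximate counts given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InspectionExample:
    """Hypothetical record: one intermediate step plus the judge's verdict."""
    prompt: str
    step_description: str
    image_id: str
    consistent: bool
    critique: Optional[str] = None     # only set for inconsistent examples
    correction: Optional[str] = None   # corrective instruction, if any

    def target_text(self) -> str:
        """Text the model would learn to emit for this example (assumed format)."""
        if self.consistent:
            return "consistent"
        return f"inconsistent: {self.critique} -> {self.correction}"


example = InspectionExample(
    prompt="a bear hovering above a silver spoon",
    step_description="a silver spoon in front of a bear",
    image_id="step-2",
    consistent=False,
    critique="the spoon is in front of the bear",
    correction="move the spoon below the bear",
)
target = example.target_text()

# Class balance from the approximate counts reported in the article.
n_consistent, n_inconsistent = 7_000, 8_300
inconsistent_share = round(n_inconsistent / (n_consistent + n_inconsistent), 2)
print(target, inconsistent_share)
```

With these counts, slightly more than half of the examples are inconsistent, so the model sees both verdicts in comparable proportion.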
Fine-tuning for refinement: A third dataset focused on improvement cycles, pairing images with textual feedback on how they could be enhanced, along with improved versions of those images. The model learned to interpret critique and translate it into visual corrections.
The three datasets were then combined in a single, unified fine-tuning run, with the same loss functions applied across all stages.
Results: The staged-generation approach improved BAGEL-7B’s ability to produce images that match textual prompts, particularly in scenarios requiring precise spatial relationships and object interactions.
On GenEval, which measures how completely generated images reflect prompt details, performance rose from 77% to 83% after training on 62,000 examples using 131 flow-matching steps. By comparison, PARM—a method relying on iterative critique of diffusion states—reached 77% despite using 688,000 examples and 1,000 flow-matching steps.
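The comparison with PARM is easier to read as ratios. The short Python calculation below uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the article (GenEval scores in percentage points).
baseline_score = 77
staged = {"score": 83, "examples": 62_000, "steps": 131}
parm = {"score": 77, "examples": 688_000, "steps": 1_000}

gain_over_baseline = staged["score"] - baseline_score          # percentage points
gain_over_parm = staged["score"] - parm["score"]               # percentage points
data_ratio = round(parm["examples"] / staged["examples"], 1)   # how much more data PARM used
step_ratio = round(parm["steps"] / staged["steps"], 1)         # how many more steps PARM used

print(gain_over_baseline, gain_over_parm, data_ratio, step_ratio)
```

In other words, the staged approach scored 6 points higher than PARM while using roughly one-eleventh of the training examples and under one-seventh of the flow-matching steps.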
On WISE, which evaluates realism, aesthetics, and prompt consistency, scores improved from 0.70 to 0.76. The model also showed stronger alignment with contextual constraints, more accurately placing scenes in correct historical or temporal settings. In domain-specific tests, it generated more chemically plausible laboratory scenes and structures.
Why it matters: Text-to-image models often produce visually convincing results that subtly violate the prompt’s instructions. This work suggests that improving reliability may depend less on scaling model size and more on restructuring the generation process itself—embedding verification and correction directly into how images are built.
We’re thinking: The approach mirrors reasoning strategies in language models. Just as step-by-step deliberation improves complex text outputs, staged visual construction appears to help models reason about space, relationships, and consistency—turning image generation into something closer to iterative problem-solving than one-shot creation.



