ControlNet Prompting
Use structural conditioning inputs — edge maps, depth maps, pose skeletons, and segmentation masks — to precisely control AI image generation beyond what text prompts alone can achieve.
Introduced: ControlNet was proposed by Zhang et al. in February 2023 in “Adding Conditional Control to Text-to-Image Diffusion Models.” It adds spatial conditioning to diffusion models by accepting control images — edge maps, depth maps, pose skeletons, segmentation maps — alongside text prompts. This solved a fundamental limitation of text-only prompts: describing exact spatial layout, pose, or composition in words alone is extremely difficult. A phrase like “a person with their left arm raised at a 45-degree angle” gives the model only vague geometric guidance, whereas a pose skeleton communicates exact joint positions unambiguously.
Modern Status: ControlNet is now integrated into most Stable Diffusion workflows and the concept of structural conditioning has influenced commercial models including Adobe Firefly and Midjourney. The architecture has become a standard component in professional image generation pipelines, with community-developed models supporting dozens of conditioning types. ControlNet’s influence extends beyond its original implementation — the principle that visual structure should be communicated through visual inputs rather than text descriptions is now a foundational concept in AI image generation.
Structural Control Beyond Words
Text prompts describe what to generate; ControlNet inputs describe where and how to arrange it. The core insight is that some visual properties — exact pose, edge structure, depth relationships, spatial composition — are nearly impossible to specify in text but trivial to communicate through a reference image or structural map.
Consider trying to describe the exact layout of a city skyline in words: which buildings are taller, how they overlap, where the horizon sits, how the foreground relates to the background. A depth map communicates all of this spatial information in a single image. ControlNet bridges this gap by accepting a conditioning image that constrains the spatial structure while the text prompt controls style, content, and atmosphere.
Think of it like an architect’s blueprint: the structural drawing defines where walls, doors, and windows go, while a separate specification describes materials, colors, and finishes. ControlNet separates spatial structure from aesthetic content in the same way.
Text prompts are inherently one-dimensional — a sequence of words describing a two-dimensional (or three-dimensional) scene. No matter how precise your language, words cannot encode exact pixel-level spatial relationships. ControlNet adds a second input channel that speaks the language of visual structure directly: edges define boundaries, depth maps encode distance, pose skeletons fix human positioning, and segmentation maps assign regions to semantic categories. Together with text, this gives the generator both the “what” and the “where.”
The ControlNet Process
Four stages from structural reference to controlled generation
Choose Control Type
Select the conditioning method that best matches the spatial property you need to control. Each type communicates different structural information to the model. Canny edges preserve outlines and boundaries. Depth maps encode foreground-background relationships. OpenPose skeletons define human body positioning. Segmentation maps assign semantic regions. Scribble and line art provide loose compositional guidance.
For an architectural rendering where the building shape must be exact, choose Canny edge detection. For a portrait where the subject’s pose matters but not edge detail, choose OpenPose.
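The selection step can be captured as a simple lookup from the spatial property you need to control to a preprocessor and a conditioning model. The checkpoint names below are the widely used community ControlNet checkpoints for Stable Diffusion 1.5 (the `lllyasviel/sd-controlnet-*` family); treat them as illustrative and verify against your model hub before use.

```python
# Map each structural need to (preprocessor, community checkpoint).
# Checkpoint IDs are the common SD 1.5 ControlNet models; verify
# availability on your model hub before relying on them.
CONTROL_TYPES = {
    "exact outlines and boundaries": ("Canny edge detection", "lllyasviel/sd-controlnet-canny"),
    "foreground-background layering": ("monocular depth estimation", "lllyasviel/sd-controlnet-depth"),
    "human body positioning": ("OpenPose keypoint extraction", "lllyasviel/sd-controlnet-openpose"),
    "semantic region layout": ("semantic segmentation", "lllyasviel/sd-controlnet-seg"),
    "loose compositional guidance": ("scribble / line art", "lllyasviel/sd-controlnet-scribble"),
}

def pick_control(need: str) -> str:
    """Return a short recommendation for a given structural need."""
    preprocessor, checkpoint = CONTROL_TYPES[need]
    return f"Preprocess with {preprocessor}; condition with {checkpoint}"

print(pick_control("human body positioning"))
```

For the portrait example above, the lookup points at OpenPose preprocessing with the pose-conditioned checkpoint; for the architectural example, it points at Canny.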
Prepare Control Image
Generate or provide the structural reference image that will guide the generation process. This can be extracted from an existing photograph using preprocessing tools (such as running a Canny edge detector on a photo), drawn by hand as a rough sketch, or created from 3D software. The control image does not need to be photorealistic — it only needs to encode the structural information relevant to your chosen control type.
Take a photograph of a building, run Canny edge detection to extract the structural outline, and use that edge map as the conditioning input for generating the same building in different architectural styles.
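The extraction step above can be sketched in a few lines. Real workflows use a proper Canny detector (for example OpenCV's `cv2.Canny`); the gradient-magnitude version below is a simplified NumPy stand-in that produces the same kind of binary edge map a Canny-type ControlNet expects as its conditioning image.

```python
import numpy as np

def simple_edge_map(gray: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Gradient-magnitude edge map: a simplified stand-in for Canny.

    `gray` is a 2-D float array in [0, 1]; returns a binary edge map
    of the same shape (1 = edge pixel, 0 = background).
    """
    # Vertical and horizontal intensity gradients via finite differences.
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    # Normalize and threshold to a binary map.
    if magnitude.max() > 0:
        magnitude = magnitude / magnitude.max()
    return (magnitude > threshold).astype(np.uint8)

# Toy "photo": a dark background with a bright rectangular "building".
photo = np.zeros((64, 64))
photo[16:48, 20:44] = 1.0

edges = simple_edge_map(photo)  # edge pixels trace the rectangle's outline
print(edges.sum(), "edge pixels")
```

The resulting map carries only structural information — outlines, no texture or color — which is exactly the property that lets the same edge map drive many different stylistic generations.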
Write Text Prompt
Compose the text prompt to describe the desired content, style, and quality attributes. Since the spatial structure is already handled by the control image, the text prompt can focus entirely on aesthetic and content concerns: materials, lighting, color palette, artistic style, mood, and level of detail. This separation of concerns typically produces better results than trying to describe everything in text.
“A modernist glass and steel skyscraper, golden hour lighting, photorealistic, 8k, architectural photography, dramatic sky” — the text handles style while the edge map handles structure.
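The separation of concerns can be made explicit with a small prompt-assembly helper. This is purely illustrative — `build_prompt` is a hypothetical convenience function, not part of any library — but it shows how the prompt reduces to content, style, and quality descriptors once layout is delegated to the control image.

```python
def build_prompt(content: str, style: list[str], quality: list[str]) -> str:
    """Assemble a ControlNet-friendly text prompt.

    With spatial layout handled by the control image, the prompt only
    needs content, style, and quality terms, joined in that order.
    (Hypothetical helper for illustration.)
    """
    return ", ".join([content, *style, *quality])

prompt = build_prompt(
    content="a modernist glass and steel skyscraper",
    style=["golden hour lighting", "dramatic sky"],
    quality=["photorealistic", "8k", "architectural photography"],
)
print(prompt)
```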
Adjust Control Strength
Balance between structural fidelity and creative freedom by adjusting the control weight parameter. At full strength (1.0), the output closely follows the conditioning image’s structure. At lower values (0.3–0.6), the model has more creative latitude to deviate from the reference while still being influenced by it. Finding the right balance depends on how strictly the output must match the control input versus how much artistic interpretation is desired.
An architectural client rendering might use 0.9 strength for precise structural fidelity. A concept art piece inspired by a rough sketch might use 0.4 strength for a looser, more interpretive result.
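Conceptually, the strength parameter scales the residual features that ControlNet's trainable copy adds to the frozen backbone (in diffusers this knob is exposed as `controlnet_conditioning_scale`). The toy sketch below illustrates the mechanism with plain arrays — it is a conceptual model, not the actual network math.

```python
import numpy as np

def apply_control(base_features: np.ndarray,
                  control_residual: np.ndarray,
                  scale: float) -> np.ndarray:
    """Conceptual sketch of control strength.

    ControlNet produces residual features that are added to the frozen
    backbone's features; the strength parameter scales that residual
    before the addition. At 0.0 the control image is ignored; at 1.0
    it exerts full influence.
    """
    return base_features + scale * control_residual

base = np.ones((4, 4))
residual = np.full((4, 4), 2.0)

full = apply_control(base, residual, 1.0)   # strict structural fidelity
loose = apply_control(base, residual, 0.4)  # looser, more interpretive
print(full[0, 0], loose[0, 0])  # 3.0 vs 1.8
```

Lowering the scale shrinks the control signal's contribution relative to the backbone's own features, which is why mid-range values (0.3–0.6) let the text prompt and the model's priors reassert themselves.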
See the Difference
How structural conditioning transforms generation control
Text-Only Prompt
“A person dancing in a studio, professional photography, dynamic pose, studio lighting”
Random pose selected by the model. Unpredictable body positioning. The dancer might face any direction, with arms and legs in any configuration. Each generation produces a completely different composition. No way to specify the exact pose you need.
Text + ControlNet
“A person dancing in a studio, professional photography, dynamic pose, studio lighting” combined with an OpenPose skeleton showing exact joint positions for an arabesque pose.
Exact arabesque pose preserved from the skeleton reference. Arms, legs, torso, and head positioned precisely as specified. Style, lighting, and quality follow the text prompt. Every regeneration maintains the same pose while varying other aesthetic details.
ControlNet in Action
Real-world applications of structural conditioning
A Canny edge map extracted from a building outline — capturing the precise contours of rooflines, windows, doors, and structural elements without any texture or color information. The edge map preserves the exact proportions and spatial relationships of the architectural form.
“A contemporary residential building, warm brick facade, large floor-to-ceiling windows, surrounded by mature oak trees, golden hour sunlight, photorealistic architectural visualization, 8k resolution”
The generated image preserves the exact building shape, window placement, and proportions from the edge map while applying the warm brick material, surrounding landscape, and golden-hour lighting described in the text. The same edge map can be reused with different text prompts to visualize the building in various material finishes, seasons, or lighting conditions — all maintaining identical structural geometry.
A series of OpenPose skeleton keyframes showing a character in different action positions: standing, walking, sitting, and reaching. Each skeleton defines the exact joint positions for that particular frame of the sequence.
“A young woman in a red jacket and dark jeans, short brown hair, neutral background, consistent character design, illustration style, clean lines”
Each generated image shows the same character description applied to a different body position. The pose skeletons ensure the character’s positioning is exactly controlled across all frames, while the consistent text prompt maintains visual identity. This technique is essential for storyboarding, animation previsualization, and character design sheets where the same character must appear in multiple poses.
A depth map showing clear foreground-background separation: bright values in the foreground (flowers and rocks), medium values in the midground (a lake and trees), and dark values in the background (distant mountains and sky). The depth gradient controls the spatial layering of the entire scene.
“A serene mountain landscape at sunrise, wildflowers in the foreground, crystal-clear alpine lake in the middle distance, snow-capped peaks in the background, atmospheric perspective, landscape photography, dramatic lighting”
The depth map ensures that foreground elements are sharply defined and placed correctly in the near field, the lake occupies the precise midground region, and mountains recede into the background with appropriate atmospheric haze. Without depth conditioning, the model might place mountains in the foreground or flatten the entire scene. The depth map guarantees correct spatial layering regardless of how the model interprets the text.
When to Use ControlNet
Best for scenarios requiring precise spatial control over generated images
Perfect For
When the exact position, shape, or layout of elements in the generated image must match a specific reference — not just “approximately right” but structurally accurate.
Maintaining identical body positioning across multiple generations — essential for animation keyframes, storyboards, and character design sheets.
Generating photorealistic renderings from structural plans while preserving exact building geometry, window placement, and proportions.
When you have a reference image whose spatial composition must be preserved but rendered in a completely different style, medium, or context.
Skip It When
When the goal is to let the model surprise you with unexpected compositions — ControlNet constrains the output, which is counterproductive for open-ended exploration.
If you do not have a reference image, sketch, or 3D model to derive a control input from, there is nothing to condition on — text-only prompting is the appropriate approach.
Not all image generation platforms support ControlNet inputs. Commercial services such as DALL-E and Midjourney's standard interface do not currently accept ControlNet-style control images.
Use Cases
Where ControlNet delivers the most value
Fashion Design Prototyping
Use pose skeletons to position models in standard fashion poses, then iterate on garment designs, fabrics, and color palettes through text prompts while maintaining identical body positioning across all variations.
Animation Keyframing
Extract pose skeletons from keyframes or create them manually, then generate fully rendered frames for each key position. A consistent character prompt applied across every keyframe maintains visual identity throughout the animation sequence.
Interior Design Visualization
Use depth maps or segmentation maps from existing room photographs to generate design variations. The spatial layout — wall positions, ceiling height, window locations — stays fixed while furniture, materials, and color schemes change through text prompts.
Product Placement Mockups
Extract edge maps or depth maps from product photography scenes, then swap products into pre-composed environments. The scene composition remains stable while different products or packaging designs are tested in context.
Medical Illustration from Scans
Convert medical imaging data into structural conditioning inputs, then generate clear educational illustrations that preserve anatomically accurate spatial relationships while making complex structures visually accessible for patients and students.
Game Asset Generation
Use depth maps or segmentation masks from 3D scene layouts to generate textured environment concepts. Level designers can define spatial structure in a 3D tool and use ControlNet to rapidly explore visual styles for each environment zone.
Where ControlNet Fits
ControlNet bridges text-only generation and fully guided multi-modal pipelines
Advanced workflows stack multiple ControlNet models simultaneously. A single generation can be conditioned on both a depth map (for spatial layering) and a pose skeleton (for character positioning), with each control type influencing a different aspect of the final output. This multi-control approach provides granular command over complex scenes that would be impossible to describe in text alone, bringing AI image generation closer to the level of control available in traditional 3D rendering pipelines.