ControlNet Prompting
Use structural conditioning inputs — edge maps, depth maps, pose skeletons, and segmentation masks — to precisely control AI image generation beyond what text prompts alone can achieve.
Introduced: ControlNet was proposed by Zhang et al. in February 2023 in “Adding Conditional Control to Text-to-Image Diffusion Models.” It adds spatial conditioning to diffusion models by accepting control images — edge maps, depth maps, pose skeletons, segmentation maps — alongside text prompts. This solved a fundamental limitation of text-only prompts: describing exact spatial layout, pose, or composition in words alone is extremely difficult. A phrase like “a person with their left arm raised at a 45-degree angle” gives the model only vague geometric guidance, whereas a pose skeleton communicates exact joint positions unambiguously.
Modern Status: ControlNet is now integrated into most Stable Diffusion workflows and the concept of structural conditioning has influenced commercial models including Adobe Firefly and Midjourney. The architecture has become a standard component in professional image generation pipelines, with community-developed models supporting dozens of conditioning types. ControlNet’s influence extends beyond its original implementation — the principle that visual structure should be communicated through visual inputs rather than text descriptions is now a foundational concept in AI image generation.
Structural Control Beyond Words
Text prompts describe what to generate; ControlNet inputs describe where and how to arrange it. The core insight is that some visual properties — exact pose, edge structure, depth relationships, spatial composition — are nearly impossible to specify in text but trivial to communicate through a reference image or structural map.
Consider trying to describe the exact layout of a city skyline in words: which buildings are taller, how they overlap, where the horizon sits, how the foreground relates to the background. A depth map communicates all of this spatial information in a single image. ControlNet bridges this gap by accepting a conditioning image that constrains the spatial structure while the text prompt controls style, content, and atmosphere.
Think of it like an architect’s blueprint: the structural drawing defines where walls, doors, and windows go, while a separate specification describes materials, colors, and finishes. ControlNet separates spatial structure from aesthetic content in the same way.
Text prompts are inherently one-dimensional — a sequence of words describing a two-dimensional (or three-dimensional) scene. No matter how precise your language, words cannot encode exact pixel-level spatial relationships. ControlNet adds a second input channel that speaks the language of visual structure directly: edges define boundaries, depth maps encode distance, pose skeletons fix human positioning, and segmentation maps assign regions to semantic categories. Together with text, this gives the generator both the “what” and the “where.”
The ControlNet Process
Four stages from structural reference to controlled generation
Choose Control Type
Select the conditioning method that best matches the spatial property you need to control. Each type communicates different structural information to the model. Canny edges preserve outlines and boundaries. Depth maps encode foreground-background relationships. OpenPose skeletons define human body positioning. Segmentation maps assign semantic regions. Scribble and line art provide loose compositional guidance.
For an architectural rendering where the building shape must be exact, choose Canny edge detection. For a portrait where the subject’s pose matters but not edge detail, choose OpenPose.
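The selection step can be captured as a simple lookup from the spatial property you need to control to a preprocessor and a conditioning model. The checkpoint names below are the widely used community ControlNet checkpoints for Stable Diffusion 1.5 (the `lllyasviel/sd-controlnet-*` family); treat them as illustrative and verify against your model hub before use.

```python
# Map each structural need to (preprocessor, community checkpoint).
# Checkpoint IDs are the common SD 1.5 ControlNet models; verify
# availability on your model hub before relying on them.
CONTROL_TYPES = {
    "exact outlines and boundaries": ("Canny edge detection", "lllyasviel/sd-controlnet-canny"),
    "foreground-background layering": ("monocular depth estimation", "lllyasviel/sd-controlnet-depth"),
    "human body positioning": ("OpenPose keypoint extraction", "lllyasviel/sd-controlnet-openpose"),
    "semantic region layout": ("semantic segmentation", "lllyasviel/sd-controlnet-seg"),
    "loose compositional guidance": ("scribble / line art", "lllyasviel/sd-controlnet-scribble"),
}

def pick_control(need: str) -> str:
    """Return a short recommendation for a given structural need."""
    preprocessor, checkpoint = CONTROL_TYPES[need]
    return f"Preprocess with {preprocessor}; condition with {checkpoint}"

print(pick_control("human body positioning"))
```

For the portrait example above, the lookup points at OpenPose preprocessing with the pose-conditioned checkpoint; for the architectural example, it points at Canny.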
Prepare Control Image
Generate or provide the structural reference image that will guide the generation process. This can be extracted from an existing photograph using preprocessing tools (such as running a Canny edge detector on a photo), drawn by hand as a rough sketch, or created from 3D software. The control image does not need to be photorealistic — it only needs to encode the structural information relevant to your chosen control type.
Take a photograph of a building, run Canny edge detection to extract the structural outline, and use that edge map as the conditioning input for generating the same building in different architectural styles.
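The extraction step above can be sketched in a few lines. Real workflows use a proper Canny detector (for example OpenCV's `cv2.Canny`); the gradient-magnitude version below is a simplified NumPy stand-in that produces the same kind of binary edge map a Canny-type ControlNet expects as its conditioning image.

```python
import numpy as np

def simple_edge_map(gray: np.ndarray, threshold: float = 0.25) -> np.ndarray:
    """Gradient-magnitude edge map: a simplified stand-in for Canny.

    `gray` is a 2-D float array in [0, 1]; returns a binary edge map
    of the same shape (1 = edge pixel, 0 = background).
    """
    # Vertical and horizontal intensity gradients via finite differences.
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    # Normalize and threshold to a binary map.
    if magnitude.max() > 0:
        magnitude = magnitude / magnitude.max()
    return (magnitude > threshold).astype(np.uint8)

# Toy "photo": a dark background with a bright rectangular "building".
photo = np.zeros((64, 64))
photo[16:48, 20:44] = 1.0

edges = simple_edge_map(photo)  # edge pixels trace the rectangle's outline
print(edges.sum(), "edge pixels")
```

The resulting map carries only structural information — outlines, no texture or color — which is exactly the property that lets the same edge map drive many different stylistic generations.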
Write Text Prompt
Compose the text prompt to describe the desired content, style, and quality attributes. Since the spatial structure is already handled by the control image, the text prompt can focus entirely on aesthetic and content concerns: materials, lighting, color palette, artistic style, mood, and level of detail. This separation of concerns typically produces better results than trying to describe everything in text.
“A modernist glass and steel skyscraper, golden hour lighting, photorealistic, 8k, architectural photography, dramatic sky” — the text handles style while the edge map handles structure.
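The separation of concerns can be made explicit with a small prompt-assembly helper. This is purely illustrative — `build_prompt` is a hypothetical convenience function, not part of any library — but it shows how the prompt reduces to content, style, and quality descriptors once layout is delegated to the control image.

```python
def build_prompt(content: str, style: list[str], quality: list[str]) -> str:
    """Assemble a ControlNet-friendly text prompt.

    With spatial layout handled by the control image, the prompt only
    needs content, style, and quality terms, joined in that order.
    (Hypothetical helper for illustration.)
    """
    return ", ".join([content, *style, *quality])

prompt = build_prompt(
    content="a modernist glass and steel skyscraper",
    style=["golden hour lighting", "dramatic sky"],
    quality=["photorealistic", "8k", "architectural photography"],
)
print(prompt)
```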
Adjust Control Strength
Balance between structural fidelity and creative freedom by adjusting the control weight parameter. At full strength (1.0), the output closely follows the conditioning image’s structure. At lower values (0.3–0.6), the model has more creative latitude to deviate from the reference while still being influenced by it. Finding the right balance depends on how strictly the output must match the control input versus how much artistic interpretation is desired.
An architectural client rendering might use 0.9 strength for precise structural fidelity. A concept art piece inspired by a rough sketch might use 0.4 strength for a looser, more interpretive result.
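Conceptually, the strength parameter scales the residual features that ControlNet's trainable copy adds to the frozen backbone (in diffusers this knob is exposed as `controlnet_conditioning_scale`). The toy sketch below illustrates the mechanism with plain arrays — it is a conceptual model, not the actual network math.

```python
import numpy as np

def apply_control(base_features: np.ndarray,
                  control_residual: np.ndarray,
                  scale: float) -> np.ndarray:
    """Conceptual sketch of control strength.

    ControlNet produces residual features that are added to the frozen
    backbone's features; the strength parameter scales that residual
    before the addition. At 0.0 the control image is ignored; at 1.0
    it exerts full influence.
    """
    return base_features + scale * control_residual

base = np.ones((4, 4))
residual = np.full((4, 4), 2.0)

full = apply_control(base, residual, 1.0)   # strict structural fidelity
loose = apply_control(base, residual, 0.4)  # looser, more interpretive
print(full[0, 0], loose[0, 0])  # 3.0 vs 1.8
```

Lowering the scale shrinks the control signal's contribution relative to the backbone's own features, which is why mid-range values (0.3–0.6) let the text prompt and the model's priors reassert themselves.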
See the Difference
How structural conditioning transforms generation control
Text-Only Prompt
“A person dancing in a studio, professional photography, dynamic pose, studio lighting”
Random pose selected by the model. Unpredictable body positioning. The dancer might face any direction, with arms and legs in any configuration. Each generation produces a completely different composition. No way to specify the exact pose you need.
Text + ControlNet
“A person dancing in a studio, professional photography, dynamic pose, studio lighting” combined with an OpenPose skeleton showing exact joint positions for an arabesque pose.
Exact arabesque pose preserved from the skeleton reference. Arms, legs, torso, and head positioned precisely as specified. Style, lighting, and quality follow the text prompt. Every regeneration maintains the same pose while varying other aesthetic details.
ControlNet in Action
Real-world applications of structural conditioning
A Canny edge map extracted from a building outline — capturing the precise contours of rooflines, windows, doors, and structural elements without any texture or color information. The edge map preserves the exact proportions and spatial relationships of the architectural form.
“A contemporary residential building, warm brick facade, large floor-to-ceiling windows, surrounded by mature oak trees, golden hour sunlight, photorealistic architectural visualization, 8k resolution”
The generated image preserves the exact building shape, window placement, and proportions from the edge map while applying the warm brick material, surrounding landscape, and golden-hour lighting described in the text. The same edge map can be reused with different text prompts to visualize the building in various material finishes, seasons, or lighting conditions — all maintaining identical structural geometry.
A series of OpenPose skeleton keyframes showing a character in different action positions: standing, walking, sitting, and reaching. Each skeleton defines the exact joint positions for that particular frame of the sequence.
“A young woman in a red jacket and dark jeans, short brown hair, neutral background, consistent character design, illustration style, clean lines”
Each generated image shows the same character description applied to a different body position. The pose skeletons ensure the character’s positioning is exactly controlled across all frames, while the consistent text prompt maintains visual identity. This technique is essential for storyboarding, animation previsualization, and character design sheets where the same character must appear in multiple poses.
A depth map showing clear foreground-background separation: bright values in the foreground (flowers and rocks), medium values in the midground (a lake and trees), and dark values in the background (distant mountains and sky). The depth gradient controls the spatial layering of the entire scene.
“A serene mountain landscape at sunrise, wildflowers in the foreground, crystal-clear alpine lake in the middle distance, snow-capped peaks in the background, atmospheric perspective, landscape photography, dramatic lighting”
The depth map ensures that foreground elements are sharply defined and placed correctly in the near field, the lake occupies the precise midground region, and mountains recede into the background with appropriate atmospheric haze. Without depth conditioning, the model might place mountains in the foreground or flatten the entire scene. The depth map guarantees correct spatial layering regardless of how the model interprets the text.
When to Use ControlNet
Best for scenarios requiring precise spatial control over generated images
Perfect For
When the exact position, shape, or layout of elements in the generated image must match a specific reference — not just “approximately right” but structurally accurate.
Maintaining identical body positioning across multiple generations — essential for animation keyframes, storyboards, and character design sheets.
Generating photorealistic renderings from structural plans while preserving exact building geometry, window placement, and proportions.
When you have a reference image whose spatial composition must be preserved but rendered in a completely different style, medium, or context.
Skip It When
When the goal is to let the model surprise you with unexpected compositions — ControlNet constrains the output, which is counterproductive for open-ended exploration.
If you do not have a reference image, sketch, or 3D model to derive a control input from, there is nothing to condition on — text-only prompting is the appropriate approach.
Not all image generation platforms support ControlNet inputs. Commercial services such as DALL-E and Midjourney's standard interface do not currently accept ControlNet-style control images.
Use Cases
Where ControlNet delivers the most value
Fashion Design Prototyping
Use pose skeletons to position models in standard fashion poses, then iterate on garment designs, fabrics, and color palettes through text prompts while maintaining identical body positioning across all variations.
Animation Keyframing
Extract pose skeletons from keyframes or create them manually, then generate fully rendered frames for each key position. A consistent character prompt applied across every keyframe maintains visual identity throughout the animation sequence.
Interior Design Visualization
Use depth maps or segmentation maps from existing room photographs to generate design variations. The spatial layout — wall positions, ceiling height, window locations — stays fixed while furniture, materials, and color schemes change through text prompts.
Product Placement Mockups
Extract edge maps or depth maps from product photography scenes, then swap products into pre-composed environments. The scene composition remains stable while different products or packaging designs are tested in context.
Medical Illustration from Scans
Convert medical imaging data into structural conditioning inputs, then generate clear educational illustrations that preserve anatomically accurate spatial relationships while making complex structures visually accessible for patients and students.
Game Asset Generation
Use depth maps or segmentation masks from 3D scene layouts to generate textured environment concepts. Level designers can define spatial structure in a 3D tool and use ControlNet to rapidly explore visual styles for each environment zone.
Where ControlNet Fits
ControlNet bridges text-only generation and fully guided multi-modal pipelines
Advanced workflows stack multiple ControlNet models simultaneously. A single generation can be conditioned on both a depth map (for spatial layering) and a pose skeleton (for character positioning), with each control type influencing a different aspect of the final output. This multi-control approach provides granular command over complex scenes that would be impossible to describe in text alone, bringing AI image generation closer to the level of control available in traditional 3D rendering pipelines.