OPRO (Optimization by Prompting)
What if the best prompt for a task could be discovered automatically — by the LLM itself? OPRO treats the model as its own optimizer, describing the optimization problem in natural language and iteratively refining prompts based on scored results until it finds instructions that outperform strong human-written baselines.
Introduced: OPRO (Optimization by Prompting) was published in 2023 by Yang et al. at Google DeepMind. The technique demonstrated that LLMs can serve as optimizers when the optimization task is described in natural language. Rather than using gradient-based methods to tune prompts, OPRO presents the model with previously evaluated solutions and their scores, then asks it to propose new, better solutions. The approach discovered prompts that outperformed human-designed ones by up to 50% on the Big-Bench Hard benchmark suite.
Modern LLM Status: OPRO from Google DeepMind remains influential in 2026. The meta-optimization approach — using LLMs to optimize prompts — has been integrated into tools like DSPy and its MIPRO optimizer. The concept of iterative prompt refinement guided by performance metrics is now a core practice in prompt engineering. While the original paper focused on optimization via scoring, the broader principle that models can improve their own instructions has become foundational to automated prompt engineering workflows.
Let the Model Optimize Itself
Traditional prompt engineering relies on human intuition: you write a prompt, test it, tweak a word, test again. This manual cycle is slow, inconsistent, and limited by whatever phrasings a human can imagine. OPRO replaces this trial-and-error loop with a systematic optimization process where the LLM itself proposes better prompts.
The key mechanism is deceptively simple. You describe the optimization task in natural language: “Here are some prompts I tried and their accuracy scores. Generate a new prompt that will score higher.” The model sees the history of what worked and what didn’t, identifies patterns in high-scoring solutions, and proposes novel prompt phrasings that combine successful elements in new ways.
Think of it like a coach reviewing game film. Instead of guessing what play to call next, the coach studies which plays scored and which failed, then designs a new play that combines the winning elements — except here, the coach and the players are the same LLM.
Human prompt engineers are constrained by their vocabulary and assumptions about what “sounds right.” OPRO discovered that some of the highest-performing prompts used unexpected phrasings that no human would naturally write: the paper’s best-known discovery, “Take a deep breath and work on this problem step by step,” outscored the classic human-written “Let’s think step by step” on grade-school math problems. The model explores a prompt space unconstrained by human biases about what good instructions should look like, finding solutions in the gaps between human intuition.
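The whole loop fits in a few lines of Python. This is a toy illustration, not the paper’s implementation: `propose` stands in for the optimizer LLM call, `score` for the evaluation function, and the stubs below exist only so the sketch runs end to end.

```python
def opro_loop(propose, score, seed_prompts, iterations=20):
    """Minimal OPRO-style loop: keep a scored history, ask the
    proposer for a better prompt, score it, and repeat."""
    history = [(p, score(p)) for p in seed_prompts]
    for _ in range(iterations):
        history.sort(key=lambda ps: ps[1])   # ascending score, best last
        candidate = propose(history)         # optimizer LLM call in practice
        history.append((candidate, score(candidate)))
    return max(history, key=lambda ps: ps[1])  # best (prompt, score) pair

# Toy stand-ins so the sketch runs without a model: the "optimizer"
# appends a refinement phrase, and "accuracy" grows with prompt length.
def toy_propose(history):
    best_prompt, _ = max(history, key=lambda ps: ps[1])
    return best_prompt + " Verify your answer."

def toy_score(prompt):
    return min(0.95, 0.5 + 0.01 * len(prompt.split()))

best, best_score = opro_loop(toy_propose, toy_score,
                             ["Solve this math problem"], iterations=5)
```

In a real run, `propose` renders the scored history into a meta-prompt and calls the model, and `score` evaluates each candidate against a held-out example set.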
The OPRO Process
Four stages from optimization problem to optimal prompt
Define the Optimization Task
Describe the task you want optimized prompts for — this could be classification, reasoning, translation, or any well-defined objective. You need a scoring function that can evaluate how well a prompt performs on a representative set of examples.
“I need an instruction prompt that maximizes accuracy on grade-school math word problems. I have 100 test problems with known answers to score against.”
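A scoring function is just a loop over labeled examples. The sketch below assumes a hypothetical `ask` callable that sends the assembled prompt to a model; here it is stubbed with a lookup table so the example is self-contained.

```python
def score_prompt(prompt, examples, ask):
    """Fraction of examples answered correctly when `prompt` is
    prepended to each question. `ask(full_prompt)` wraps the LLM."""
    correct = 0
    for question, answer in examples:
        prediction = ask(f"{prompt}\n\nProblem: {question}")
        if prediction.strip() == answer.strip():
            correct += 1
    return correct / len(examples)

# Stub "model" answering from a lookup table, purely for illustration.
answers = {"2 + 3": "5", "10 - 4": "6", "3 * 3": "8"}  # last one is wrong
stub_ask = lambda full: answers.get(full.split("Problem: ")[1], "?")

examples = [("2 + 3", "5"), ("10 - 4", "6"), ("3 * 3", "9")]
acc = score_prompt("Solve:", examples, stub_ask)  # 2 of 3 correct
```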
Seed with Initial Solutions
Start with a few candidate prompts — these can be human-written baselines, simple instructions, or even random phrasings. Run each through the scoring function and record the prompt-score pairs. This initial population gives the optimizer something to learn from.
Seed prompts: “Solve this math problem” (62%), “Think step by step and solve” (71%), “You are a math tutor. Show your work” (68%). These prompt-score pairs form the optimizer’s initial history.
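Seeding amounts to recording prompt-score pairs. A sketch using the hypothetical scores above:

```python
# Hypothetical seed history using the example scores above.
seed_history = [
    ("Solve this math problem", 0.62),
    ("You are a math tutor. Show your work", 0.68),
    ("Think step by step and solve", 0.71),
]
# The OPRO paper lists solutions in ascending score order, so keep
# the history sorted with the strongest candidate last.
seed_history.sort(key=lambda ps: ps[1])
```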
Iterative Optimization Loop
Present the LLM with the meta-prompt: a description of the task, the history of previously tried prompts and their scores, and a request to generate a new prompt that scores higher. The model analyzes patterns in high-scoring solutions and proposes novel candidates. Each new prompt is scored and added to the history, enriching future iterations.
Meta-prompt: “Below are prompts and their accuracy scores on math problems. Generate a new prompt that will achieve higher accuracy. [Previous prompts and scores listed]. New prompt:” — The model might generate: “Break this problem into smaller parts. Solve each part, then combine for the final answer.” (scored at 78%).
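Rendering the history into a meta-prompt is plain string assembly. A sketch (the exact wording of the real OPRO meta-prompt differs; listing solutions in ascending score order follows the paper’s design):

```python
def build_meta_prompt(history, task_description):
    """Render the scored history into the text the optimizer LLM
    sees. Solutions are listed lowest-to-highest score."""
    lines = [task_description, "",
             "Below are previous prompts with their accuracy scores:"]
    for prompt, score in sorted(history, key=lambda ps: ps[1]):
        lines.append(f'text: "{prompt}"  score: {score:.0%}')
    lines.append("")
    lines.append("Write a new prompt that achieves a higher score. New prompt:")
    return "\n".join(lines)

meta = build_meta_prompt(
    [("Solve this math problem", 0.62),
     ("Think step by step and solve", 0.71)],
    "Optimize an instruction for grade-school math word problems.")
```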
Select the Optimal Prompt
After a fixed number of iterations (or when scores plateau), select the highest-scoring prompt from the accumulated history. This optimized prompt can then be deployed in production. The entire optimization trajectory is preserved, providing insight into what prompt characteristics drive performance for this specific task.
After 20 iterations, the top prompt scores 89%: “Read the problem carefully. Identify the known quantities and the unknown. Set up equations step by step, then solve and verify your answer.” This outperforms the best human-written baseline by 18 percentage points.
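Selection is a max over the accumulated history, optionally with a plateau check to decide when to stop. A sketch with illustrative numbers (`patience` and `tolerance` are hypothetical knobs, not from the paper):

```python
def select_best(history, patience=5, tolerance=0.005):
    """Pick the top-scoring prompt and report whether the run had
    plateaued (no meaningful gain over the last `patience` rounds)."""
    best_prompt, best_score = max(history, key=lambda ps: ps[1])
    recent = [s for _, s in history[-patience:]]
    earlier_best = max((s for _, s in history[:-patience]), default=0.0)
    plateaued = max(recent) - earlier_best < tolerance
    return best_prompt, best_score, plateaued

# Illustrative trajectory: scores climb early, then flatten out.
history = [("a", 0.62), ("b", 0.71), ("c", 0.78), ("d", 0.78),
           ("e", 0.782), ("f", 0.78), ("g", 0.779), ("h", 0.781)]
prompt, score, plateaued = select_best(history)
```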
See the Difference
Why systematic optimization outperforms manual prompt writing
Manual Prompt Engineering
You are a helpful assistant. Please solve the following math problem. Show your reasoning step by step.
Accuracy: 71% on benchmark. The prompt “sounds good” to a human but may not align with how the model actually processes instructions internally.
OPRO-Optimized
Read the problem carefully. Identify all given quantities and what you need to find. Set up equations for each relationship, solve them in order, then verify your final answer matches the original problem constraints.
Accuracy: 89% on benchmark. The optimized prompt is more specific, action-oriented, and includes a verification step — patterns the optimizer discovered through iteration rather than human guesswork.
Natural Language Works Too
While structured frameworks and contextual labels are powerful tools, LLMs are exceptionally good at understanding natural language. As long as your prompt contains the actual contextual information needed to create, answer, or deliver the response you’re looking for — the who, what, why, and constraints — the AI can produce complete and accurate results whether you use a formal framework or plain conversational language. But even in 2026, with the best prompts, verifying AI output is always a necessary step.
OPRO in Action
See how iterative optimization discovers better prompts across domains
Task: Classify customer support tickets into categories (billing, technical, account, general).
Seed prompts and scores:
“Classify this support ticket into one category” — 64% accuracy
“Read the ticket and assign it to: billing, technical, account, or general” — 72% accuracy
“You are a support agent. Categorize this ticket” — 69% accuracy
Optimized prompt: “Read the customer’s message below. Identify the primary issue they need resolved. Based on the core problem — not surface keywords — assign exactly one category: billing (payment/charges/invoices), technical (bugs/errors/functionality), account (access/settings/profile), or general (everything else). Output only the category name.”
Score: 91% accuracy — The optimizer discovered that defining category boundaries explicitly and instructing the model to look past surface keywords dramatically improved classification. Always verify AI classifications against ground truth before deploying in production.
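For classification, the scorer can also penalize outputs that are not exactly one category name, which is what the “Output only the category name” instruction targets. A toy sketch with a stubbed model (`ask` is a hypothetical wrapper, not a real API):

```python
CATEGORIES = {"billing", "technical", "account", "general"}

def score_classifier(prompt, tickets, ask):
    """Accuracy of a classification prompt over labeled tickets.
    Outputs are lowercased and stripped; anything outside the
    category set counts as wrong, penalizing verbose answers."""
    hits = 0
    for text, label in tickets:
        out = ask(f"{prompt}\n\nTicket: {text}").strip().lower()
        hits += (out in CATEGORIES and out == label)
    return hits / len(tickets)

# Stub model keyed off a word in the ticket; its second answer is
# verbose, so the scorer correctly marks it wrong.
def stub_ask(full):
    return "billing" if "invoice" in full else "Technical issue."

tickets = [("Where is my invoice?", "billing"),
           ("App crashes on login", "technical")]
acc = score_classifier("Categorize this ticket.", tickets, stub_ask)
```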
Task: Solve logical deduction puzzles from the Big-Bench Hard suite.
Iteration trajectory:
Round 1: “Solve this logic puzzle” — 38%
Round 5: “Think through this step by step before answering” — 54%
Round 10: “List all the constraints first, then test each option against every constraint” — 67%
Final optimized prompt: “First, extract every constraint and rule stated in the problem. Number each constraint. Then, for each answer option, check it against every numbered constraint one at a time. Eliminate any option that violates even one constraint. The correct answer is the only option that satisfies all constraints.”
Score: 82% accuracy — The optimizer converged on a constraint-checking strategy that mirrors formal verification methods. Notice how the prompt evolved from vague (“solve this”) to highly structured through pure optimization pressure. Always cross-check AI-generated logic solutions independently.
Task: Generate concise, accurate summaries of technical documents. Scored by a combination of factual accuracy, completeness, and brevity.
Seed prompts:
“Summarize this document” — 55% composite score
“Write a brief summary covering the main points” — 61% composite score
Optimized prompt: “Read the document completely. Identify the three most important claims or findings. For each, write one sentence stating the claim and its supporting evidence. End with one sentence on the document’s overall conclusion. Do not include background information or methodology unless it is essential to understanding a key finding.”
Score: 84% composite — The optimizer learned that constraining the summary structure (three claims + conclusion) and explicitly excluding low-value content produced tighter, more accurate summaries than open-ended instructions. Always verify that AI summaries faithfully represent the source material.
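A composite score like this one weights several criteria into a single number the optimizer can climb. The weights and brevity cutoff below are illustrative assumptions; in practice the accuracy and coverage terms would come from an LLM judge or human labels.

```python
def composite_score(accuracy, coverage, word_count,
                    weights=(0.5, 0.3, 0.2), max_words=60):
    """Toy composite used to rank summaries: factual accuracy and
    claim coverage in [0, 1], plus a brevity term that decays
    linearly to zero at `max_words`. Weights are illustrative."""
    w_acc, w_cov, w_brev = weights
    brevity = max(0.0, 1.0 - word_count / max_words)
    return w_acc * accuracy + w_cov * coverage + w_brev * brevity

# A 30-word summary that is accurate and fairly complete.
s = composite_score(accuracy=0.9, coverage=0.8, word_count=30)
```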
When to Use OPRO
Best for systematic prompt improvement with measurable objectives
Perfect For
When you have a scoring metric and need the highest-performing prompt for a deployed system — OPRO systematically finds prompts humans would never think to write.
Maximizing performance on standardized evaluations where accuracy gains of even a few percentage points matter significantly.
Structured tasks with clear right/wrong answers where prompt quality directly impacts measurable output accuracy.
Exploring the prompt space beyond human intuition — OPRO often finds effective phrasings that are counterintuitive but empirically superior.
Skip It When
If you need a prompt for a single use, the overhead of setting up scoring and running optimization iterations is not justified.
Tasks without a clear scoring function — creative writing, brainstorming, or open-ended conversation — cannot provide the feedback signal OPRO requires.
OPRO requires many LLM calls across iterations — the optimization loop can consume significant compute resources, making it impractical for low-budget projects.
Use Cases
Where OPRO delivers the most value
Enterprise NLP Pipelines
Optimize classification, extraction, and routing prompts across production systems where even small accuracy improvements translate to significant business value.
Research Benchmarking
Discover optimal prompts for standardized evaluations, ensuring models are tested at their true capability rather than being limited by suboptimal instruction phrasing.
Automated QA Systems
Optimize grading and evaluation prompts to align model judgments with human expert consensus on quality assessment tasks.
Chatbot Intent Recognition
Iteratively optimize the system prompt for intent classification so the chatbot correctly routes user queries to the right handler with maximum precision.
Medical Data Extraction
Optimize prompts for extracting structured data from clinical notes, where accuracy is critical and the cost of errors is high.
Content Moderation
Systematically optimize safety and moderation prompts to achieve the best balance of sensitivity and specificity for harmful content detection.
Where OPRO Fits
OPRO bridges manual prompt writing and fully automated prompt engineering
OPRO demonstrated a powerful principle: LLMs can improve their own instructions when given performance feedback. This concept has since been operationalized in production-grade tools like DSPy (which compiles natural language programs into optimized prompts) and its MIPRO optimizer (which jointly optimizes instructions and few-shot demonstrations). If you are building prompt optimization into a production pipeline, consider these more mature tools that build on OPRO’s foundational insight.