OPRO (Optimization by Prompting)
What if the best prompt for a task could be discovered automatically — by the LLM itself? OPRO treats the model as its own optimizer, describing the optimization problem in natural language and iteratively refining prompts based on scored results until it finds instructions that outperform strong human-written baselines.
Introduced: OPRO (Optimization by Prompting) was published in 2023 by Yang et al. at Google DeepMind. The technique demonstrated that LLMs can serve as optimizers when the optimization task is described in natural language. Rather than using gradient-based methods to tune prompts, OPRO presents the model with previously evaluated solutions and their scores, then asks it to propose new, better solutions. The approach discovered prompts that outperformed human-designed ones by up to 50% on the Big-Bench Hard benchmark suite.
Modern LLM Status: OPRO from Google DeepMind remains influential in 2026. The meta-optimization approach — using LLMs to optimize prompts — has been integrated into tools like DSPy and its MIPRO optimizer. The concept of iterative prompt refinement guided by performance metrics is now a core practice in prompt engineering. While the original paper focused on optimization via scoring, the broader principle that models can improve their own instructions has become foundational to automated prompt engineering workflows.
Let the Model Optimize Itself
Traditional prompt engineering relies on human intuition: you write a prompt, test it, tweak a word, test again. This manual cycle is slow, inconsistent, and limited by whatever phrasings a human can imagine. OPRO replaces this trial-and-error loop with a systematic optimization process where the LLM itself proposes better prompts.
The key mechanism is deceptively simple. You describe the optimization task in natural language: “Here are some prompts I tried and their accuracy scores. Generate a new prompt that will score higher.” The model sees the history of what worked and what didn’t, identifies patterns in high-scoring solutions, and proposes novel prompt phrasings that combine successful elements in new ways.
Think of it like a coach reviewing game film. Instead of guessing what play to call next, the coach studies which plays scored and which failed, then designs a new play that combines the winning elements — except here, the coach and the players are the same LLM.
Human prompt engineers are constrained by their vocabulary and assumptions about what “sounds right.” OPRO discovered that some of the highest-performing prompts used unexpected phrasings that no human would naturally write: the paper’s best-known discovery, “Take a deep breath and work on this problem step by step,” outscored the classic human-written “Let’s think step by step” on grade-school math problems. The model explores a prompt space unconstrained by human biases about what good instructions should look like, finding solutions in the gaps between human intuition.
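The whole loop fits in a few lines of Python. This is a toy illustration, not the paper’s implementation: `propose` stands in for the optimizer LLM call, `score` for the evaluation function, and the stubs below exist only so the sketch runs end to end.

```python
def opro_loop(propose, score, seed_prompts, iterations=20):
    """Minimal OPRO-style loop: keep a scored history, ask the
    proposer for a better prompt, score it, and repeat."""
    history = [(p, score(p)) for p in seed_prompts]
    for _ in range(iterations):
        history.sort(key=lambda ps: ps[1])   # ascending score, best last
        candidate = propose(history)         # optimizer LLM call in practice
        history.append((candidate, score(candidate)))
    return max(history, key=lambda ps: ps[1])  # best (prompt, score) pair

# Toy stand-ins so the sketch runs without a model: the "optimizer"
# appends a refinement phrase, and "accuracy" grows with prompt length.
def toy_propose(history):
    best_prompt, _ = max(history, key=lambda ps: ps[1])
    return best_prompt + " Verify your answer."

def toy_score(prompt):
    return min(0.95, 0.5 + 0.01 * len(prompt.split()))

best, best_score = opro_loop(toy_propose, toy_score,
                             ["Solve this math problem"], iterations=5)
```

In a real run, `propose` renders the scored history into a meta-prompt and calls the model, and `score` evaluates each candidate against a held-out example set.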
The OPRO Process
Four stages from optimization problem to optimal prompt
Define the Optimization Task
Describe the task you want optimized prompts for — this could be classification, reasoning, translation, or any well-defined objective. You need a scoring function that can evaluate how well a prompt performs on a representative set of examples.
“I need an instruction prompt that maximizes accuracy on grade-school math word problems. I have 100 test problems with known answers to score against.”
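A scoring function is just a loop over labeled examples. The sketch below assumes a hypothetical `ask` callable that sends the assembled prompt to a model; here it is stubbed with a lookup table so the example is self-contained.

```python
def score_prompt(prompt, examples, ask):
    """Fraction of examples answered correctly when `prompt` is
    prepended to each question. `ask(full_prompt)` wraps the LLM."""
    correct = 0
    for question, answer in examples:
        prediction = ask(f"{prompt}\n\nProblem: {question}")
        if prediction.strip() == answer.strip():
            correct += 1
    return correct / len(examples)

# Stub "model" answering from a lookup table, purely for illustration.
answers = {"2 + 3": "5", "10 - 4": "6", "3 * 3": "8"}  # last one is wrong
stub_ask = lambda full: answers.get(full.split("Problem: ")[1], "?")

examples = [("2 + 3", "5"), ("10 - 4", "6"), ("3 * 3", "9")]
acc = score_prompt("Solve:", examples, stub_ask)  # 2 of 3 correct
```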
Seed with Initial Solutions
Start with a few candidate prompts — these can be human-written baselines, simple instructions, or even random phrasings. Run each through the scoring function and record the prompt-score pairs. This initial population gives the optimizer something to learn from.
Seed prompts: “Solve this math problem” (62%), “Think step by step and solve” (71%), “You are a math tutor. Show your work” (68%). These prompt-score pairs form the optimizer’s initial history.
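Seeding amounts to recording prompt-score pairs. A sketch using the hypothetical scores above:

```python
# Hypothetical seed history using the example scores above.
seed_history = [
    ("Solve this math problem", 0.62),
    ("You are a math tutor. Show your work", 0.68),
    ("Think step by step and solve", 0.71),
]
# The OPRO paper lists solutions in ascending score order, so keep
# the history sorted with the strongest candidate last.
seed_history.sort(key=lambda ps: ps[1])
```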
Iterative Optimization Loop
Present the LLM with the meta-prompt: a description of the task, the history of previously tried prompts and their scores, and a request to generate a new prompt that scores higher. The model analyzes patterns in high-scoring solutions and proposes novel candidates. Each new prompt is scored and added to the history, enriching future iterations.
Meta-prompt: “Below are prompts and their accuracy scores on math problems. Generate a new prompt that will achieve higher accuracy. [Previous prompts and scores listed]. New prompt:” — The model might generate: “Break this problem into smaller parts. Solve each part, then combine for the final answer.” (scored at 78%).
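Rendering the history into a meta-prompt is plain string assembly. A sketch (the exact wording of the real OPRO meta-prompt differs; listing solutions in ascending score order follows the paper’s design):

```python
def build_meta_prompt(history, task_description):
    """Render the scored history into the text the optimizer LLM
    sees. Solutions are listed lowest-to-highest score."""
    lines = [task_description, "",
             "Below are previous prompts with their accuracy scores:"]
    for prompt, score in sorted(history, key=lambda ps: ps[1]):
        lines.append(f'text: "{prompt}"  score: {score:.0%}')
    lines.append("")
    lines.append("Write a new prompt that achieves a higher score. New prompt:")
    return "\n".join(lines)

meta = build_meta_prompt(
    [("Solve this math problem", 0.62),
     ("Think step by step and solve", 0.71)],
    "Optimize an instruction for grade-school math word problems.")
```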
Select the Optimal Prompt
After a fixed number of iterations (or when scores plateau), select the highest-scoring prompt from the accumulated history. This optimized prompt can then be deployed in production. The entire optimization trajectory is preserved, providing insight into what prompt characteristics drive performance for this specific task.
After 20 iterations, the top prompt scores 89%: “Read the problem carefully. Identify the known quantities and the unknown. Set up equations step by step, then solve and verify your answer.” This outperforms the best human-written baseline by 18 percentage points.
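Selection is a max over the accumulated history, optionally with a plateau check to decide when to stop. A sketch with illustrative numbers (`patience` and `tolerance` are hypothetical knobs, not from the paper):

```python
def select_best(history, patience=5, tolerance=0.005):
    """Pick the top-scoring prompt and report whether the run had
    plateaued (no meaningful gain over the last `patience` rounds)."""
    best_prompt, best_score = max(history, key=lambda ps: ps[1])
    recent = [s for _, s in history[-patience:]]
    earlier_best = max((s for _, s in history[:-patience]), default=0.0)
    plateaued = max(recent) - earlier_best < tolerance
    return best_prompt, best_score, plateaued

# Illustrative trajectory: scores climb early, then flatten out.
history = [("a", 0.62), ("b", 0.71), ("c", 0.78), ("d", 0.78),
           ("e", 0.782), ("f", 0.78), ("g", 0.779), ("h", 0.781)]
prompt, score, plateaued = select_best(history)
```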
See the Difference
Why systematic optimization outperforms manual prompt writing
Manual Prompt Engineering
You are a helpful assistant. Please solve the following math problem. Show your reasoning step by step.
Accuracy: 71% on benchmark. The prompt “sounds good” to a human but may not align with how the model actually processes instructions internally.
OPRO-Optimized
Read the problem carefully. Identify all given quantities and what you need to find. Set up equations for each relationship, solve them in order, then verify your final answer matches the original problem constraints.
Accuracy: 89% on benchmark. The optimized prompt is more specific, action-oriented, and includes a verification step — patterns the optimizer discovered through iteration rather than human guesswork.
Natural Language Works Too
While structured frameworks and contextual labels are powerful tools, LLMs are exceptionally good at understanding natural language. As long as your prompt contains the actual contextual information needed to create, answer, or deliver the response you’re looking for — the who, what, why, and constraints — the AI can produce complete and accurate results whether you use a formal framework or plain conversational language. But even in 2026, with the best prompts, verifying AI output is always a necessary step.
OPRO in Action
See how iterative optimization discovers better prompts across domains
Task: Classify customer support tickets into categories (billing, technical, account, general).
Seed prompts and scores:
“Classify this support ticket into one category” — 64% accuracy
“Read the ticket and assign it to: billing, technical, account, or general” — 72% accuracy
“You are a support agent. Categorize this ticket” — 69% accuracy
Optimized prompt: “Read the customer’s message below. Identify the primary issue they need resolved. Based on the core problem — not surface keywords — assign exactly one category: billing (payment/charges/invoices), technical (bugs/errors/functionality), account (access/settings/profile), or general (everything else). Output only the category name.”
Score: 91% accuracy — The optimizer discovered that defining category boundaries explicitly and instructing the model to look past surface keywords dramatically improved classification. Always verify AI classifications against ground truth before deploying in production.
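For classification, the scorer can also penalize outputs that are not exactly one category name, which is what the “Output only the category name” instruction targets. A toy sketch with a stubbed model (`ask` is a hypothetical wrapper, not a real API):

```python
CATEGORIES = {"billing", "technical", "account", "general"}

def score_classifier(prompt, tickets, ask):
    """Accuracy of a classification prompt over labeled tickets.
    Outputs are lowercased and stripped; anything outside the
    category set counts as wrong, penalizing verbose answers."""
    hits = 0
    for text, label in tickets:
        out = ask(f"{prompt}\n\nTicket: {text}").strip().lower()
        hits += (out in CATEGORIES and out == label)
    return hits / len(tickets)

# Stub model keyed off a word in the ticket; its second answer is
# verbose, so the scorer correctly marks it wrong.
def stub_ask(full):
    return "billing" if "invoice" in full else "Technical issue."

tickets = [("Where is my invoice?", "billing"),
           ("App crashes on login", "technical")]
acc = score_classifier("Categorize this ticket.", tickets, stub_ask)
```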
Task: Solve logical deduction puzzles from the Big-Bench Hard suite.
Iteration trajectory:
Round 1: “Solve this logic puzzle” — 38%
Round 5: “Think through this step by step before answering” — 54%
Round 10: “List all the constraints first, then test each option against every constraint” — 67%
Final optimized prompt: “First, extract every constraint and rule stated in the problem. Number each constraint. Then, for each answer option, check it against every numbered constraint one at a time. Eliminate any option that violates even one constraint. The correct answer is the only option that satisfies all constraints.”
Score: 82% accuracy — The optimizer converged on a constraint-checking strategy that mirrors formal verification methods. Notice how the prompt evolved from vague (“solve this”) to highly structured through pure optimization pressure. Always cross-check AI-generated logic solutions independently.
Task: Generate concise, accurate summaries of technical documents. Scored by a combination of factual accuracy, completeness, and brevity.
Seed prompts:
“Summarize this document” — 55% composite score
“Write a brief summary covering the main points” — 61% composite score
Optimized prompt: “Read the document completely. Identify the three most important claims or findings. For each, write one sentence stating the claim and its supporting evidence. End with one sentence on the document’s overall conclusion. Do not include background information or methodology unless it is essential to understanding a key finding.”
Score: 84% composite — The optimizer learned that constraining the summary structure (three claims + conclusion) and explicitly excluding low-value content produced tighter, more accurate summaries than open-ended instructions. Always verify that AI summaries faithfully represent the source material.
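A composite score like this one weights several criteria into a single number the optimizer can climb. The weights and brevity cutoff below are illustrative assumptions; in practice the accuracy and coverage terms would come from an LLM judge or human labels.

```python
def composite_score(accuracy, coverage, word_count,
                    weights=(0.5, 0.3, 0.2), max_words=60):
    """Toy composite used to rank summaries: factual accuracy and
    claim coverage in [0, 1], plus a brevity term that decays
    linearly to zero at `max_words`. Weights are illustrative."""
    w_acc, w_cov, w_brev = weights
    brevity = max(0.0, 1.0 - word_count / max_words)
    return w_acc * accuracy + w_cov * coverage + w_brev * brevity

# A 30-word summary that is accurate and fairly complete.
s = composite_score(accuracy=0.9, coverage=0.8, word_count=30)
```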
When to Use OPRO
Best for systematic prompt improvement with measurable objectives
Perfect For
When you have a scoring metric and need the highest-performing prompt for a deployed system — OPRO systematically finds prompts humans would never think to write.
Maximizing performance on standardized evaluations where accuracy gains of even a few percentage points matter significantly.
Structured tasks with clear right/wrong answers where prompt quality directly impacts measurable output accuracy.
Exploring the prompt space beyond human intuition — OPRO often finds effective phrasings that are counterintuitive but empirically superior.
Skip It When
If you need a prompt for a single use, the overhead of setting up scoring and running optimization iterations is not justified.
Tasks without a clear scoring function — creative writing, brainstorming, or open-ended conversation — cannot provide the feedback signal OPRO requires.
OPRO requires many LLM calls across iterations — the optimization loop can consume significant compute resources, making it impractical for low-budget projects.
Use Cases
Where OPRO delivers the most value
Enterprise NLP Pipelines
Optimize classification, extraction, and routing prompts across production systems where even small accuracy improvements translate to significant business value.
Research Benchmarking
Discover optimal prompts for standardized evaluations, ensuring models are tested at their true capability rather than being limited by suboptimal instruction phrasing.
Automated QA Systems
Optimize grading and evaluation prompts to align model judgments with human expert consensus on quality assessment tasks.
Chatbot Intent Recognition
Iteratively optimize the system prompt for intent classification so the chatbot correctly routes user queries to the right handler with maximum precision.
Medical Data Extraction
Optimize prompts for extracting structured data from clinical notes, where accuracy is critical and the cost of errors is high.
Content Moderation
Systematically optimize safety and moderation prompts to achieve the best balance of sensitivity and specificity for harmful content detection.
Where OPRO Fits
OPRO bridges manual prompt writing and fully automated prompt engineering
OPRO demonstrated a powerful principle: LLMs can improve their own instructions when given performance feedback. This concept has since been operationalized in production-grade tools like DSPy (which compiles natural language programs into optimized prompts) and its MIPRO optimizer (which jointly optimizes instructions and few-shot demonstrations). If you are building prompt optimization into a production pipeline, consider these more mature tools that build on OPRO’s foundational insight.