APE (Automatic Prompt Engineer)
Why write prompts yourself when the model can write better ones? APE uses LLMs to propose candidate instructions, evaluates each on a target task, and selects the best-performing prompt through iterative search — discovering that machines often engineer more effective prompts than humans.
Introduced: APE was published in 2022 by Zhou et al. The paper demonstrated a striking finding: LLMs could generate prompt instructions that outperformed carefully human-crafted ones. Most famously, APE discovered that “Let’s work this out in a step by step way to be sure we have the right answer” significantly outperformed the original “Let’s think step by step” zero-shot chain-of-thought prompt. The technique framed prompt engineering as a program synthesis problem — search through the space of possible instructions to find the one that maximizes task performance.
Modern LLM Status: APE was a landmark paper showing that LLMs could engineer better prompts than humans. Its principle has since been absorbed into prompt-optimization tools and successor techniques such as DSPy and OPRO. The standalone technique is less common now, but the core insight — automated prompt search — is foundational to modern prompt optimization pipelines. Understanding APE helps practitioners grasp why tools like DSPy work and when automated optimization outperforms manual prompt engineering.
Prompt Engineering as Search
Manual prompt engineering is a form of trial and error. You write an instruction, test it, notice failures, revise it, and repeat. This process is slow, biased by your assumptions about what makes a good prompt, and limited by your creativity in imagining alternative phrasings. APE automates this entire loop by treating it as what it fundamentally is: a search problem.
APE uses LLMs to explore the instruction space systematically. Given a task description or examples, it generates many candidate instructions — not just one human-written attempt. Each candidate is evaluated on a held-out set of examples, scored by accuracy, and the top performers are selected. Optional iterative refinement generates variations of the best candidates for further evaluation, progressively narrowing toward the optimal instruction.
Think of it like A/B testing at scale for prompts. Instead of testing your one idea against a control, you generate dozens of candidates and let data decide which performs best — often discovering phrasings you would never have thought to try.
APE’s most celebrated finding was that the model-generated instruction “Let’s work this out in a step by step way to be sure we have the right answer” outperformed the famous human-written “Let’s think step by step” by a significant margin on reasoning benchmarks. The difference? The APE version adds a quality motivation (“to be sure we have the right answer”) that subtly encourages the model to be more careful. No human prompt engineer had thought to add that clause — the machine found it through search.
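In code, the generate-evaluate-select loop is small. Here is a toy sketch under stated assumptions: `propose` and `accuracy` are hypothetical stand-ins for real model calls (proposing candidates and scoring them on held-out examples), stubbed with deterministic logic so the example runs offline.

```python
# A minimal sketch of the APE loop. In practice, `propose` asks a model
# for candidate instructions and `accuracy` runs each candidate through
# the target model on labeled examples; both are stubbed here.
def propose(task: str, n: int) -> list[str]:
    templates = [
        f"{task}.",
        f"Read the input carefully, then {task.lower()}.",
        f"{task}. Explain your reasoning, then give the final answer.",
        f"{task}. Be sure your answer is correct before responding.",
    ]
    return templates[:n]

def accuracy(instruction: str, eval_set: list[tuple[str, str]]) -> float:
    # Simulated scorer: stands in for measuring accuracy on eval_set.
    return 0.5 + 0.005 * len(instruction)

eval_set = [("input text", "label")] * 50   # held-out labeled examples
candidates = propose("Classify the sentiment as positive or negative", 4)
best = max(candidates, key=lambda c: accuracy(c, eval_set))
print(best)
```

The real version swaps the two stub functions for API calls; the selection logic itself does not change.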
The APE Process
Four stages from task definition to optimized instruction
Define the Target Task
Start with the task you want to optimize a prompt for. Provide either a natural language task description, a set of input-output examples demonstrating the desired behavior, or both. You also need a held-out evaluation set — examples with known correct answers that will be used to score candidate instructions. The quality of your evaluation set determines the quality of the optimization.
Task: Sentiment classification. Training examples: 20 labeled reviews. Evaluation set: 50 labeled reviews held out for scoring. Metric: classification accuracy.
Generate Candidate Instructions
Use an LLM to generate multiple candidate instructions for the task. The generation prompt asks the model to propose various ways to instruct another model to perform the task. APE typically generates 20-50 candidates in this phase, using different generation strategies: forward (describe the task), reverse (infer instruction from examples), and paraphrase (rephrase existing instructions). Diversity in candidates is critical for exploring the search space effectively.
Candidate 1: “Classify the sentiment as positive or negative.”
Candidate 2: “Read the review and determine whether the author’s overall feeling is positive or negative.”
Candidate 3: “Based on the tone and content, label this review as positive or negative sentiment.”
...(20+ more candidates)
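The three generation strategies amount to three different meta-prompts sent to the proposer model. A sketch, assuming a hypothetical `call_llm` client stubbed with a fixed response so it runs offline:

```python
# Hypothetical meta-prompts for APE's three generation strategies.
# `call_llm` stands in for whatever model client you use.
def call_llm(prompt: str) -> str:
    return "Classify the review's overall sentiment as positive or negative."

def forward_prompt(task_description: str) -> str:
    # Forward: describe the task, ask for an instruction.
    return f"Write an instruction that tells a model to: {task_description}"

def reverse_prompt(examples: list[tuple[str, str]]) -> str:
    # Reverse: infer the instruction from input-output demonstrations.
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return (
        "I gave a friend an instruction. Based on these input-output pairs, "
        f"what was the instruction?\n\n{demos}\n\nThe instruction was:"
    )

def paraphrase_prompt(instruction: str) -> str:
    # Paraphrase: rephrase an existing instruction to add diversity.
    return f"Generate a variation of this instruction, keeping its meaning: {instruction}"

examples = [("Great phone, love it!", "positive"), ("Broke in a week.", "negative")]
candidates = [
    call_llm(forward_prompt("classify review sentiment as positive or negative")),
    call_llm(reverse_prompt(examples)),
    call_llm(paraphrase_prompt("Classify the sentiment as positive or negative.")),
]
```

Each strategy explores a different region of the instruction space, which is why combining them yields more diverse candidate pools than any one strategy alone.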
Evaluate and Score Each Candidate
Run each candidate instruction through the target model on the evaluation set. Score each instruction by how accurately it produces correct outputs. This step is the heart of APE — it provides objective, data-driven ranking of instructions rather than relying on human intuition about what sounds like a good prompt. The evaluation is automated and can test hundreds of candidates efficiently.
Candidate 1: 78% accuracy on eval set
Candidate 2: 85% accuracy on eval set
Candidate 3: 82% accuracy on eval set
Candidate 17: 91% accuracy on eval set (unexpected winner)
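The scoring step above is a plain accuracy loop. A minimal sketch, with `classify` as a hypothetical stub for the target model (its behavior is simulated so the richer instruction wins, mirroring the pattern in the scores above):

```python
# Evaluate each candidate instruction by exact-match accuracy on a
# held-out set. `classify` stubs the target model deterministically.
def classify(instruction: str, text: str) -> str:
    if "tone" in instruction and "subtle" in text:
        return "positive"   # richer instruction catches understated praise
    return "positive" if "good" in text else "negative"

eval_set = [
    ("This product is good.", "positive"),
    ("Terrible experience.", "negative"),
    ("A subtle, understated delight.", "positive"),
]

def score(instruction: str) -> float:
    hits = sum(classify(instruction, text) == label for text, label in eval_set)
    return hits / len(eval_set)

candidates = [
    "Classify the sentiment as positive or negative.",
    "Based on the tone and content, label this review as positive or negative sentiment.",
]
ranking = sorted(candidates, key=score, reverse=True)
```

Because scoring is fully automated, the same loop scales from two candidates to hundreds with no change to the logic.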
Select and Optionally Refine
Select the top-performing instruction(s). Optionally, generate variations of the best candidates by paraphrasing, extending, or combining them, then evaluate again. This iterative refinement can squeeze additional performance from the search process. The final output is a rigorously tested, data-validated instruction that you can deploy with confidence. Always verify the winning instruction on a final holdout set to guard against overfitting to the evaluation data.
Best instruction after 2 rounds: “Read the customer review carefully. Determine whether the overall sentiment expressed is positive or negative. Consider the balance of positive and negative statements, but weight the concluding sentiment most heavily. Respond with only ‘positive’ or ‘negative.’” — 93% accuracy, up from 78% for the naive baseline.
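A refinement round plus the holdout check can be sketched as follows. The `paraphrase`, `eval_score`, and `holdout_score` functions are hypothetical placeholders: the first would call a model to generate variations, and the two scorers would run the candidate on labeled examples; all three are simulated here.

```python
# One refinement round: paraphrase the best candidates, re-score, and
# confirm the winner on a separate holdout set. All scoring is simulated.
def paraphrase(instruction: str) -> list[str]:
    return [
        instruction + " Double-check your answer.",
        "Carefully " + instruction[0].lower() + instruction[1:],
    ]

def eval_score(instruction: str) -> float:     # simulated eval-set accuracy
    return 0.7 + 0.002 * len(instruction)

def holdout_score(instruction: str) -> float:  # simulated holdout accuracy
    return eval_score(instruction) - 0.03      # slight drop: mild overfitting

top = ["Classify the sentiment as positive or negative."]
pool = top + [v for c in top for v in paraphrase(c)]
winner = max(pool, key=eval_score)
final = holdout_score(winner)   # guard against overfitting to the eval set
```

The gap between `eval_score(winner)` and `final` is the overfitting signal: a large drop means the winning instruction exploited quirks of the evaluation set rather than genuinely improving the task.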
See the Difference
Why automated search discovers prompts humans miss
Manual Prompt Engineering
Human writes: “Let’s think step by step.” Tests it on a few examples. Notices some failures. Tweaks to: “Think through this carefully, step by step.” Tests again. After 5-10 iterations over several hours, settles on the best version they can think of.
A prompt that works well for the cases the human thought to test, limited by one person’s creativity and assumptions about effective phrasing.
APE
APE generates 30 candidate instructions automatically. Each is evaluated against 50 held-out examples. Top 5 candidates are refined through paraphrase generation. Second-round evaluation selects the winner: a phrasing with a quality-motivation clause no human had considered.
A prompt validated against objective metrics, discovered through systematic search rather than intuition. Often outperforms the best human-written alternative by 5-15 percentage points.
APE in Action
See how automated prompt search discovers better instructions
Task: Classify support tickets by priority (P1-Critical, P2-High, P3-Medium, P4-Low).
Human baseline prompt: “Classify this support ticket by priority level.” — 71% accuracy.
APE generates 25 candidates. Top 3 after evaluation:
- “Read the support ticket and assign a priority. P1 for outages or data loss, P2 for degraded service, P3 for feature issues, P4 for questions.” — 86%
- “Determine urgency: is this an emergency (P1), serious problem (P2), minor issue (P3), or informational (P4)?” — 83%
- “Classify priority by business impact: P1=revenue impact, P2=productivity impact, P3=inconvenience, P4=no immediate impact.” — 89%
The winning instruction (89%) includes specific criteria for each priority level framed in terms of business impact — a framing no human had tried. APE discovered that explicit category definitions dramatically improve classification accuracy. Note: Always validate the APE-discovered prompt on a separate holdout set before deploying to production, as optimization against the evaluation set can overfit to its specific distribution.
Task: Mathematical word problems requiring multi-step reasoning.
Human baseline: “Let’s think step by step.” — 73% accuracy on GSM8K-style problems.
APE generates 30 candidates by paraphrasing and extending the baseline. Selected findings:
- “Let’s work this out in a step by step way to be sure we have the right answer.” — 82%
- “Break this problem into smaller parts. Solve each part carefully, then combine the results.” — 79%
- “First, identify what is being asked. Then list the known quantities. Work through the math one step at a time, checking each step.” — 84%
APE discovered that adding explicit quality motivations (“to be sure we have the right answer”) and procedural structure (“identify, list, work, check”) significantly improves reasoning accuracy. The 84% candidate adds a self-checking mechanism that the human baseline lacked entirely. This demonstrates APE’s power: it explores phrasings and structural variations that manual iteration rarely discovers. Always verify results with your own test cases — benchmark numbers may not transfer directly to your specific use case.
Task: Summarize technical documents into executive-friendly briefs (evaluated by human preference and factual accuracy).
Human baseline: “Summarize this document for a non-technical executive audience in 3-4 sentences.” — 65% preference rate vs. human-written summaries.
APE generates 20 candidates. Top performers:
- “Write an executive summary: start with the business impact, then the key finding, then the recommended action. No jargon. 3 sentences maximum.” — 78%
- “A busy CEO has 30 seconds to read this. What do they need to know? Focus on decisions they need to make, not technical details.” — 81%
- “Translate this technical document into plain language for leadership. Lead with ‘so what’ — why should they care? Then the essential facts. Then what happens next. Three sentences.” — 83%
APE discovered that persona-anchoring (“busy CEO”) and structure-prescribing (“lead with so what”) dramatically outperform the generic “summarize for non-technical audience” instruction. The winning prompt frames the task around the reader’s decision-making needs rather than the document’s content structure — a subtle but impactful shift in framing. Important: Human evaluation of summary quality is inherently subjective; always have multiple reviewers validate AI-generated summaries against source documents for factual accuracy.
When to Use APE
Best for high-stakes tasks where prompt quality directly impacts outcomes
Perfect For
When a prompt runs thousands of times in production and each percentage point of accuracy translates to measurable business impact.
When competing on evaluation metrics and you need to squeeze maximum performance from a model without fine-tuning.
When manual prompt iteration has stalled and you cannot improve performance further through human creativity alone.
When the same task runs on different models — APE can discover model-specific optimal instructions rather than using a one-size-fits-all prompt.
Skip It When
When you are writing a prompt for a single use — the overhead of generating and evaluating dozens of candidates is not justified for a one-off prompt.
When you lack labeled examples to score candidates against — without ground truth to drive the search, APE cannot objectively compare instructions.
When the application is cost-sensitive — APE requires many LLM calls (generation plus evaluation for every candidate), and manual prompt engineering may be more efficient.
Use Cases
Where APE delivers the most value
Content Moderation
Optimize classification prompts for detecting policy violations across millions of posts — even small accuracy improvements at scale prevent thousands of misclassifications daily.
Document Processing
Optimize extraction prompts for processing contracts, invoices, or medical records at scale — where precision on field extraction directly impacts downstream workflows.
Medical Triage Prompts
Optimize symptom assessment prompts where classification accuracy has direct patient safety implications — APE’s systematic evaluation prevents reliance on untested human intuition.
Chatbot Response Quality
Optimize system prompts for customer-facing chatbots by evaluating candidate instructions against user satisfaction metrics and resolution rates.
Security Alert Classification
Optimize prompts for classifying security alerts by severity — where false negatives on critical threats have severe consequences and prompt accuracy is paramount.
Model Migration
When switching between LLM providers, use APE to discover model-specific optimal prompts rather than assuming the same instruction works equally well across all models.
Where APE Fits
APE bridges manual prompt engineering and fully automated optimization pipelines
If you find APE valuable, consider graduating to DSPy — a framework that industrializes APE’s principles. DSPy treats prompts as programs with optimizable parameters, automatically compiling natural language signatures into optimized prompts through the same generate-evaluate-select loop that APE pioneered, but with more sophisticated search strategies and multi-stage pipeline support.