Self-Verification
Work backwards from the answer to catch errors — the mathematical proof-checker for AI reasoning.
Background: Self-Verification builds on backward reasoning and constraint-checking concepts from formal methods and mathematical proof theory. As a prompting technique, it gained traction in 2022-2023 as researchers demonstrated that explicitly asking LLMs to verify their own answers — by substituting solutions back into problems — significantly improved accuracy on math, logic, and constraint-satisfaction tasks.
Modern LLM Status: Self-Verification remains a valuable and practical prompting technique. While modern LLMs (Claude, GPT-4) show improved reasoning capabilities, they still benefit significantly from explicit verification prompts, especially on multi-step math problems and constraint-heavy tasks. Some models now perform implicit verification in their extended thinking modes, but explicit backward-checking prompts remain more reliable in 2025-2026.
The Backward Check
Self-Verification applies a principle every math teacher knows: checking your work is easier than doing it right the first time. After the model generates an answer, it reverses direction — plugging the answer back into the original problem to see if everything holds up.
This works because verification and generation take fundamentally different paths. Generating the right answer requires exploring a vast solution space, while verification asks a narrower, binary question: "Given this answer, does the original problem check out?" This asymmetry means a model that makes mistakes during generation can often catch those same mistakes during verification.
Finding a needle in a haystack is hard. But once someone hands you a needle, checking whether it came from that haystack is easy. Self-Verification exploits this fundamental asymmetry between search and validation.
Step 1: Generate an answer to the problem.
Step 2: Formulate verification conditions — what must be true if this answer is correct?
Step 3: Test each condition against the answer.
Step 4: If any condition fails, flag the error and regenerate.
Three Verification Strategies
Backward Verification
Substitute the answer back into the original problem and check if it satisfies all equations or conditions. This is the gold standard for math, logic puzzles, and constraint satisfaction problems — if x = 5, does 3x + 7 actually equal 22?
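In code, backward verification is just substitution. A minimal sketch in Python, using the 3x + 7 = 22 example above:

```python
# Backward verification: plug the candidate answer back into the
# original equation and check that both sides agree.
def verify_solution(lhs, rhs, candidate):
    """Return True if lhs(candidate) equals the required right-hand side."""
    return lhs(candidate) == rhs

equation = lambda x: 3 * x + 7               # left-hand side of 3x + 7 = 22
passes = verify_solution(equation, 22, 5)    # x = 5 gives 22, passes
fails = verify_solution(equation, 22, 6)     # x = 6 gives 25, fails
```

The check never re-derives x; it only confirms that the proposed value satisfies the original equation.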
Constraint Checking
Extract every explicit and implicit constraint from the problem, then systematically verify each one against the proposed answer. Catches partial solutions that satisfy some requirements but miss others — the "forgot about the edge case" problem.
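Constraint checking can be made mechanical by naming each requirement and testing it independently. A sketch, with an invented scheduling answer and invented constraint names:

```python
# Constraint checking: enumerate every requirement, then test each one
# against the proposed answer. This catches partial solutions that pass
# some checks but not others.
def check_all(answer, constraints):
    """Map each named constraint to a pass/fail result."""
    return {name: test(answer) for name, test in constraints.items()}

meeting = {"start": 14, "end": 15, "attendees": 10, "room_capacity": 8}
constraints = {
    "within_business_hours": lambda m: 9 <= m["start"] and m["end"] <= 17,
    "room_fits_attendees": lambda m: m["attendees"] <= m["room_capacity"],
}
results = check_all(meeting, constraints)
# within_business_hours passes, but room_fits_attendees fails:
# 10 attendees in a room for 8, the "forgot about the edge case" problem.
```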
Validation Checking
Apply common-sense reasonableness tests even when formal verification isn't possible. Does the population figure seem plausible? Is the date within a realistic range? Does the code compile? Catches wildly wrong answers that pass syntactic checks.
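Validation checks reduce to coarse range and sanity tests. The bounds below are illustrative assumptions, deliberately loose, since the goal is to catch answers that are wildly wrong rather than slightly off:

```python
# Validation checking: plausibility tests when exact verification
# isn't possible.
def plausible_population(n):
    return 0 < n < 1_500_000_000      # no single country exceeds ~1.45B

def plausible_year(y):
    return 1000 <= y <= 2100

wild = plausible_population(9_000_000_000)   # exceeds Earth's population
sane = plausible_year(1969)
```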
The Verification Pipeline
A systematic four-step process that turns every answer into a testable hypothesis.
Generate Initial Answer
The model produces its best answer using standard reasoning — chain-of-thought, decomposition, or whatever approach fits the problem. This answer becomes the hypothesis to be tested.
Extract Verification Conditions
Analyze the original problem to identify every condition the answer must satisfy. For math: equations that must balance. For scheduling: constraints that must be met. For code: test cases that must pass. Turn abstract correctness into concrete, checkable conditions.
Execute Verification Checks
Systematically test the answer against each condition. Substitute values, check constraints, run sanity tests. Document each check as pass or fail with specific evidence.
Accept, Revise, or Regenerate
If all checks pass, accept the answer with confidence. If checks fail, either revise the specific failing component or regenerate from scratch with the failure information guiding the new attempt.
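The four pipeline steps compose into a loop. A minimal sketch, where `generate` stands in for any model call and each verification condition is a plain predicate:

```python
# Generate -> extract conditions -> check -> accept or regenerate.
def self_verify(generate, checks, problem, max_attempts=3):
    """Return (answer, verified). On failure, feed the failing checks
    back into the next generation attempt."""
    prompt = problem
    answer = None
    for _ in range(max_attempts):
        answer = generate(prompt)                           # step 1
        failures = [name for name, check in checks.items()  # steps 2-3
                    if not check(answer)]
        if not failures:
            return answer, True                             # step 4: accept
        prompt = (f"{problem}\nA previous answer, {answer!r}, failed "
                  f"these checks: {failures}. Solve again avoiding them.")
    return answer, False                                    # step 4: give up
```

The failure names flow back into the prompt, so a regeneration attempt knows exactly which conditions the last answer violated.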
See the Difference
Without Verification
Q: A store has a "buy 2, get 1 free" deal. If each item costs $15, how much do you pay for 7 items?
With "buy 2, get 1 free," every 3 items costs $30 (you pay for 2). For 7 items: 7 ÷ 3 = 2 groups with 1 remaining. Cost: 2 × $30 + $15 = $75.
Looks plausible, but is it correct? No verification step — the user must manually check.
With Self-Verification
Same question, with backward check:
Answer: $75
Verification:
• 7 items in groups of 3: [2 paid + 1 free], [2 paid + 1 free], [1 remaining]
• Free items: 2. Paid items: 5.
• Total: 5 × $15 = $75 ✔
• Validation check: $75 for 7 items at $15 each (full price $105) is a 29% discount, plausible for "buy 2 get 1 free," whose maximum discount is 33% on exact multiples of 3 ✔
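The same check can be run mechanically by recomputing the price a different way than the original grouping, counting free items directly:

```python
# Independent recomputation: count free items directly instead of
# reusing the group-of-three arithmetic from the original answer.
def deal_price(items, unit_price):
    free = items // 3              # one free item per full group of 3
    return (items - free) * unit_price

total = deal_price(7, 15)          # 2 free, 5 paid
full = 7 * 15                      # price without the deal
```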
Self-Verification in Action
"Write a SQL query to find customers who placed more than 3 orders in the last 30 days but have never returned an item."
Generated query:
SELECT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY c.customer_id, c.name
HAVING COUNT(o.order_id) > 3
Constraint 1: More than 3 orders? ✔ HAVING COUNT > 3
Constraint 2: In the last 30 days? ✔ WHERE clause filters dates
Constraint 3: Never returned an item? ✘ MISSING! No check against returns table.
Verdict: FAIL. Query needs a NOT EXISTS subquery or LEFT JOIN against the returns table to exclude customers with any return history.
Revised: add AND NOT EXISTS (SELECT 1 FROM returns r WHERE r.customer_id = c.customer_id) to the WHERE clause. (NOT IN also works, but silently drops every row if returns.customer_id ever contains NULL, so NOT EXISTS is the safer form.)
"A farmer has a fox, a chicken, and a sack of grain. He needs to cross a river in a boat that can only carry himself and one item. The fox will eat the chicken if left alone, and the chicken will eat the grain."
Proposed solution: 1) Take chicken across. 2) Return. 3) Take fox across. 4) Bring chicken back. 5) Take grain across. 6) Return. 7) Take chicken across.
After step 1: Left bank: fox, grain. Right: chicken. ✔ Fox doesn't eat grain.
After step 3: Left: grain. Right: farmer, fox, chicken. ✔ The farmer is present, so nothing gets eaten.
Danger check: if the farmer rowed back alone, the fox and chicken would be left together ✘. This is exactly why step 4 brings the chicken back.
After step 4: Left: farmer, chicken. Right: fox. ✔
After step 5: Left: chicken. Right: fox, grain. ✔ Fox doesn't eat grain.
After step 7: Left: empty. Right: fox, chicken, grain. ✔ All across safely.
Verdict: PASS. All constraints satisfied at every intermediate state.
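This state-by-state check is exactly the kind of verification that can be automated. A sketch that simulates each move and tests the forbidden pairs on whichever bank the farmer leaves behind:

```python
# Mechanical constraint check for the river-crossing solution: simulate
# every move and confirm no unattended bank ever holds a forbidden pair.
FORBIDDEN = [{"fox", "chicken"}, {"chicken", "grain"}]

def verify_crossing(moves):
    """moves: list of items the farmer carries (None = crosses alone)."""
    left, right = {"fox", "chicken", "grain"}, set()
    farmer_left = True
    for cargo in moves:
        src, dst = (left, right) if farmer_left else (right, left)
        if cargo is not None:
            src.remove(cargo)
            dst.add(cargo)
        farmer_left = not farmer_left
        unattended = right if farmer_left else left
        if any(pair <= unattended for pair in FORBIDDEN):
            return False                 # something gets eaten
    return not left                      # everything made it across

solution = ["chicken", None, "fox", "chicken", "grain", None, "chicken"]
```

Calling `verify_crossing(solution)` replays steps 1 through 7 and confirms every intermediate state is safe; a bad first move like carrying the fox fails immediately.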
"A car travels at 60 mph for 2.5 hours. How far does it travel in kilometers?"
Generated answer: 60 × 2.5 = 150 miles = 150 × 1.6 = 240 km
Backward check: 240 km ÷ 1.609 = 149.2 miles. 149.2 ÷ 2.5 hours = 59.7 mph ≈ 60 mph ✔
Precision check: a more precise conversion is 1.609 km/mile (exactly 1.609344), not 1.6. Precise answer: 150 × 1.609 = 241.4 km. The 1.6 approximation introduced a 0.6% error.
Validation check: ~240 km in 2.5 hours at highway speed — consistent with real-world driving. ✔
Verdict: PASS with note. Answer is approximately correct. For engineering contexts, recommend using the precise 1.609 conversion factor.
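The backward and precision checks for this problem amount to a few lines of arithmetic:

```python
# Backward check: invert the computed distance and confirm it
# reproduces the original speed.
KM_PER_MILE = 1.609

def implied_speed_mph(distance_km, hours):
    return distance_km / KM_PER_MILE / hours

precise = implied_speed_mph(241.4, 2.5)   # from the 1.609 conversion
rough = implied_speed_mph(240.0, 2.5)     # from the 1.6 approximation
```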
Verification Patterns
Three ways to integrate Self-Verification into your prompting workflow, from simple to robust.
Single-Turn
Include "After answering, verify your work by checking each constraint" in the original prompt. Simple and fast, but the model may skip or shortcut the verification when it's embedded in the same prompt.
Two-Turn
Generate the answer first, then send a separate prompt: "Given this problem and this answer, verify whether the answer is correct." The separation forces genuine re-examination rather than rubber-stamping.
Verify-and-Regenerate
If verification fails, feed the specific failure back as context for a new generation attempt. "Your answer failed because X — solve again avoiding this error." Most robust, catches the hardest errors.
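The two-turn pattern separates the two calls explicitly. A sketch where `ask_model` is a hypothetical stand-in for whatever LLM client you use:

```python
# Two-turn verification: one call to answer, a second call to verify.
# The verifier sees only the problem and the answer, not the reasoning
# that produced it, which discourages rubber-stamping.
def two_turn_verify(ask_model, problem):
    answer = ask_model(f"Solve this problem:\n{problem}")
    verdict = ask_model(
        "Given this problem and this answer, verify whether the answer "
        "is correct. Check each constraint and reply PASS or FAIL.\n\n"
        f"Problem: {problem}\nAnswer: {answer}"
    )
    return {"answer": answer, "verified": verdict.strip().startswith("PASS")}
```

For verify-and-regenerate, a FAIL verdict (with its evidence) would be appended to the original prompt for the next attempt.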
Perfect For
Where answers can be substituted back into equations to confirm correctness — the most natural fit for backward verification.
Scheduling, resource allocation, and configuration tasks where every constraint can be independently verified against the solution.
Code outputs that can be tested against requirements — run the code, check the results, verify edge cases.
Any problem with well-defined correctness criteria that can be checked independently of the generation process.
Skip It When
Creative writing and design opinions where “correct” is a matter of taste, not verifiable criteria.
Questions without definite verification criteria — there’s nothing concrete to check the answer against.
When verification requires the same knowledge that produced the error — the model can’t catch what it doesn’t know it got wrong.
Use Case Showcase
Mathematical Problem Solving
Plug answers back into original equations to catch arithmetic errors, sign mistakes, and misapplied formulas — the most natural fit for backward verification.
SQL and Database Queries
Check that every WHERE clause, JOIN condition, and GROUP BY column actually addresses a requirement from the original question. Catches the "forgot a constraint" error pattern.
Regex Pattern Matching
Test the generated regex against example inputs — both strings that should match and strings that shouldn't. Verification reveals over-matching and under-matching immediately.
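In practice this means keeping both match and no-match examples and testing the generated pattern against each set. The pattern and examples below are illustrative:

```python
import re

# Intended behaviour: match ISO dates (YYYY-MM-DD) and nothing else.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

should_match = ["2024-01-31", "1999-12-01"]
should_not_match = ["2024-1-31", "01-31-2024", "2024-01-31T09:00"]

over_matching = [s for s in should_not_match if pattern.fullmatch(s)]
under_matching = [s for s in should_match if not pattern.fullmatch(s)]
# Empty lists on both sides means the pattern verifies.
```

Note the use of `fullmatch` rather than `search`: a partial match against a bad input is precisely the over-matching this check is meant to expose.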
Meeting and Event Scheduling
Verify that proposed schedules satisfy every constraint: time zones, availability windows, duration requirements, room capacity, and buffer time between events.
Configuration Files
Validate that generated configs (YAML, JSON, TOML) meet all specified requirements: correct ports, proper environment variables, matching service dependencies, and valid syntax.
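A config check follows the same pattern: parse, then test each requirement by name. The keys and required values here are invented for illustration:

```python
# Verify a generated service config against its stated requirements.
config = {
    "port": 8080,
    "env": {"DATABASE_URL": "postgres://db:5432/app"},
    "depends_on": ["db"],
}

checks = {
    "port_in_valid_range": 1 <= config["port"] <= 65535,
    "database_url_present": bool(config["env"].get("DATABASE_URL")),
    "db_dependency_declared": "db" in config["depends_on"],
}
all_pass = all(checks.values())
```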
Legal and Compliance Checks
Verify that drafted policies or contract clauses satisfy all specified regulatory requirements — checking each compliance point as a constraint against the generated text.
Verification vs. Other Self-Correction
Self-Verification focuses on answer correctness — not quality or style. It answers "is this right?" while other frameworks ask "is this good enough?"
Self-Verification
Binary pass/fail checking — does the answer satisfy all constraints? Best for problems with definite right answers.
Self-Refine
Spectrum-based improvement — is the answer good enough, and how can it be better? Best for writing and creative tasks.
Self-Calibration
Confidence assessment — how certain is the model about its answer? Best for flagging uncertain responses before they cause problems.
Related Techniques
Catch Errors Before They Matter
Add verification steps to your prompts with our interactive tools, or explore more self-correction frameworks.