Self-Verification
Work backwards from the answer to catch errors — the mathematical proof-checker for AI reasoning.
Background: Self-Verification builds on backward reasoning and constraint-checking concepts from formal methods and mathematical proof theory. As a prompting technique, it gained traction in 2022-2023 as researchers demonstrated that explicitly asking LLMs to verify their own answers — by substituting solutions back into problems — significantly improved accuracy on math, logic, and constraint-satisfaction tasks.
Modern LLM Status: Self-Verification remains a valuable and practical prompting technique. While modern LLMs (Claude, GPT-4) show improved reasoning capabilities, they still benefit significantly from explicit verification prompts, especially on multi-step math problems and constraint-heavy tasks. Some models now perform implicit verification in their extended thinking modes, but explicit backward-checking prompts remain more reliable in 2025-2026.
The Backward Check
Self-Verification applies a principle every math teacher knows: checking your work is easier than doing it right the first time. After the model generates an answer, it reverses direction — plugging the answer back into the original problem to see if everything holds up.
This works because verification and generation take fundamentally different paths. Generating the right answer requires exploring a vast solution space, while verification asks a narrower, binary question: "Given this answer, does the original problem check out?" This asymmetry means a model that makes mistakes during generation can often catch those same mistakes during verification.
Finding a needle in a haystack is hard. But once someone hands you a needle, checking whether it came from that haystack is easy. Self-Verification exploits this fundamental asymmetry between search and validation.
Step 1: Generate an answer to the problem.
Step 2: Formulate verification conditions — what must be true if this answer is correct?
Step 3: Test each condition against the answer.
Step 4: If any condition fails, flag the error and regenerate.
Three Verification Strategies
Backward Verification
Substitute the answer back into the original problem and check if it satisfies all equations or conditions. This is the gold standard for math, logic puzzles, and constraint satisfaction problems — if x = 5, does 3x + 7 actually equal 22?
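In code, backward verification is just substitution. A minimal sketch in Python, using the 3x + 7 = 22 example above:

```python
# Backward verification: plug the candidate answer back into the
# original equation and check that both sides agree.
def verify_solution(lhs, rhs, candidate):
    """Return True if lhs(candidate) equals the required right-hand side."""
    return lhs(candidate) == rhs

equation = lambda x: 3 * x + 7               # left-hand side of 3x + 7 = 22
passes = verify_solution(equation, 22, 5)    # x = 5 gives 22, passes
fails = verify_solution(equation, 22, 6)     # x = 6 gives 25, fails
```

The check never re-derives x; it only confirms that the proposed value satisfies the original equation.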
Constraint Checking
Extract every explicit and implicit constraint from the problem, then systematically verify each one against the proposed answer. Catches partial solutions that satisfy some requirements but miss others — the "forgot about the edge case" problem.
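Constraint checking can be made mechanical by naming each requirement and testing it independently. A sketch, with an invented scheduling answer and invented constraint names:

```python
# Constraint checking: enumerate every requirement, then test each one
# against the proposed answer. This catches partial solutions that pass
# some checks but not others.
def check_all(answer, constraints):
    """Map each named constraint to a pass/fail result."""
    return {name: test(answer) for name, test in constraints.items()}

meeting = {"start": 14, "end": 15, "attendees": 10, "room_capacity": 8}
constraints = {
    "within_business_hours": lambda m: 9 <= m["start"] and m["end"] <= 17,
    "room_fits_attendees": lambda m: m["attendees"] <= m["room_capacity"],
}
results = check_all(meeting, constraints)
# within_business_hours passes, but room_fits_attendees fails:
# 10 attendees in a room for 8, the "forgot about the edge case" problem.
```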
Validation Checking
Apply common-sense reasonableness tests even when formal verification isn't possible. Does the population figure seem plausible? Is the date within a realistic range? Does the code compile? Catches wildly wrong answers that pass syntactic checks.
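Validation checks reduce to coarse range and sanity tests. The bounds below are illustrative assumptions, deliberately loose, since the goal is to catch answers that are wildly wrong rather than slightly off:

```python
# Validation checking: plausibility tests when exact verification
# isn't possible.
def plausible_population(n):
    return 0 < n < 1_500_000_000      # no single country exceeds ~1.45B

def plausible_year(y):
    return 1000 <= y <= 2100

wild = plausible_population(9_000_000_000)   # exceeds Earth's population
sane = plausible_year(1969)
```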
The Verification Pipeline
A systematic four-step process that turns every answer into a testable hypothesis.
Generate Initial Answer
The model produces its best answer using standard reasoning — chain-of-thought, decomposition, or whatever approach fits the problem. This answer becomes the hypothesis to be tested.
Extract Verification Conditions
Analyze the original problem to identify every condition the answer must satisfy. For math: equations that must balance. For scheduling: constraints that must be met. For code: test cases that must pass. Turn abstract correctness into concrete, checkable conditions.
Execute Verification Checks
Systematically test the answer against each condition. Substitute values, check constraints, run sanity tests. Document each check as pass or fail with specific evidence.
Accept, Revise, or Regenerate
If all checks pass, accept the answer with confidence. If checks fail, either revise the specific failing component or regenerate from scratch with the failure information guiding the new attempt.
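The four pipeline steps compose into a loop. A minimal sketch, where `generate` stands in for any model call and each verification condition is a plain predicate:

```python
# Generate -> extract conditions -> check -> accept or regenerate.
def self_verify(generate, checks, problem, max_attempts=3):
    """Return (answer, verified). On failure, feed the failing checks
    back into the next generation attempt."""
    prompt = problem
    answer = None
    for _ in range(max_attempts):
        answer = generate(prompt)                           # step 1
        failures = [name for name, check in checks.items()  # steps 2-3
                    if not check(answer)]
        if not failures:
            return answer, True                             # step 4: accept
        prompt = (f"{problem}\nA previous answer, {answer!r}, failed "
                  f"these checks: {failures}. Solve again avoiding them.")
    return answer, False                                    # step 4: give up
```

The failure names flow back into the prompt, so a regeneration attempt knows exactly which conditions the last answer violated.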
See the Difference
Without Verification
Q: A store has a "buy 2, get 1 free" deal. If each item costs $15, how much do you pay for 7 items?
With "buy 2, get 1 free," every 3 items costs $30 (you pay for 2). For 7 items: 7 ÷ 3 = 2 groups with 1 remaining. Cost: 2 × $30 + $15 = $75.
Looks plausible, but is it correct? No verification step — the user must manually check.
With Self-Verification
Same question, with backward check:
Answer: $75
Verification:
• 7 items in groups of 3: [2 paid + 1 free], [2 paid + 1 free], [1 remaining]
• Free items: 2. Paid items: 5.
• Total: 5 × $15 = $75 ✔
• Validation check: $75 for 7 items at $15 each (full price $105) is a 29% discount, plausible for "buy 2 get 1 free," whose maximum discount is 33% on exact multiples of 3 ✔
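The same check can be run mechanically by recomputing the price a different way than the original grouping, counting free items directly:

```python
# Independent recomputation: count free items directly instead of
# reusing the group-of-three arithmetic from the original answer.
def deal_price(items, unit_price):
    free = items // 3              # one free item per full group of 3
    return (items - free) * unit_price

total = deal_price(7, 15)          # 2 free, 5 paid
full = 7 * 15                      # price without the deal
```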
Self-Verification in Action
"Write a SQL query to find customers who placed more than 3 orders in the last 30 days but have never returned an item."
Generated query:
SELECT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY c.customer_id, c.name
HAVING COUNT(o.order_id) > 3
Constraint 1: More than 3 orders? ✔ HAVING COUNT > 3
Constraint 2: In the last 30 days? ✔ WHERE clause filters dates
Constraint 3: Never returned an item? ✘ MISSING! No check against returns table.
Verdict: FAIL. Query needs a NOT EXISTS subquery or LEFT JOIN against the returns table to exclude customers with any return history.
Revised: add AND NOT EXISTS (SELECT 1 FROM returns r WHERE r.customer_id = c.customer_id) to the WHERE clause. (NOT IN also works, but silently drops every row if returns.customer_id ever contains NULL, so NOT EXISTS is the safer form.)
"A farmer has a fox, a chicken, and a sack of grain. He needs to cross a river in a boat that can only carry himself and one item. The fox will eat the chicken if left alone, and the chicken will eat the grain."
Proposed solution: 1) Take chicken across. 2) Return. 3) Take fox across. 4) Bring chicken back. 5) Take grain across. 6) Return. 7) Take chicken across.
After step 1: Left bank: fox, grain. Right: chicken. ✔ Fox doesn't eat grain.
After step 3: Left: grain. Right: farmer, fox, chicken. ✔ The farmer is present, so nothing gets eaten.
Danger check: if the farmer rowed back alone, the fox and chicken would be left together ✘. This is exactly why step 4 brings the chicken back.
After step 4: Left: farmer, chicken. Right: fox. ✔
After step 5: Left: chicken. Right: fox, grain. ✔ Fox doesn't eat grain.
After step 7: Left: empty. Right: fox, chicken, grain. ✔ All across safely.
Verdict: PASS. All constraints satisfied at every intermediate state.
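This state-by-state check is exactly the kind of verification that can be automated. A sketch that simulates each move and tests the forbidden pairs on whichever bank the farmer leaves behind:

```python
# Mechanical constraint check for the river-crossing solution: simulate
# every move and confirm no unattended bank ever holds a forbidden pair.
FORBIDDEN = [{"fox", "chicken"}, {"chicken", "grain"}]

def verify_crossing(moves):
    """moves: list of items the farmer carries (None = crosses alone)."""
    left, right = {"fox", "chicken", "grain"}, set()
    farmer_left = True
    for cargo in moves:
        src, dst = (left, right) if farmer_left else (right, left)
        if cargo is not None:
            src.remove(cargo)
            dst.add(cargo)
        farmer_left = not farmer_left
        unattended = right if farmer_left else left
        if any(pair <= unattended for pair in FORBIDDEN):
            return False                 # something gets eaten
    return not left                      # everything made it across

solution = ["chicken", None, "fox", "chicken", "grain", None, "chicken"]
```

Calling `verify_crossing(solution)` replays steps 1 through 7 and confirms every intermediate state is safe; a bad first move like carrying the fox fails immediately.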
"A car travels at 60 mph for 2.5 hours. How far does it travel in kilometers?"
Generated answer: 60 × 2.5 = 150 miles = 150 × 1.6 = 240 km
Backward check: 240 km ÷ 1.609 = 149.2 miles. 149.2 ÷ 2.5 hours = 59.7 mph ≈ 60 mph ✔
Precision check: a more precise conversion is 1.609 km/mile (exactly 1.609344), not 1.6. Precise answer: 150 × 1.609 = 241.4 km. The 1.6 approximation introduced a 0.6% error.
Validation check: ~240 km in 2.5 hours at highway speed — consistent with real-world driving. ✔
Verdict: PASS with note. Answer is approximately correct. For engineering contexts, recommend using the precise 1.609 conversion factor.
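The backward and precision checks for this problem amount to a few lines of arithmetic:

```python
# Backward check: invert the computed distance and confirm it
# reproduces the original speed.
KM_PER_MILE = 1.609

def implied_speed_mph(distance_km, hours):
    return distance_km / KM_PER_MILE / hours

precise = implied_speed_mph(241.4, 2.5)   # from the 1.609 conversion
rough = implied_speed_mph(240.0, 2.5)     # from the 1.6 approximation
```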
Verification Patterns
Three ways to integrate Self-Verification into your prompting workflow, from simple to robust.
Single-Turn
Include "After answering, verify your work by checking each constraint" in the original prompt. Simple and fast, but the model may skip or shortcut the verification when it's embedded in the same prompt.
Two-Turn
Generate the answer first, then send a separate prompt: "Given this problem and this answer, verify whether the answer is correct." The separation forces genuine re-examination rather than rubber-stamping.
Verify-and-Regenerate
If verification fails, feed the specific failure back as context for a new generation attempt. "Your answer failed because X — solve again avoiding this error." Most robust, catches the hardest errors.
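The two-turn pattern separates the two calls explicitly. A sketch where `ask_model` is a hypothetical stand-in for whatever LLM client you use:

```python
# Two-turn verification: one call to answer, a second call to verify.
# The verifier sees only the problem and the answer, not the reasoning
# that produced it, which discourages rubber-stamping.
def two_turn_verify(ask_model, problem):
    answer = ask_model(f"Solve this problem:\n{problem}")
    verdict = ask_model(
        "Given this problem and this answer, verify whether the answer "
        "is correct. Check each constraint and reply PASS or FAIL.\n\n"
        f"Problem: {problem}\nAnswer: {answer}"
    )
    return {"answer": answer, "verified": verdict.strip().startswith("PASS")}
```

For verify-and-regenerate, a FAIL verdict (with its evidence) would be appended to the original prompt for the next attempt.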
Perfect For
Where answers can be substituted back into equations to confirm correctness — the most natural fit for backward verification.
Scheduling, resource allocation, and configuration tasks where every constraint can be independently verified against the solution.
Code outputs that can be tested against requirements — run the code, check the results, verify edge cases.
Any problem with well-defined correctness criteria that can be checked independently of the generation process.
Skip It When
Creative writing and design opinions where “correct” is a matter of taste, not verifiable criteria.
Questions without definite verification criteria — there’s nothing concrete to check the answer against.
When verification requires the same knowledge that produced the error — the model can’t catch what it doesn’t know it got wrong.
Use Case Showcase
Mathematical Problem Solving
Plug answers back into original equations to catch arithmetic errors, sign mistakes, and misapplied formulas — the most natural fit for backward verification.
SQL and Database Queries
Check that every WHERE clause, JOIN condition, and GROUP BY column actually addresses a requirement from the original question. Catches the "forgot a constraint" error pattern.
Regex Pattern Matching
Test the generated regex against example inputs — both strings that should match and strings that shouldn't. Verification reveals over-matching and under-matching immediately.
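In practice this means keeping both match and no-match examples and testing the generated pattern against each set. The pattern and examples below are illustrative:

```python
import re

# Intended behaviour: match ISO dates (YYYY-MM-DD) and nothing else.
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

should_match = ["2024-01-31", "1999-12-01"]
should_not_match = ["2024-1-31", "01-31-2024", "2024-01-31T09:00"]

over_matching = [s for s in should_not_match if pattern.fullmatch(s)]
under_matching = [s for s in should_match if not pattern.fullmatch(s)]
# Empty lists on both sides means the pattern verifies.
```

Note the use of `fullmatch` rather than `search`: a partial match against a bad input is precisely the over-matching this check is meant to expose.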
Meeting and Event Scheduling
Verify that proposed schedules satisfy every constraint: time zones, availability windows, duration requirements, room capacity, and buffer time between events.
Configuration Files
Validate that generated configs (YAML, JSON, TOML) meet all specified requirements: correct ports, proper environment variables, matching service dependencies, and valid syntax.
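A config check follows the same pattern: parse, then test each requirement by name. The keys and required values here are invented for illustration:

```python
# Verify a generated service config against its stated requirements.
config = {
    "port": 8080,
    "env": {"DATABASE_URL": "postgres://db:5432/app"},
    "depends_on": ["db"],
}

checks = {
    "port_in_valid_range": 1 <= config["port"] <= 65535,
    "database_url_present": bool(config["env"].get("DATABASE_URL")),
    "db_dependency_declared": "db" in config["depends_on"],
}
all_pass = all(checks.values())
```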
Legal and Compliance Checks
Verify that drafted policies or contract clauses satisfy all specified regulatory requirements — checking each compliance point as a constraint against the generated text.
Verification vs. Other Self-Correction
Self-Verification focuses on answer correctness — not quality or style. It answers "is this right?" while other frameworks ask "is this good enough?"
Self-Verification
Binary pass/fail checking — does the answer satisfy all constraints? Best for problems with definite right answers.
Self-Refine
Spectrum-based improvement — is the answer good enough, and how can it be better? Best for writing and creative tasks.
Self-Calibration
Confidence assessment — how certain is the model about its answer? Best for flagging uncertain responses before they cause problems.
Related Techniques
Catch Errors Before They Matter
Add verification steps to your prompts with our interactive tools, or explore more self-correction frameworks.