Self-Correction Framework

Reflexion

AI that learns from its mistakes. Reflexion stores verbal reflections on failures in memory, so the next attempt avoids the same pitfalls — without retraining the model.

Framework Context: 2023

Introduced: Reflexion was published by Shinn et al. in 2023 ("Reflexion: Language Agents with Verbal Reinforcement Learning", NeurIPS 2023). It introduced verbal reinforcement learning — where an AI agent stores natural-language reflections on failures in a memory buffer, then uses those reflections to improve subsequent attempts without any model retraining.

Modern LLM Status: Reflexion has deeply influenced modern AI agent architecture. The reflect-and-retry pattern with persistent memory is now a foundational design principle in production agent frameworks (LangChain, AutoGPT, Claude's agentic workflows). While not a built-in LLM feature, the concept is essential knowledge for anyone building multi-step AI agents in 2025-2026.

The Core Insight

Fail Once, Learn Forever

When a person fails at something, they don't just retry the exact same way. They think about what went wrong, form a lesson, and approach the problem differently next time. That learning persists.

Reflexion gives AI the same ability. When a task fails, instead of blindly retrying, the agent writes a verbal reflection — an explicit analysis of what went wrong and what to do differently. This reflection is stored in memory and injected into the next attempt's context.

The result: performance improves across attempts without changing the model's weights. It's in-context learning from failure.

Why Not Just Retry?

Simple retries suffer from the same blind spots — the model makes similar mistakes each time. Reflexion explicitly identifies the failure mode ("I forgot to handle edge cases") and addresses it in the next attempt. It's the difference between guessing randomly and learning systematically.
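As a minimal sketch of what "learning systematically" means in practice, stored reflections can be prepended to the retry prompt. The template and the `build_retry_prompt` name below are illustrative, not from the paper:

```python
def build_retry_prompt(task: str, reflections: list[str]) -> str:
    """Prepend lessons from earlier failed attempts to the task prompt."""
    if not reflections:
        return task  # first attempt: no memory yet
    lessons = "\n".join(f"- {r}" for r in reflections)
    return (
        "You previously failed this task. Lessons from earlier attempts:\n"
        f"{lessons}\n\n"
        f"Now try again:\n{task}"
    )
```

A blind retry is equivalent to always returning `task` unchanged; the reflections are the only thing that makes attempt 2 different from attempt 1.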

The Three Components

How the Reflexion architecture enables learning from failure

The Reflexion Loop

From failure to insight to improved performance

1

Attempt the Task

The Actor executes the task using its current knowledge plus any stored reflections from previous attempts. On the first run, there's no memory context.

Example

Task: "Write a function that finds the longest common subsequence of two strings."

2

Evaluate the Result

The Evaluator checks whether the attempt succeeded. For coding tasks, this means running test cases. For reasoning tasks, it might mean checking the answer against a known result. The evaluation produces a pass/fail signal.

Result

Test results: 3 of 5 tests pass. Fails on empty strings and strings with no common characters.

3

Generate Reflection

On failure, the Self-Reflection component analyzes what went wrong. It produces a natural language explanation — not just "it failed" but specifically why and how to fix it.

Reflection

"My implementation didn't handle edge cases: empty strings should return an empty string, and when no common characters exist, the result should also be empty. I need to add base case checks at the start of the function."

4

Store in Memory

The reflection is added to a persistent memory store. All future attempts on this task (or similar tasks) will receive these reflections as part of their input context.

Memory Entry

Stored: "Always handle edge cases first — empty inputs, null values, and degenerate cases where the expected result is trivial."

5

Retry with Memory

The Actor attempts the task again, but now with the reflection in its context. It reads the lessons learned and actively avoids the previous failure mode. This cycle can repeat until the task succeeds.

Result

Attempt 2: Adds base case handling for empty strings. All 5 tests pass.
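The five-step loop above can be sketched in a few lines, assuming hypothetical `actor`, `evaluator`, and `reflect` callables that stand in for LLM calls and test runs:

```python
def reflexion_loop(task, actor, evaluator, reflect, max_attempts=5):
    """Minimal Reflexion loop: attempt, evaluate, reflect, retry with memory."""
    memory = []  # persistent store of natural-language reflections
    for attempt in range(1, max_attempts + 1):
        # 1. Actor attempts the task with all stored reflections in context
        output = actor(task, reflections=memory)
        # 2. Evaluator produces a pass/fail signal (e.g. a test suite run)
        passed, feedback = evaluator(task, output)
        if passed:
            return output, attempt
        # 3 + 4. Generate a verbal reflection on the failure and store it
        memory.append(reflect(task, output, feedback))
        # 5. Loop back: retry with the enriched memory
    return None, max_attempts
```

Note that the model weights never change; only `memory` does, and it is injected into every subsequent attempt's context.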

See the Difference

Why reflective retrying beats blind retrying

Blind Retry

Attempt 1

Writes API integration. Fails: doesn't handle rate limiting.

Attempt 2

Writes API integration again. Fails: still doesn't handle rate limiting. Makes the same class of mistake because nothing was learned.

Same mistakes, no learning between attempts
VS

Reflexion

Attempt 1 + Reflection

Fails: no rate limiting. Reflection: "I need to implement exponential backoff and respect Retry-After headers for any external API call."

Attempt 2 (with memory)

Reads stored reflection. Implements retry logic with exponential backoff and header checking from the start. Passes all tests.

Learns from failure, avoids repeating mistakes
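The behavior the stored reflection prescribes can be sketched as follows. `RateLimitError` and the `call_api` callable are illustrative stand-ins for a real client's rate-limit exception and request function:

```python
import time

class RateLimitError(Exception):
    """Hypothetical error raised when the API returns HTTP 429."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after  # seconds, from the Retry-After header

def call_with_backoff(call_api, max_retries=5, base_delay=1.0):
    """Retry with exponential backoff, honoring Retry-After when present."""
    for attempt in range(max_retries):
        try:
            return call_api()
        except RateLimitError as err:
            # Prefer the server's Retry-After hint; otherwise back off 1s, 2s, 4s, ...
            delay = err.retry_after if err.retry_after is not None else base_delay * (2 ** attempt)
            time.sleep(delay)
    raise RuntimeError("still rate limited after all retries")
```

The point of Reflexion is that this pattern enters the agent's memory once, after the first failure, and is applied from the start on every later attempt.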


Reflexion in Action

Multi-attempt scenarios showing how reflection drives improvement

Attempt 1 — Failed

Task: Write a function to check if a string is a palindrome.

Code: Simply reverses the string and compares.

Tests fail on: "A man, a plan, a canal: Panama" (spaces, punctuation, mixed case)

Reflection + Attempt 2 — Passed

Reflection: "My solution failed because I didn't normalize the input. For palindrome checks, I need to: 1) Convert to lowercase, 2) Remove non-alphanumeric characters, 3) Then compare. String processing tasks almost always need input normalization."

Attempt 2: Adds normalization step — converts to lowercase, strips non-alphanumeric characters, then compares. All tests pass.
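A corrected second attempt along these lines might look like:

```python
import re

def is_palindrome(s: str) -> bool:
    """Palindrome check with normalization: lowercase, alphanumeric only."""
    normalized = re.sub(r"[^a-z0-9]", "", s.lower())
    return normalized == normalized[::-1]

# is_palindrome("A man, a plan, a canal: Panama") → True
```

The normalization step is exactly what the stored reflection called for; attempt 1 was this function without the `re.sub` line.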

Attempt 1 — Wrong Answer

Question: "A store sells apples for $2 each. With a 'buy 3, get 1 free' deal, how much do 10 apples cost?"

Answer: "$20" (simply multiplied 10 × $2, ignoring the deal)

Reflection + Attempt 2 — Correct

Reflection: "I ignored the discount deal entirely. With 'buy 3 get 1 free', every 4 apples only costs 3 × $2 = $6. For 10 apples: 2 full groups of 4 (cost: $12) + 2 remaining apples (cost: $4) = $16. I need to always check for special pricing before doing simple multiplication."

Attempt 2: Correctly calculates $16 by applying the deal to groups of 4.
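The corrected calculation, generalized to any quantity, fits in a few lines with `divmod` (the function name and defaults are illustrative):

```python
def price_with_deal(quantity, unit_price=2, group_size=4, paid_per_group=3):
    """'Buy 3, get 1 free': each full group of 4 apples is billed as 3."""
    full_groups, remainder = divmod(quantity, group_size)
    return (full_groups * paid_per_group + remainder) * unit_price

# price_with_deal(10) → 16: two groups of 4 cost $12, plus 2 apples at $2
```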

Attempt 1 — Incomplete

Task: "Compare the environmental impact of electric vs gas vehicles."

Response: Only discusses tailpipe emissions, concluding EVs have zero environmental impact.

Reflection + Attempt 2 — Comprehensive

Reflection: "My analysis was one-dimensional. I only considered tailpipe emissions. A thorough comparison needs to include: battery manufacturing emissions, electricity source (coal vs renewable), battery disposal/recycling, and total lifecycle analysis. Comparative analyses should always consider the full lifecycle, not just the most visible factor."

Attempt 2: Provides full lifecycle comparison covering manufacturing, operation, electricity source, battery disposal, and total lifecycle CO2 — a much more nuanced and accurate analysis.

When to Use Reflexion

Best for tasks with clear success/failure signals and multiple attempts

Perfect For

Code Generation

Test suites provide perfect pass/fail signals. Reflections help the agent fix bugs systematically rather than randomly.

Sequential Decision Making

Tasks where the agent takes multiple steps and needs to learn which strategies work in which situations.

Multi-Attempt Workflows

Any scenario where the agent gets multiple tries and can benefit from accumulating knowledge across attempts.

Complex Problem Solving

Problems where the solution space is large and systematic elimination of wrong approaches accelerates finding the right one.

Skip It When

Single-Shot Tasks

When you only get one attempt, there's no opportunity for the reflection loop to provide value.

No Clear Evaluation Signal

Reflexion needs a way to determine success or failure. Subjective tasks without objective metrics are poor fits.

Simple Tasks

If the task is straightforward enough that the first attempt usually succeeds, the overhead of reflection isn't worth it.

Use Cases

Where Reflexion delivers the most value

Automated Coding

Generate code, run tests, reflect on failures, and iterate until all tests pass — the ideal Reflexion use case.

Question Answering

When answers can be verified against ground truth, reflect on wrong answers to improve reasoning strategies.

Debugging Agents

AI agents that debug systems can store reflections about what approaches worked for similar issues.

Decision-Making Agents

Agents navigating complex environments learn which actions lead to dead ends and adapt their strategy.

Data Pipeline Design

Iteratively build and test data transformations, learning from validation errors to build correct pipelines.

Prompt Engineering

Iteratively refine prompts by reflecting on which formulations produce better outputs and why.

Where Reflexion Fits

Reflexion adds memory and learning to the self-correction family

Self-Refine (Single Session): Improve within one try
CRITIC (Tool Check): Verify with tools
CoVe (Fact Chain): Independent verification
Reflexion (Memory Learning): Learn across attempts

The Key Difference

Self-Refine improves a single output within one session. CRITIC and CoVe verify factual accuracy. Reflexion is unique because it creates persistent memory — lessons learned from failure carry forward to future attempts, making the agent progressively smarter without retraining.
