Code Reasoning Technique

Chain of Code (CoC)

Some reasoning tasks need real computation. Others need language understanding. Chain of Code handles both — interweaving actual code execution with language model simulation to reason across executable and semantic boundaries in a single unified process.

Technique Context: 2023

Introduced: Chain of Code was published in 2023, extending the Program of Thoughts approach by introducing the “LMulator” concept. The key innovation is a hybrid execution model: when code is directly executable (arithmetic, data processing, sorting), it runs in a real interpreter. When the task is semantic (sentiment analysis, commonsense reasoning), the language model simulates what the code would return. This interleaving achieved 84% on BIG-Bench Hard, a +12% improvement over standard Chain-of-Thought prompting.

Modern LLM Status: Chain of Code’s hybrid approach — mixing real code execution with language model simulation — has become a foundational pattern in 2026. Modern coding assistants routinely interweave executable and non-executable reasoning. The LMulator concept anticipated how modern agents handle tasks that partially require computation and partially require language understanding. Today’s AI systems like Claude, GPT-4, and Gemini naturally blend code execution with natural language reasoning in their tool-use capabilities.

The Core Insight

Code That Thinks, Language That Computes

Traditional code execution fails when reasoning requires common sense or world knowledge. Traditional language reasoning fails when problems require precise computation. Chain of Code bridges this gap by creating a seamless pipeline where the model writes code as its reasoning medium, then selectively executes or simulates each step depending on whether the operation is computable or semantic.

The LMulator is the breakthrough. When the model encounters a line of code that a Python interpreter cannot run — like is_sarcastic("Oh great, another meeting") — the language model steps in as a simulated interpreter, returning what the function would produce based on its understanding of language. The real interpreter handles arithmetic, sorting, and data manipulation. The LMulator handles meaning, context, and judgment.

Think of it as a relay race between a calculator and a poet. The calculator handles the numbers; the poet handles the nuance. Together, they solve problems neither could tackle alone.

Why Hybrid Execution Outperforms Pure Approaches

Pure code execution (Program of Thoughts) breaks down when tasks require semantic understanding — you cannot write a Python function that genuinely determines if a statement is ironic. Pure language reasoning (Chain-of-Thought) introduces arithmetic errors and logical inconsistencies. Chain of Code’s interleaving approach routes each sub-problem to the right engine: computation to the interpreter, meaning to the model. This selective routing is why CoC achieves +12% over CoT on BIG-Bench Hard.

The Chain of Code Process

Four stages from problem to hybrid-executed solution

1

Generate Code as Reasoning

The model writes code that encodes the full reasoning process — not just the computable parts, but also semantic operations expressed as function calls. The code serves as a structured reasoning scaffold, making every step explicit and ordered.

Example

Given “How many of these items are edible: a rock, an apple, a car tire, bread, a pencil?” the model writes code like: items = ["rock", "apple", "car tire", "bread", "pencil"]; edible = [x for x in items if is_edible(x)]; count = len(edible)
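The generated program can be sketched as below. It is held as a string because `is_edible` has no Python definition, so a plain interpreter cannot run it yet; the variable names follow the example above and the overall framing is illustrative, not code from the paper.

```python
# Hypothetical CoC-style program a model might emit for the edible-items
# question. Stored as text: is_edible is a semantic call with no Python
# definition, so a bare interpreter would raise NameError on line 2.
generated_code = '''
items = ["rock", "apple", "car tire", "bread", "pencil"]
edible = [x for x in items if is_edible(x)]  # semantic step, undefined in Python
count = len(edible)                          # computable step
'''
print("semantic call present:", "is_edible" in generated_code)
```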

2

Attempt Real Execution

The generated code is sent to an actual Python interpreter. Lines that are purely computational — arithmetic, list operations, string manipulation — execute normally and return real results. This ensures mathematical precision that language models alone cannot guarantee.

Example

The interpreter parses the list comprehension and could execute len(edible) without trouble, but it raises an error on is_edible(x) because that function is not defined in Python — it requires world knowledge.

3

LMulator Simulates Semantic Steps

When the interpreter encounters a function it cannot execute — one that requires semantic understanding — the language model acts as an “LMulator” (LM + emulator). It simulates what the function would return based on its knowledge, providing the result as if the code had run successfully.

Example

The LMulator processes is_edible("rock") → False, is_edible("apple") → True, is_edible("car tire") → False, is_edible("bread") → True, is_edible("pencil") → False.
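An LMulator stand-in can be sketched as follows. A real system would prompt the language model here and parse its answer back into the program state; the lookup table below is our own hand-written assumption so the routing can be shown offline.

```python
# Stand-in for the LMulator: in Chain of Code, the LM would be asked
# "what would is_edible(item) return?" — here a fixed table plays that role.
WORLD_KNOWLEDGE = {
    "rock": False, "apple": True, "car tire": False,
    "bread": True, "pencil": False,
}

def lmulate_is_edible(item):
    # Returns the simulated result as if the code had run successfully.
    return WORLD_KNOWLEDGE[item]

print(lmulate_is_edible("apple"), lmulate_is_edible("rock"))
```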

4

Combine Results into Final Answer

The simulated semantic results are fed back into the code pipeline. The interpreter completes the remaining computation using the LMulator’s outputs, producing a final answer that combines the precision of code execution with the understanding of language. Always verify AI-generated results independently.

Example

With the LMulator results injected, edible = ["apple", "bread"] and count = len(edible) → 2. Final answer: “2 of the 5 items are edible: apple and bread.”
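The full pass can be sketched end to end: the interpreter owns the control flow, and any call it cannot resolve is delegated to a simulated-LM fallback. The simulate_lm answer table below is a hand-written assumption standing in for a real model call.

```python
# End-to-end sketch of one Chain of Code pass over the edible-items task.
def simulate_lm(fn_name, arg):
    # Assumption: fixed answers replacing a genuine LMulator query.
    answers = {("is_edible", "rock"): False, ("is_edible", "apple"): True,
               ("is_edible", "car tire"): False, ("is_edible", "bread"): True,
               ("is_edible", "pencil"): False}
    return answers[(fn_name, arg)]

def is_edible(x):
    # Semantic function: resolved by the simulated LM, not by Python logic.
    return simulate_lm("is_edible", x)

items = ["rock", "apple", "car tire", "bread", "pencil"]
edible = [x for x in items if is_edible(x)]   # interpreter + LMulator
count = len(edible)                           # pure interpreter
print(f"{count} of the {len(items)} items are edible: {', '.join(edible)}")
```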

See the Difference

Why hybrid code-language reasoning outperforms either approach alone

Chain-of-Thought Only

Prompt

A store has 47 apples. A customer buys 13, another returns 5, and a third buys half of what remains. Also, which of these fruits are tropical: apple, mango, strawberry, papaya? How many tropical fruits multiplied by remaining apples gives what total?

Response

47 minus 13 is 34, plus 5 is 39, half is about 19. Tropical fruits are mango and papaya, so 2. 2 times 19 is 38. (Note: always verify AI calculations independently.)

Arithmetic imprecise (“about 19”), semantic reasoning unstructured
VS

Chain of Code

Hybrid Reasoning

Code: apples = 47 - 13 + 5; remaining = apples // 2 → Interpreter: remaining = 19 (exact integer division).
Semantic: is_tropical("mango") → LMulator: True; is_tropical("papaya") → True; is_tropical("apple") → False; is_tropical("strawberry") → False.
Code: total = len(tropical) * remaining → Interpreter: 2 * 19 = 38.

Final Answer

Remaining apples: 19 (exact). Tropical fruits: mango and papaya (2). Total: 2 × 19 = 38. Each step computed or simulated with full traceability. Verify these results against your own knowledge before relying on them.

Precise computation, structured semantic reasoning, fully traceable
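The Chain of Code side of this comparison can be sketched directly: exact arithmetic in the interpreter, with a lookup table as an assumed stand-in for the LMulator's tropical-fruit judgments.

```python
# Hybrid sketch of the apples-and-tropical-fruits problem.
TROPICAL = {"mango": True, "papaya": True, "apple": False, "strawberry": False}

def is_tropical(fruit):
    # Semantic step: would be LMulated in a real CoC run.
    return TROPICAL[fruit]

apples = 47 - 13 + 5                      # interpreter: 39
remaining = apples // 2                   # exact integer division: 19
fruits = ["apple", "mango", "strawberry", "papaya"]
tropical = [f for f in fruits if is_tropical(f)]
total = len(tropical) * remaining         # interpreter: 2 * 19
print(remaining, tropical, total)
```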


Chain of Code in Action

See how hybrid code-language execution tackles mixed reasoning tasks

Task

“Sort these activities by how physically demanding they are: reading a book, running a marathon, cooking dinner, climbing Mount Everest, walking to the mailbox. Then calculate the median index position.”

Chain of Code Execution

Semantic (LMulator): physical_score("reading a book") → 1, physical_score("walking to the mailbox") → 2, physical_score("cooking dinner") → 3, physical_score("running a marathon") → 4, physical_score("climbing Mount Everest") → 5.

Code (Interpreter): sorted_activities = sorted(activities, key=physical_score); median_idx = len(sorted_activities) // 2 → median_idx = 2.

Final answer: Sorted order: reading, walking, cooking, marathon, Everest. Median position: index 2 (cooking dinner). Remember to verify AI-generated rankings reflect your own judgment of physical difficulty.
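This task can be sketched with the semantic scores hand-assigned, standing in for the LMulator's physical_score judgments, while the sort and median index run in the real interpreter.

```python
# Activity-sorting sketch: SCORES replaces LMulator calls to physical_score.
SCORES = {"reading a book": 1, "walking to the mailbox": 2,
          "cooking dinner": 3, "running a marathon": 4,
          "climbing Mount Everest": 5}

activities = ["reading a book", "running a marathon", "cooking dinner",
              "climbing Mount Everest", "walking to the mailbox"]
sorted_activities = sorted(activities, key=SCORES.get)  # interpreter step
median_idx = len(sorted_activities) // 2                # 5 // 2 = 2
print(median_idx, sorted_activities[median_idx])
```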

Task

“Given these 5 product reviews, count how many are positive, how many are negative, and calculate the percentage of positive reviews. Reviews: (1) ‘Absolutely love it!’ (2) ‘Terrible quality, broke on day one.’ (3) ‘Best purchase this year.’ (4) ‘Would not recommend to anyone.’ (5) ‘Solid product, works as advertised.’”

Chain of Code Execution

Semantic (LMulator): sentiment("Absolutely love it!") → "positive"; sentiment("Terrible quality, broke on day one.") → "negative"; sentiment("Best purchase this year.") → "positive"; sentiment("Would not recommend to anyone.") → "negative"; sentiment("Solid product, works as advertised.") → "positive".

Code (Interpreter): positive = 3; negative = 2; pct = (positive / 5) * 100 → 60.0.

Final answer: 3 positive reviews, 2 negative reviews, 60% positive rate. Sentiment classification is subjective — always verify AI sentiment analysis against your own reading of the text.
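The review task can be sketched with a deliberately crude keyword heuristic standing in for the LMulator's sentiment judgments (a real run would ask the model), while the counts and percentage are pure interpreter work.

```python
# Review-counting sketch: sentiment() is an assumed stand-in heuristic,
# not a real classifier; the aggregation below is exact arithmetic.
def sentiment(review):
    negative_cues = ("terrible", "broke", "would not recommend")
    return "negative" if any(c in review.lower() for c in negative_cues) else "positive"

reviews = ["Absolutely love it!", "Terrible quality, broke on day one.",
           "Best purchase this year.", "Would not recommend to anyone.",
           "Solid product, works as advertised."]
labels = [sentiment(r) for r in reviews]    # semantic (simulated)
positive = labels.count("positive")         # interpreter
negative = labels.count("negative")
pct = positive / len(reviews) * 100
print(positive, negative, pct)
```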

Task

“From this list of cities — Tokyo, Paris, Sydney, Cairo, Toronto — identify which are in the Northern Hemisphere, sort them alphabetically, and report the count and sorted list.”

Chain of Code Execution

Semantic (LMulator): is_northern("Tokyo") → True; is_northern("Paris") → True; is_northern("Sydney") → False; is_northern("Cairo") → True; is_northern("Toronto") → True.

Code (Interpreter): northern = [c for c in cities if is_northern(c)]; northern.sort(); count = len(northern) → northern = ["Cairo", "Paris", "Tokyo", "Toronto"], count = 4.

Final answer: 4 cities in the Northern Hemisphere: Cairo, Paris, Tokyo, Toronto (alphabetical). Sydney is in the Southern Hemisphere. Geographic classifications should be verified against authoritative sources.
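The cities task can be sketched the same way: hemisphere membership is semantic knowledge, here a lookup table standing in for LMulator answers, while the filter, sort, and count run in the interpreter.

```python
# Northern-hemisphere sketch: NORTHERN replaces LMulator calls to is_northern.
NORTHERN = {"Tokyo": True, "Paris": True, "Sydney": False,
            "Cairo": True, "Toronto": True}

cities = ["Tokyo", "Paris", "Sydney", "Cairo", "Toronto"]
northern = sorted(c for c in cities if NORTHERN[c])  # filter + sort
count = len(northern)                                # interpreter
print(count, northern)
```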

When to Use Chain of Code

Best for tasks that mix computation with semantic understanding

Perfect For

Mixed Reasoning Tasks

Problems that require both precise computation (math, counting, sorting) and semantic judgment (classification, sentiment, common sense) in a single pipeline.

BIG-Bench Hard Problems

Complex benchmark tasks that stump pure CoT — CoC achieved 84% accuracy on BIG-Bench Hard by combining the strengths of both reasoning modes.

Structured Data with Semantic Filtering

When you need to filter, sort, or aggregate data based on criteria that require world knowledge rather than simple value comparisons.

Agent Pipelines

Building AI agents that need to decide dynamically whether to execute code or reason linguistically at each step of a multi-step task.

Skip It When

Purely Computational Tasks

If the entire problem is executable code (math, data transforms), standard Program of Thoughts or PAL is simpler and sufficient.

Purely Creative Tasks

Writing, brainstorming, and open-ended generation tasks that involve no computation — the code execution layer adds overhead without benefit.

No Code Execution Environment

When you lack access to a code interpreter, CoC loses its primary advantage — use Chain-of-Thought instead for language-only environments.

Use Cases

Where Chain of Code delivers the most value

Document Classification with Stats

Classify documents by topic using semantic understanding, then compute distribution statistics, percentages, and trends using precise arithmetic — all in one pass.

Survey Response Analysis

Parse free-text survey responses for sentiment and themes (semantic), then aggregate counts, compute averages, and generate summary statistics (computational).

Scientific Data Curation

Determine which experimental observations are “anomalous” using domain knowledge (semantic), then apply statistical outlier detection on the remaining valid data (computational).

E-Commerce Product Ranking

Evaluate product descriptions for relevance to a user query (semantic), then score and rank results using weighted algorithms (computational).

Compliance Checking

Interpret regulatory requirements in natural language (semantic), then verify structured data records against those requirements using formal logic checks (computational).

Financial Report Parsing

Extract qualitative insights from earnings call transcripts (semantic), then combine with numerical financial data for ratio calculations and trend analysis (computational).

Where Chain of Code Fits

CoC unifies the code execution and language reasoning lineages

Program of Thoughts (Code Only): Pure code generation and execution
PAL (Code + Interpreter): Offload execution to Python
Chain of Code (Hybrid Execution): Code + LMulator for semantic tasks
Modern Agents (Dynamic Tool Use): Seamless code and language blending

The LMulator Pattern Today

The LMulator concept — where a language model simulates code execution for operations that require world knowledge — has evolved into the broader “tool use” paradigm in modern AI. Today’s agents dynamically decide whether to run code, call an API, search the web, or reason linguistically. Chain of Code was one of the first techniques to formalize this selective routing, making it a conceptual ancestor of modern agentic architectures.

Blend Code and Language Reasoning

Try Chain of Code’s hybrid approach on your own mixed reasoning tasks or explore more code reasoning techniques.