Chain of Code (CoC)
Some reasoning tasks need real computation. Others need language understanding. Chain of Code handles both — interweaving actual code execution with language model simulation to reason across executable and semantic boundaries in a single unified process.
Introduced: Chain of Code was published in 2023, extending the Program of Thoughts approach by introducing the “LMulator” concept. The key innovation is a hybrid execution model: when code is directly executable (arithmetic, data processing, sorting), it runs in a real interpreter. When the task is semantic (sentiment analysis, commonsense reasoning), the language model simulates what the code would return. This interleaving achieved 84% on BIG-Bench Hard, a +12% improvement over standard Chain-of-Thought prompting.
Modern LLM Status: Chain of Code’s hybrid approach — mixing real code execution with language model simulation — has become a foundational pattern as of 2026. Modern coding assistants routinely interweave executable and non-executable reasoning. The LMulator concept anticipated how modern agents handle tasks that need computation for some steps and language understanding for others. Today’s AI systems like Claude, GPT-4, and Gemini naturally blend code execution with natural language reasoning in their tool-use capabilities.
Code That Thinks, Language That Computes
Traditional code execution fails when reasoning requires common sense or world knowledge. Traditional language reasoning fails when problems require precise computation. Chain of Code bridges this gap by creating a seamless pipeline where the model writes code as its reasoning medium, then selectively executes or simulates each step depending on whether the operation is computable or semantic.
The LMulator is the breakthrough. When the model encounters a line of code that a Python interpreter cannot run — like is_sarcastic("Oh great, another meeting") — the language model steps in as a simulated interpreter, returning what the function would produce based on its understanding of language. The real interpreter handles arithmetic, sorting, and data manipulation. The LMulator handles meaning, context, and judgment.
Think of it as a relay race between a calculator and a poet. The calculator handles the numbers; the poet handles the nuance. Together, they solve problems neither could tackle alone.
Pure code execution (Program of Thoughts) breaks down when tasks require semantic understanding — you cannot write a Python function that genuinely determines if a statement is ironic. Pure language reasoning (Chain-of-Thought) introduces arithmetic errors and logical inconsistencies. Chain of Code’s interleaving approach routes each sub-problem to the right engine: computation to the interpreter, meaning to the model. This selective routing is why CoC achieves +12% over CoT on BIG-Bench Hard.
The Chain of Code Process
Four stages from problem to hybrid-executed solution
Generate Code as Reasoning
The model writes code that encodes the full reasoning process — not just the computable parts, but also semantic operations expressed as function calls. The code serves as a structured reasoning scaffold, making every step explicit and ordered.
Given “How many of these items are edible: a rock, an apple, a car tire, bread, a pencil?” the model writes code like: items = ["rock", "apple", "car tire", "bread", "pencil"]; edible = [x for x in items if is_edible(x)]; count = len(edible)
Attempt Real Execution
The generated code is sent to an actual Python interpreter. Lines that are purely computational — arithmetic, list operations, string manipulation — execute normally and return real results. This ensures mathematical precision that language models alone cannot guarantee.
The interpreter executes len(edible) and list comprehension syntax correctly, but hits an error on is_edible(x) because that function is not defined in Python — it requires world knowledge.
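This failure mode is easy to reproduce. A minimal sketch (the function name is_edible comes from the example above; the error-capturing scaffolding is illustrative):

```python
items = ["rock", "apple", "car tire", "bread", "pencil"]

try:
    # The list comprehension itself is valid Python, but the semantic
    # helper is_edible is not defined anywhere, so the real interpreter
    # raises a NameError as soon as it is called.
    edible = [x for x in items if is_edible(x)]
    failure = None
except NameError as err:
    failure = str(err)

print(failure)  # name 'is_edible' is not defined
```

This is precisely the signal Chain of Code uses to hand the line off to the LMulator.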
LMulator Simulates Semantic Steps
When the interpreter encounters a function it cannot execute — one that requires semantic understanding — the language model acts as an “LMulator” (LM + emulator). It simulates what the function would return based on its knowledge, providing the result as if the code had run successfully.
The LMulator processes is_edible("rock") → False, is_edible("apple") → True, is_edible("car tire") → False, is_edible("bread") → True, is_edible("pencil") → False.
Combine Results into Final Answer
The simulated semantic results are fed back into the code pipeline. The interpreter completes the remaining computation using the LMulator’s outputs, producing a final answer that combines the precision of code execution with the understanding of language. Always verify AI-generated results independently.
With the LMulator results injected, edible = ["apple", "bread"] and count = len(edible) → 2. Final answer: “2 of the 5 items are edible: apple and bread.”
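The four stages above can be sketched as a tiny driver. This is a minimal illustration, not the paper's implementation: in a real system the semantic stub would be an LLM call, and the catch-and-delegate loop would be finer-grained than a single exec; here the LMulator's answers are hardcoded for the edible-items example.

```python
def run_chain_of_code(code, semantic_stubs):
    """Execute generated code, resolving undefined semantic helpers
    via stubbed 'LMulator' answers. Computational lines run in the
    real Python interpreter; semantic calls hit the stubs."""
    env = dict(semantic_stubs)  # LMulator-backed names go into scope
    exec(code, env)             # everything else executes for real
    return env

# Stage 1: the model's generated reasoning code.
generated = """
items = ["rock", "apple", "car tire", "bread", "pencil"]
edible = [x for x in items if is_edible(x)]
count = len(edible)
"""

# Stage 3: hardcoded stand-in for the LM simulating is_edible(x).
stubs = {"is_edible": lambda x: x in {"apple", "bread"}}

# Stages 2 and 4: execute with semantic results injected.
result = run_chain_of_code(generated, stubs)
print(result["edible"], result["count"])  # ['apple', 'bread'] 2
```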
See the Difference
Why hybrid code-language reasoning outperforms either approach alone
Chain-of-Thought Only
A store has 47 apples. A customer buys 13, another returns 5, and a third buys half of what remains. Also, which of these fruits are tropical: apple, mango, strawberry, papaya? How many tropical fruits multiplied by remaining apples gives what total?
47 minus 13 is 34, plus 5 is 39, and half of 39 is about 19. Tropical fruits are mango and papaya, so 2. 2 times 19 is 38. Notice the fuzziness: “about 19” glosses over whether half of 39 is 19 or 19.5, exactly the kind of arithmetic drift that pure language reasoning introduces. (Note: always verify AI calculations independently.)
Chain of Code
Code: apples = 47 - 13 + 5; remaining = apples // 2 → Interpreter: remaining = 19 (exact integer division).
Semantic: is_tropical("mango") → LMulator: True; is_tropical("papaya") → True; is_tropical("apple") → False; is_tropical("strawberry") → False.
Code: total = len(tropical) * remaining → Interpreter: 2 * 19 = 38.
Remaining apples: 19 (exact). Tropical fruits: mango and papaya (2). Total: 2 × 19 = 38. Each step computed or simulated with full traceability. Verify these results against your own knowledge before relying on them.
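The Chain of Code side of this comparison, written out as runnable Python. The is_tropical lookup stands in for LMulator simulation; in a live system those booleans would come from the language model:

```python
# Stand-in for the LMulator's semantic judgments.
tropical_truth = {"apple": False, "mango": True,
                  "strawberry": False, "papaya": True}

def is_tropical(fruit):
    return tropical_truth[fruit]

apples = 47 - 13 + 5      # interpreter: 39
remaining = apples // 2   # interpreter: exact integer division -> 19

fruits = ["apple", "mango", "strawberry", "papaya"]
tropical = [f for f in fruits if is_tropical(f)]  # LMulator filter
total = len(tropical) * remaining                 # interpreter: 2 * 19

print(remaining, tropical, total)  # 19 ['mango', 'papaya'] 38
```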
Natural Language Works Too
While structured frameworks and contextual labels are powerful tools, LLMs are exceptionally good at understanding natural language. As long as your prompt contains the actual contextual information needed to create, answer, or deliver the response you’re looking for — the who, what, why, and constraints — the AI can produce complete and accurate results whether you use a formal framework or plain conversational language. But even in 2026, with the best prompts, verifying AI output is always a necessary step.
Chain of Code in Action
See how hybrid code-language execution tackles mixed reasoning tasks
“Sort these activities by how physically demanding they are: reading a book, running a marathon, cooking dinner, climbing Mount Everest, walking to the mailbox. Then calculate the median index position.”
Semantic (LMulator): physical_score("reading a book") → 1, physical_score("walking to the mailbox") → 2, physical_score("cooking dinner") → 3, physical_score("running a marathon") → 4, physical_score("climbing Mount Everest") → 5.
Code (Interpreter): sorted_activities = sorted(activities, key=physical_score); median_idx = len(sorted_activities) // 2 → median_idx = 2.
Final answer: Sorted order: reading, walking, cooking, marathon, Everest. Median position: index 2 (cooking dinner). Remember to verify AI-generated rankings reflect your own judgment of physical difficulty.
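As a runnable sketch, with the LMulator's physical_score judgments stubbed as a lookup table (a real system would obtain these from the model):

```python
# Stand-in for the LMulator's judgment of physical demand.
scores = {"reading a book": 1, "walking to the mailbox": 2,
          "cooking dinner": 3, "running a marathon": 4,
          "climbing Mount Everest": 5}

def physical_score(activity):
    return scores[activity]

activities = ["reading a book", "running a marathon", "cooking dinner",
              "climbing Mount Everest", "walking to the mailbox"]

# Interpreter handles sorting and index arithmetic exactly.
sorted_activities = sorted(activities, key=physical_score)
median_idx = len(sorted_activities) // 2  # 5 // 2 -> 2

print(sorted_activities[median_idx])  # cooking dinner
```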
“Given these 5 product reviews, count how many are positive, how many are negative, and calculate the percentage of positive reviews. Reviews: (1) ‘Absolutely love it!’ (2) ‘Terrible quality, broke on day one.’ (3) ‘Best purchase this year.’ (4) ‘Would not recommend to anyone.’ (5) ‘Solid product, works as advertised.’”
Semantic (LMulator): sentiment("Absolutely love it!") → "positive"; sentiment("Terrible quality, broke on day one.") → "negative"; sentiment("Best purchase this year.") → "positive"; sentiment("Would not recommend to anyone.") → "negative"; sentiment("Solid product, works as advertised.") → "positive".
Code (Interpreter): positive = 3; negative = 2; pct = (positive / 5) * 100 → 60.0.
Final answer: 3 positive reviews, 2 negative reviews, 60% positive rate. Sentiment classification is subjective — always verify AI sentiment analysis against your own reading of the text.
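The same pipeline in code, with the sentiment classifier stubbed to the LMulator results listed above (an actual deployment would call the model per review):

```python
reviews = ["Absolutely love it!",
           "Terrible quality, broke on day one.",
           "Best purchase this year.",
           "Would not recommend to anyone.",
           "Solid product, works as advertised."]

# Stand-in for LMulator sentiment classification.
simulated = {reviews[0]: "positive", reviews[1]: "negative",
             reviews[2]: "positive", reviews[3]: "negative",
             reviews[4]: "positive"}

def sentiment(text):
    return simulated[text]

labels = [sentiment(r) for r in reviews]       # LMulator step
positive = labels.count("positive")            # interpreter: 3
negative = labels.count("negative")            # interpreter: 2
pct = positive / len(reviews) * 100            # interpreter: 60.0

print(positive, negative, pct)  # 3 2 60.0
```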
“From this list of cities — Tokyo, Paris, Sydney, Cairo, Toronto — identify which are in the Northern Hemisphere, sort them alphabetically, and report the count and sorted list.”
Semantic (LMulator): is_northern("Tokyo") → True; is_northern("Paris") → True; is_northern("Sydney") → False; is_northern("Cairo") → True; is_northern("Toronto") → True.
Code (Interpreter): northern = [c for c in cities if is_northern(c)]; northern.sort(); count = len(northern) → ["Cairo", "Paris", "Tokyo", "Toronto"], count = 4.
Final answer: 4 cities in the Northern Hemisphere: Cairo, Paris, Tokyo, Toronto (alphabetical). Sydney is in the Southern Hemisphere. Geographic classifications should be verified against authoritative sources.
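And the geography example as code. The is_northern predicate is a hardcoded stand-in for the LMulator's world knowledge:

```python
cities = ["Tokyo", "Paris", "Sydney", "Cairo", "Toronto"]

# Stand-in for the LMulator's geographic judgment: of these five,
# only Sydney lies in the Southern Hemisphere.
def is_northern(city):
    return city != "Sydney"

northern = [c for c in cities if is_northern(c)]  # LMulator filter
northern.sort()                                   # interpreter sorts
count = len(northern)                             # interpreter counts

print(count, northern)  # 4 ['Cairo', 'Paris', 'Tokyo', 'Toronto']
```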
When to Use Chain of Code
Best for tasks that mix computation with semantic understanding
Perfect For
Problems that require both precise computation (math, counting, sorting) and semantic judgment (classification, sentiment, common sense) in a single pipeline.
Complex benchmark tasks that stump pure CoT — CoC achieved 84% accuracy on BIG-Bench Hard by combining the strengths of both reasoning modes.
When you need to filter, sort, or aggregate data based on criteria that require world knowledge rather than simple value comparisons.
Building AI agents that need to decide dynamically whether to execute code or reason linguistically at each step of a multi-step task.
Skip It When
If the entire problem is executable code (math, data transforms), standard Program of Thoughts or PAL is simpler and sufficient.
Writing, brainstorming, and open-ended generation tasks that involve no computation — the code execution layer adds overhead without benefit.
When you lack access to a code interpreter, CoC loses its primary advantage — use Chain-of-Thought instead for language-only environments.
Use Cases
Where Chain of Code delivers the most value
Document Classification with Stats
Classify documents by topic using semantic understanding, then compute distribution statistics, percentages, and trends using precise arithmetic — all in one pass.
Survey Response Analysis
Parse free-text survey responses for sentiment and themes (semantic), then aggregate counts, compute averages, and generate summary statistics (computational).
Scientific Data Curation
Determine which experimental observations are “anomalous” using domain knowledge (semantic), then apply statistical outlier detection on the remaining valid data (computational).
E-Commerce Product Ranking
Evaluate product descriptions for relevance to a user query (semantic), then score and rank results using weighted algorithms (computational).
Compliance Checking
Interpret regulatory requirements in natural language (semantic), then verify structured data records against those requirements using formal logic checks (computational).
Financial Report Parsing
Extract qualitative insights from earnings call transcripts (semantic), then combine with numerical financial data for ratio calculations and trend analysis (computational).
Where Chain of Code Fits
CoC unifies the code execution and language reasoning lineages
The LMulator concept — where a language model simulates code execution for operations that require world knowledge — has evolved into the broader “tool use” paradigm in modern AI. Today’s agents dynamically decide whether to run code, call an API, search the web, or reason linguistically. Chain of Code was one of the first techniques to formalize this selective routing, making it a conceptual ancestor of modern agentic architectures.
Related Techniques
Explore the code reasoning family of techniques
Blend Code and Language Reasoning
Try Chain of Code’s hybrid approach on your own mixed reasoning tasks or explore more code reasoning techniques.