COSP
Consistency-based Self-adaptive Prompting — bootstrap reliable few-shot demonstrations from the model’s own confident, high-agreement outputs, eliminating the need for human-written examples.
Introduced: COSP (Consistency-based Self-adaptive Prompting) was introduced by Wan et al. in 2023 to address one of the most persistent bottlenecks in few-shot prompting: the need for high-quality, human-written demonstrations. Creating effective demonstrations requires domain expertise, careful formatting, and significant manual effort — a process that scales poorly across diverse tasks and domains. COSP solves this by having the model generate its own demonstrations, then using consistency across multiple samples as a reliability signal to filter for only the most trustworthy outputs to serve as pseudo-demonstrations.
Modern LLM Status: COSP remains an active and relevant technique, particularly valuable in automated prompt engineering pipelines and production systems that need domain-specific demonstrations at scale. As models have grown more capable, the quality of self-generated demonstrations has improved correspondingly, making COSP even more effective than when it was first proposed. The core principle — that consistency signals reliability — has influenced subsequent work in self-adaptive prompting, automatic demonstration generation, and LLM-as-judge paradigms where models evaluate their own outputs.
Let the Model Teach Itself
Few-shot prompting is one of the most effective ways to improve LLM performance — providing input-output examples in the prompt gives the model concrete patterns to follow. But creating those demonstrations has always required human effort: someone must write high-quality examples for every new task, domain, or question type. This creates a bottleneck that limits how quickly few-shot prompting can be deployed at scale.
COSP eliminates this bottleneck entirely. Instead of relying on humans to craft demonstrations, the model generates its own by answering questions multiple times with sampling (temperature > 0). When the same answer appears consistently across multiple independent samples, that consistency serves as a strong signal that the answer is correct. These high-consistency outputs then become the demonstrations for harder questions — the model effectively teaches itself by identifying what it already knows with confidence.
The elegance of COSP lies in its simplicity: it requires no labeled data, no human annotation, and no task-specific tuning. The model bootstraps its own demonstrations from the raw signal of self-agreement, turning the unreliability of any single sample into a strength when measured across many.
If a model produces the same answer across multiple independent samples, that answer is far more likely to be correct than any single draw. High agreement = high confidence = reliable demonstration. This is the same logic behind Self-Consistency voting, but COSP goes beyond picking the best answer: it reuses the entire confident output as an example for future questions. The consistency score becomes a quality filter that no human annotator could match for speed or scale.
How COSP Works
Four steps from zero-shot uncertainty to self-generated few-shot demonstrations
Sample Multiple Zero-Shot Responses
Run a set of questions through the model multiple times using temperature sampling (temperature > 0). Each question receives several independent response attempts, each potentially different due to the stochastic sampling process. This diversity is the raw material — by generating multiple outputs for the same input, you create a dataset from which consistency can be measured.
“What is 15% of 240?” — sampled 5 times, producing responses: 36, 36, 36, 36, 38. Four out of five samples agree on 36.
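The sampling step can be sketched as follows. Here `ask_model` is a hypothetical stand-in for a real LLM API call with temperature > 0; canned answers replace the stochastic model so the example runs deterministically.

```python
from collections import Counter

# Hypothetical stand-in for an LLM call with temperature > 0.
# A real system would hit a model API; canned answers emulate the five samples
# described above.
_canned = iter(["36", "36", "36", "36", "38"])

def ask_model(question: str, temperature: float = 0.7) -> str:
    return next(_canned)

question = "What is 15% of 240?"
samples = [ask_model(question) for _ in range(5)]

# Tally agreement: the modal answer and how many samples produced it.
answer, freq = Counter(samples).most_common(1)[0]
print(answer, freq)  # 36 4
```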
Measure Consistency Across Samples
For each question, calculate a consistency score based on the degree of agreement among the sampled responses. Questions where most or all samples produce the same answer receive high consistency scores. Questions where samples diverge widely receive low scores. This consistency metric serves as a proxy for the model’s confidence — high agreement indicates the model “knows” the answer reliably, while low agreement indicates uncertainty or difficulty.
The question “What is 15% of 240?” scores 4/5 (80%) consistency. A harder question like “Estimate the GDP impact of a 2% tariff increase” might score 1/5 (20%) — indicating the model is uncertain.
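A minimal consistency score is simply the fraction of samples that agree with the most common answer. This is one of several possible agreement metrics, used here as an illustrative sketch:

```python
from collections import Counter

def consistency_score(samples: list[str]) -> tuple[str, float]:
    """Return the modal answer and the fraction of samples agreeing with it."""
    answer, freq = Counter(samples).most_common(1)[0]
    return answer, freq / len(samples)

# High agreement: four of five samples say "36".
print(consistency_score(["36", "36", "36", "36", "38"]))  # ('36', 0.8)

# Wide divergence: no answer appears more than twice.
print(consistency_score(["A", "B", "B", "C", "D"]))       # ('B', 0.4)
```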
Select High-Consistency Outputs as Pseudo-Demonstrations
Filter for the questions and answers that achieved the highest consistency scores. These become your pseudo-demonstrations: complete question-answer pairs that the model produced confidently and, given the strength of the consistency signal, most likely correctly. Because they were generated by the model itself, they are naturally formatted in the model’s own style and vocabulary, making them particularly effective as few-shot examples. Select a diverse set to cover different aspects of the task.
From 50 sampled questions, select the top 5 with the highest consistency scores. These become the demonstrations: “Q: What is 15% of 240? A: 36” and similar high-confidence pairs.
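Selection can be a simple top-k over consistency scores, as in this sketch; a production version might also enforce diversity across question types:

```python
def select_demonstrations(scored: list[tuple[str, str, float]], k: int = 5):
    """scored: (question, answer, consistency) triples; keep the k most consistent."""
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]

# Illustrative pool with scores from the earlier steps (values are assumptions).
pool = [
    ("What is 15% of 240?", "36", 0.8),
    ("What is 25% of 80?", "20", 1.0),
    ("Estimate the GDP impact of a 2% tariff increase", "uncertain", 0.2),
]
demos = select_demonstrations(pool, k=2)
print([q for q, _, _ in demos])  # most consistent questions first
```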
Use as Few-Shot Examples for Target Questions
Prepend the selected pseudo-demonstrations to the prompt for new or harder questions. The model now has concrete, reliable examples of the task format and expected reasoning — all generated without any human annotation. For questions that had low consistency in Step 2, this few-shot context often provides enough guidance to push the model toward the correct answer, improving accuracy on exactly the questions where the model previously struggled most.
“Here are some examples: [high-consistency demonstrations]. Now answer: Estimate the GDP impact of a 2% tariff increase.” The previously uncertain question now benefits from the structured pattern established by confident examples.
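Prompt assembly is then plain string formatting. The “Q:/A:” prefixes here are an assumed convention, not prescribed by COSP:

```python
def build_prompt(demos: list[tuple[str, str]], target: str) -> str:
    """Prepend (question, answer) pseudo-demonstrations to a target question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos]
    blocks.append(f"Q: {target}\nA:")
    return "\n\n".join(blocks)

prompt = build_prompt(
    [("What is 15% of 240?", "36"), ("What is 25% of 80?", "20")],
    "Estimate the GDP impact of a 2% tariff increase",
)
print(prompt)
```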
See the Difference
How self-generated demonstrations improve accuracy without human effort
Standard Zero-Shot
Send each question to the model with no examples. The model relies entirely on its pre-training knowledge and the instruction alone. No demonstrations, no patterns to follow, no calibration of expected output format.
Moderate accuracy. The model answers easy questions correctly but struggles with ambiguous or harder questions where example patterns would help. Output format varies unpredictably across responses.
COSP
Sample multiple responses, identify high-consistency outputs, and use them as auto-generated few-shot demonstrations. Each target question is now preceded by reliable examples the model produced itself — no human labeling required.
Higher accuracy across all question difficulty levels. The auto-generated demonstrations calibrate the model’s output format and reasoning approach. Hardest questions see the largest improvement because they benefit most from example patterns.
COSP in Action
See how consistency-based demonstration selection works across different task types
Ask the model “What is 25% of 80?” five times with temperature sampling. Results: 20, 20, 20, 20, 20. Perfect consistency (5/5). This question-answer pair becomes a pseudo-demonstration because the model is clearly confident in the answer.
Now when the model faces a harder question like “A store offers 15% off a $340 item, then an additional 10% loyalty discount on the reduced price. What is the final price?”, prepend the high-consistency demonstrations as examples. The model now has a concrete pattern for percentage calculations, improving its odds of chaining the two discounts sequentially ($340 → $289 → $260.10) rather than incorrectly summing them into a single 25% discount.
Classify the sentiment of “This product completely transformed my morning routine — I cannot imagine going back!” across five samples. Results: Positive, Positive, Positive, Positive, Positive. Consistency score: 100%. This becomes a demonstration with a clear positive label.
For an ambiguous review like “The build quality is excellent but the price feels steep for what you get” — which previously split 3/5 Mixed, 1/5 Positive, 1/5 Negative — prepend the high-consistency examples covering clear Positive, Negative, and Mixed sentiments. The model now has calibrated boundaries for each category, producing more consistent and accurate classifications on edge cases.
Ask “If all roses are flowers and all flowers need water, do roses need water?” five times. All five samples produce: “Yes. All roses are flowers. All flowers need water. Therefore, roses need water.” The reasoning chain is consistent across all samples — not just the final answer, but the intermediate steps. This entire chain becomes a demonstration.
For a harder syllogism like “Some managers are leaders. All leaders are communicators. Are all managers communicators?” — prepend the consistent chain-of-reasoning demonstrations. The model now follows the demonstrated pattern of stating premises explicitly and chaining inferences step by step, rather than jumping to a potentially incorrect conclusion. The template teaches not just what to answer, but how to reason.
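Putting the pieces together, a full COSP pass over a batch can be sketched end to end. The canned sample lists below are hypothetical stand-ins for real temperature-sampled model calls:

```python
from collections import Counter

# Canned samples emulate five temperature-sampled model calls per question.
SAMPLES = {
    "What is 25% of 80?": ["20", "20", "20", "20", "20"],
    "What is 15% of 240?": ["36", "36", "36", "36", "38"],
    "Do roses need water?": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Estimate the GDP impact of a 2% tariff increase": ["A", "B", "B", "C", "D"],
}

def score(samples: list[str]) -> tuple[str, float]:
    """Modal answer and the fraction of samples agreeing with it."""
    ans, freq = Counter(samples).most_common(1)[0]
    return ans, freq / len(samples)

# Keep only pairs where at least 80% of samples agree (threshold is an assumption).
demos = [(q, a) for q, (a, c) in ((q, score(s)) for q, s in SAMPLES.items()) if c >= 0.8]

target = "A store offers 15% off a $340 item, then 10% off the reduced price. Final price?"
prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos) + f"\n\nQ: {target}\nA:"
print(len(demos))  # 3
```

The low-consistency GDP question is excluded from the demonstrations but would itself be a prime candidate to answer with the assembled few-shot prompt.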
When to Use COSP
Maximum value when human demonstrations are unavailable or impractical to create
Perfect For
When you need few-shot demonstrations but lack the time, expertise, or resources to create them manually — COSP generates reliable examples from the model itself.
When deploying prompts across dozens or hundreds of different task types where writing custom demonstrations for each would be prohibitively expensive.
Production systems that process diverse inputs without human intervention — COSP enables self-adaptive few-shot prompting that requires no manual curation.
When entering a new domain where you lack expertise to write demonstrations — let the model generate domain-appropriate examples from its own training knowledge.
Skip It When
If you already have carefully curated, expert-written demonstrations for your task, those will typically outperform auto-generated ones — human curation captures nuances that consistency alone cannot.
Tasks where zero-shot already achieves near-perfect accuracy — the overhead of sampling multiple responses and measuring consistency adds latency and cost with minimal accuracy gain.
For one-off questions where the cost of generating multiple samples to build demonstrations exceeds the benefit — COSP’s value scales with reuse across multiple target questions.
Use Cases
Where COSP delivers the most value in practice
Automated QA Systems
Build question-answering pipelines that automatically generate their own few-shot examples from high-confidence answers, improving accuracy on harder questions without manual example curation.
Domain Adaptation
Adapt models to specialized domains — legal, medical, scientific — by letting the model generate domain-specific demonstrations from its own knowledge, avoiding the need for expert annotators.
Batch Processing
Process large batches of questions or tasks by first identifying high-consistency responses in a sampling pass, then using those as demonstrations for the remaining items in the batch.
Research Benchmarking
Evaluate model performance on new benchmarks without manually creating task-specific demonstrations — COSP provides a fair, automated baseline for few-shot evaluation across any task type.
Content Classification
Classify content at scale by first generating high-confidence labeled examples through consistency sampling, then using those labels as demonstrations for ambiguous or edge-case content.
Knowledge Extraction
Extract structured information from unstructured text by using consistently extracted examples as templates, improving extraction accuracy on documents with unusual formatting or complex layouts.
Where COSP Fits
COSP bridges zero-shot simplicity and few-shot effectiveness without human effort
Self-Consistency uses multiple samples to pick the best answer through majority voting. COSP takes this further: instead of just voting on answers, it harvests the confident outputs and repurposes them as teaching material for the model itself. This shift from “sample and discard” to “sample and reuse” is what makes COSP a self-adaptive technique rather than just a voting mechanism. The same consistency signal that tells you which answer is correct also tells you which outputs are reliable enough to serve as demonstrations.
Related Techniques
Explore techniques that share COSP’s foundation or extend its principles
Automate Your Demonstrations
Put COSP principles into practice with our interactive tools, or explore the foundation techniques that make self-adaptive prompting possible.