COSP
Consistency-based Self-adaptive Prompting — bootstrap reliable few-shot demonstrations from the model’s own confident, high-agreement outputs, eliminating the need for human-written examples.
Introduced: COSP (Consistency-based Self-adaptive Prompting) was introduced by Wan et al. in 2023 to address one of the most persistent bottlenecks in few-shot prompting: the need for high-quality, human-written demonstrations. Creating effective demonstrations requires domain expertise, careful formatting, and significant manual effort — a process that scales poorly across diverse tasks and domains. COSP solves this by having the model generate its own demonstrations, then using consistency across multiple samples as a reliability signal to filter for only the most trustworthy outputs to serve as pseudo-demonstrations.
Modern LLM Status: COSP remains an active and relevant technique, particularly valuable in automated prompt engineering pipelines and production systems that need domain-specific demonstrations at scale. As models have grown more capable, the quality of self-generated demonstrations has improved correspondingly, making COSP even more effective than when it was first proposed. The core principle — that consistency signals reliability — has influenced subsequent work in self-adaptive prompting, automatic demonstration generation, and LLM-as-judge paradigms where models evaluate their own outputs.
Let the Model Teach Itself
Few-shot prompting is one of the most effective ways to improve LLM performance — providing input-output examples in the prompt gives the model concrete patterns to follow. But creating those demonstrations has always required human effort: someone must write high-quality examples for every new task, domain, or question type. This creates a bottleneck that limits how quickly few-shot prompting can be deployed at scale.
COSP eliminates this bottleneck entirely. Instead of relying on humans to craft demonstrations, the model generates its own by answering questions multiple times with sampling (temperature > 0). When the same answer appears consistently across multiple independent samples, that consistency serves as a strong signal that the answer is correct. These high-consistency outputs then become the demonstrations for harder questions — the model effectively teaches itself by identifying what it already knows with confidence.
The elegance of COSP lies in its simplicity: it requires no labeled data, no human annotation, and no task-specific tuning. The model bootstraps its own demonstrations from the raw signal of self-agreement, turning the unreliability of any single sample into a strength when measured across many.
If a model produces the same answer across multiple independent samples, that answer is far more likely to be correct than any single draw. High agreement = high confidence = reliable demonstration. This is the same logic behind Self-Consistency voting, but COSP goes beyond picking the best answer: it reuses the entire confident output as an example for future questions. The consistency score becomes a quality filter that no human annotator could match for speed or scale.
How COSP Works
Four steps from zero-shot uncertainty to self-generated few-shot demonstrations
Sample Multiple Zero-Shot Responses
Run a set of questions through the model multiple times using temperature sampling (temperature > 0). Each question receives several independent response attempts, each potentially different due to the stochastic sampling process. This diversity is the raw material — by generating multiple outputs for the same input, you create a dataset from which consistency can be measured.
“What is 15% of 240?” — sampled 5 times, producing responses: 36, 36, 36, 36, 38. Four out of five samples agree on 36.
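The sampling step can be sketched as follows. Here `ask_model` is a hypothetical stand-in for a real LLM API call with temperature > 0; canned answers replace the stochastic model so the example runs deterministically.

```python
from collections import Counter

# Hypothetical stand-in for an LLM call with temperature > 0.
# A real system would hit a model API; canned answers emulate the five samples
# described above.
_canned = iter(["36", "36", "36", "36", "38"])

def ask_model(question: str, temperature: float = 0.7) -> str:
    return next(_canned)

question = "What is 15% of 240?"
samples = [ask_model(question) for _ in range(5)]

# Tally agreement: the modal answer and how many samples produced it.
answer, freq = Counter(samples).most_common(1)[0]
print(answer, freq)  # 36 4
```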
Measure Consistency Across Samples
For each question, calculate a consistency score based on the degree of agreement among the sampled responses. Questions where most or all samples produce the same answer receive high consistency scores. Questions where samples diverge widely receive low scores. This consistency metric serves as a proxy for the model’s confidence — high agreement indicates the model “knows” the answer reliably, while low agreement indicates uncertainty or difficulty.
The question “What is 15% of 240?” scores 4/5 (80%) consistency. A harder question like “Estimate the GDP impact of a 2% tariff increase” might score 1/5 (20%) — indicating the model is uncertain.
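A minimal consistency score is simply the fraction of samples that agree with the most common answer. This is one of several possible agreement metrics, used here as an illustrative sketch:

```python
from collections import Counter

def consistency_score(samples: list[str]) -> tuple[str, float]:
    """Return the modal answer and the fraction of samples agreeing with it."""
    answer, freq = Counter(samples).most_common(1)[0]
    return answer, freq / len(samples)

# High agreement: four of five samples say "36".
print(consistency_score(["36", "36", "36", "36", "38"]))  # ('36', 0.8)

# Wide divergence: no answer appears more than twice.
print(consistency_score(["A", "B", "B", "C", "D"]))       # ('B', 0.4)
```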
Select High-Consistency Outputs as Pseudo-Demonstrations
Filter for the questions and answers that achieved the highest consistency scores. These become your pseudo-demonstrations: complete question-answer pairs that the model produced confidently and, given the strength of the consistency signal, most likely correctly. Because they were generated by the model itself, they are naturally formatted in the model’s own style and vocabulary, making them particularly effective as few-shot examples. Select a diverse set to cover different aspects of the task.
From 50 sampled questions, select the top 5 with the highest consistency scores. These become the demonstrations: “Q: What is 15% of 240? A: 36” and similar high-confidence pairs.
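Selection can be a simple top-k over consistency scores, as in this sketch; a production version might also enforce diversity across question types:

```python
def select_demonstrations(scored: list[tuple[str, str, float]], k: int = 5):
    """scored: (question, answer, consistency) triples; keep the k most consistent."""
    return sorted(scored, key=lambda t: t[2], reverse=True)[:k]

# Illustrative pool with scores from the earlier steps (values are assumptions).
pool = [
    ("What is 15% of 240?", "36", 0.8),
    ("What is 25% of 80?", "20", 1.0),
    ("Estimate the GDP impact of a 2% tariff increase", "uncertain", 0.2),
]
demos = select_demonstrations(pool, k=2)
print([q for q, _, _ in demos])  # most consistent questions first
```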
Use as Few-Shot Examples for Target Questions
Prepend the selected pseudo-demonstrations to the prompt for new or harder questions. The model now has concrete, reliable examples of the task format and expected reasoning — all generated without any human annotation. For questions that had low consistency in Step 2, this few-shot context often provides enough guidance to push the model toward the correct answer, improving accuracy on exactly the questions where the model previously struggled most.
“Here are some examples: [high-consistency demonstrations]. Now answer: Estimate the GDP impact of a 2% tariff increase.” The previously uncertain question now benefits from the structured pattern established by confident examples.
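Prompt assembly is then plain string formatting. The “Q:/A:” prefixes here are an assumed convention, not prescribed by COSP:

```python
def build_prompt(demos: list[tuple[str, str]], target: str) -> str:
    """Prepend (question, answer) pseudo-demonstrations to a target question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos]
    blocks.append(f"Q: {target}\nA:")
    return "\n\n".join(blocks)

prompt = build_prompt(
    [("What is 15% of 240?", "36"), ("What is 25% of 80?", "20")],
    "Estimate the GDP impact of a 2% tariff increase",
)
print(prompt)
```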
See the Difference
How self-generated demonstrations improve accuracy without human effort
Standard Zero-Shot
Send each question to the model with no examples. The model relies entirely on its pre-training knowledge and the instruction alone. No demonstrations, no patterns to follow, no calibration of expected output format.
Moderate accuracy. The model answers easy questions correctly but struggles with ambiguous or harder questions where example patterns would help. Output format varies unpredictably across responses.
COSP
Sample multiple responses, identify high-consistency outputs, and use them as auto-generated few-shot demonstrations. Each target question is now preceded by reliable examples the model produced itself — no human labeling required.
Higher accuracy across all question difficulty levels. The auto-generated demonstrations calibrate the model’s output format and reasoning approach. Hardest questions see the largest improvement because they benefit most from example patterns.
COSP in Action
See how consistency-based demonstration selection works across different task types
Ask the model “What is 25% of 80?” five times with temperature sampling. Results: 20, 20, 20, 20, 20. Perfect consistency (5/5). This question-answer pair becomes a pseudo-demonstration because the model is clearly confident in the answer.
Now when the model faces a harder question like “A store offers 15% off a $340 item, then an additional 10% loyalty discount on the reduced price. What is the final price?”, prepend the high-consistency demonstrations as examples. The model now has a concrete pattern for percentage calculations, improving its odds of chaining the two discounts sequentially ($340 → $289 → $260.10) rather than incorrectly summing them into a single 25% discount.
Classify the sentiment of “This product completely transformed my morning routine — I cannot imagine going back!” across five samples. Results: Positive, Positive, Positive, Positive, Positive. Consistency score: 100%. This becomes a demonstration with a clear positive label.
For an ambiguous review like “The build quality is excellent but the price feels steep for what you get” — which previously split 3/5 Mixed, 1/5 Positive, 1/5 Negative — prepend the high-consistency examples covering clear Positive, Negative, and Mixed sentiments. The model now has calibrated boundaries for each category, producing more consistent and accurate classifications on edge cases.
Ask “If all roses are flowers and all flowers need water, do roses need water?” five times. All five samples produce: “Yes. All roses are flowers. All flowers need water. Therefore, roses need water.” The reasoning chain is consistent across all samples — not just the final answer, but the intermediate steps. This entire chain becomes a demonstration.
For a harder syllogism like “Some managers are leaders. All leaders are communicators. Are all managers communicators?” — prepend the consistent chain-of-reasoning demonstrations. The model now follows the demonstrated pattern of stating premises explicitly and chaining inferences step by step, rather than jumping to a potentially incorrect conclusion. The template teaches not just what to answer, but how to reason.
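Putting the pieces together, a full COSP pass over a batch can be sketched end to end. The canned sample lists below are hypothetical stand-ins for real temperature-sampled model calls:

```python
from collections import Counter

# Canned samples emulate five temperature-sampled model calls per question.
SAMPLES = {
    "What is 25% of 80?": ["20", "20", "20", "20", "20"],
    "What is 15% of 240?": ["36", "36", "36", "36", "38"],
    "Do roses need water?": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Estimate the GDP impact of a 2% tariff increase": ["A", "B", "B", "C", "D"],
}

def score(samples: list[str]) -> tuple[str, float]:
    """Modal answer and the fraction of samples agreeing with it."""
    ans, freq = Counter(samples).most_common(1)[0]
    return ans, freq / len(samples)

# Keep only pairs where at least 80% of samples agree (threshold is an assumption).
demos = [(q, a) for q, (a, c) in ((q, score(s)) for q, s in SAMPLES.items()) if c >= 0.8]

target = "A store offers 15% off a $340 item, then 10% off the reduced price. Final price?"
prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos) + f"\n\nQ: {target}\nA:"
print(len(demos))  # 3
```

The low-consistency GDP question is excluded from the demonstrations but would itself be a prime candidate to answer with the assembled few-shot prompt.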
When to Use COSP
Maximum value when human demonstrations are unavailable or impractical to create
Perfect For
When you need few-shot demonstrations but lack the time, expertise, or resources to create them manually — COSP generates reliable examples from the model itself.
When deploying prompts across dozens or hundreds of different task types where writing custom demonstrations for each would be prohibitively expensive.
Production systems that process diverse inputs without human intervention — COSP enables self-adaptive few-shot prompting that requires no manual curation.
When entering a new domain where you lack expertise to write demonstrations — let the model generate domain-appropriate examples from its own training knowledge.
Skip It When
If you already have carefully curated, expert-written demonstrations for your task, those will typically outperform auto-generated ones — human curation captures nuances that consistency alone cannot.
Tasks where zero-shot already achieves near-perfect accuracy — the overhead of sampling multiple responses and measuring consistency adds latency and cost with minimal accuracy gain.
For one-off questions where the cost of generating multiple samples to build demonstrations exceeds the benefit — COSP’s value scales with reuse across multiple target questions.
Use Cases
Where COSP delivers the most value in practice
Automated QA Systems
Build question-answering pipelines that automatically generate their own few-shot examples from high-confidence answers, improving accuracy on harder questions without manual example curation.
Domain Adaptation
Adapt models to specialized domains — legal, medical, scientific — by letting the model generate domain-specific demonstrations from its own knowledge, avoiding the need for expert annotators.
Batch Processing
Process large batches of questions or tasks by first identifying high-consistency responses in a sampling pass, then using those as demonstrations for the remaining items in the batch.
Research Benchmarking
Evaluate model performance on new benchmarks without manually creating task-specific demonstrations — COSP provides a fair, automated baseline for few-shot evaluation across any task type.
Content Classification
Classify content at scale by first generating high-confidence labeled examples through consistency sampling, then using those labels as demonstrations for ambiguous or edge-case content.
Knowledge Extraction
Extract structured information from unstructured text by using consistently extracted examples as templates, improving extraction accuracy on documents with unusual formatting or complex layouts.
Where COSP Fits
COSP bridges zero-shot simplicity and few-shot effectiveness without human effort
Self-Consistency uses multiple samples to pick the best answer through majority voting. COSP takes this further: instead of just voting on answers, it harvests the confident outputs and repurposes them as teaching material for the model itself. This shift from “sample and discard” to “sample and reuse” is what makes COSP a self-adaptive technique rather than just a voting mechanism. The same consistency signal that tells you which answer is correct also tells you which outputs are reliable enough to serve as demonstrations.
Related Techniques
Explore techniques that share COSP’s foundation or extend its principles
Automate Your Demonstrations
Put COSP principles into practice with our interactive tools, or explore the foundation techniques that make self-adaptive prompting possible.