Few-Shot Optimization

Example Selection

The examples you choose for few-shot prompting matter more than the number you provide. Strategic selection — matching examples by similarity, diversity, and task relevance — can improve performance by 20–30% over random choices.

Technique Context: 2021

Introduced: Example Selection was formalized by Liu et al. in 2021, who demonstrated that the choice of in-context examples can swing few-shot accuracy by more than 20 percentage points — often mattering more than the number of examples provided. Their work showed that retrieving semantically similar examples using nearest-neighbor methods dramatically outperformed random selection, establishing a new paradigm for how practitioners approach few-shot learning.

Modern LLM Status: The core principles of Example Selection remain highly relevant and actively practiced in modern prompting workflows. While Claude, GPT-4, and Gemini are more robust to example quality than earlier models, strategic selection still yields measurable gains on classification, extraction, and reasoning tasks. The technique has evolved into embedding-based retrieval pipelines and forms the foundation for more advanced methods like KNN Prompting and Vote-k selection. For production systems processing thousands of queries, automated example selection is now considered standard practice.

The Core Insight

Not All Examples Are Created Equal

When you provide examples in a few-shot prompt, the model doesn’t just learn the task format — it learns the distribution of your examples. If your examples are semantically similar to the test input, the model receives a stronger signal about what kind of output is expected. If they are from a completely different domain, the model must bridge a wider gap, and accuracy suffers.

Example Selection turns this insight into a method. Rather than grabbing random demonstrations from a pool, you deliberately choose examples that maximize relevance to the specific query at hand. This can mean selecting by semantic similarity (choosing examples closest in meaning to the input), by diversity (covering different aspects of the task space), or by task-specific criteria (matching complexity, format, or domain).

Think of it like choosing study materials for an exam. Reviewing practice problems that closely resemble the actual test questions is far more effective than studying random exercises from unrelated chapters — even if you study the same total number of problems.

Why Selection Beats Quantity

Liu et al.’s landmark finding was striking: three well-chosen examples consistently outperformed ten randomly selected ones across multiple benchmarks. The model extracts more useful patterns from a small set of highly relevant demonstrations than from a large set of loosely related ones. This means smarter selection can actually reduce token costs while improving accuracy — a rare win-win in prompt engineering.

The Example Selection Process

Three stages from candidate pool to optimized demonstration set

1. Build a Candidate Pool

Assemble a diverse collection of labeled input-output pairs that represent your task. This pool should cover different categories, edge cases, complexity levels, and domains. The richer and more varied your candidate pool, the better your selection algorithms can find optimal matches for any given query.

Example

For a sentiment classification task, collect 50–100 labeled reviews spanning positive, negative, neutral, and mixed sentiments across product categories like electronics, food, clothing, and services.
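A candidate pool can start as a plain list of labeled records. The sketch below is a minimal version; the records, labels, and field names are all illustrative, not drawn from a real dataset:

```python
# Minimal candidate pool for a sentiment classification task.
# All records, labels, and field names here are illustrative.
CANDIDATE_POOL = [
    {"input": "The headphones died after two days", "label": "NEGATIVE", "category": "electronics"},
    {"input": "Best tacos I have had all year", "label": "POSITIVE", "category": "food"},
    {"input": "The jacket fits fine, nothing special", "label": "NEUTRAL", "category": "clothing"},
    {"input": "Fast delivery, but support never answered", "label": "MIXED", "category": "services"},
    # ...in practice, 50-100 records covering every label, domain, and edge case.
]

def labels_covered(pool):
    """Return the set of labels represented in the pool, so coverage gaps are easy to spot."""
    return {record["label"] for record in pool}
```

Checking label coverage before deployment catches pools that are silently missing a category.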

2. Score and Rank Candidates

For each incoming query, evaluate every candidate example against a selection criterion. The most common approach is semantic similarity — compute embedding vectors for both the query and each candidate, then rank by cosine similarity. Alternatively, you can score by diversity (maximize coverage of the label space), complexity matching (align example difficulty with query difficulty), or a hybrid of multiple criteria.

Example

Given a query about a “wireless headphone battery issue,” embed the query and compare it against all candidates. The top matches might be reviews about Bluetooth earbuds, wireless speakers, and laptop battery problems — all semantically close to the input.
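The scoring step can be sketched as follows. A production pipeline would call a real embedding model here; the bag-of-words `embed` below is a self-contained stand-in, and all function and record names are illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real pipeline would call an
    embedding model here; this stand-in keeps the sketch self-contained."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(query, candidates):
    """Return candidates sorted most-similar-first relative to the query."""
    q = embed(query)
    return sorted(candidates, key=lambda c: cosine_similarity(q, embed(c["input"])), reverse=True)

candidates = [
    {"input": "Bluetooth earbuds keep disconnecting", "label": "CONNECTIVITY"},
    {"input": "Refund request for a damaged blender", "label": "REFUND"},
    {"input": "Laptop wifi drops after the update", "label": "CONNECTIVITY"},
]
ranked = rank_candidates("My laptop won't connect to wifi after the update", candidates)
```

Even with the toy embedding, the candidate that shares the query's vocabulary ranks first; a real embedding model catches semantic matches that share no surface words at all.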

3. Assemble and Order the Prompt

Select the top-k candidates and arrange them in your prompt. Research shows that example ordering matters — models exhibit recency bias, giving more weight to examples closer to the query. Place your most representative or most similar example last, immediately before the test input. Ensure the final set balances similarity with enough diversity to avoid biasing the model toward a single output pattern.

Example

Select 3 examples: one covering a different product category (diversity), one with mixed sentiment (boundary case), and one highly similar to the query (placed last for recency). Insert them in this order before the query in your few-shot prompt.
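The assembly step, with the most similar example placed last, can be sketched like this. The function and data names are illustrative, and `ranked` is assumed to arrive sorted most-similar-first, as the scoring step produces:

```python
def build_prompt(query, ranked, k=3):
    """Take the top-k candidates (ranked most-similar-first) and place the
    MOST similar example last, just before the query, to exploit recency bias."""
    chosen = ranked[:k]
    ordered = list(reversed(chosen))  # least similar first, most similar last
    lines = ["Task: Classify this tech support ticket.", "", "Examples:"]
    for ex in ordered:
        lines.append(f'"{ex["input"]}" -> {ex["label"]}')
    lines.append("")
    lines.append(f'Query: "{query}"')
    return "\n".join(lines)

# Illustrative ranked list: most similar example first.
ranked = [
    {"input": "Laptop wifi drops after the update", "label": "CONNECTIVITY"},
    {"input": "Can't print to network printer", "label": "CONNECTIVITY"},
    {"input": "App crashes when I open settings", "label": "SOFTWARE_BUG"},
]
prompt = build_prompt("My laptop won't connect to WiFi after the update", ranked)
```

In the assembled prompt the software-bug example appears first and the near-duplicate wifi example sits immediately before the query, the position the recency research suggests carries the most weight.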

See the Difference

Why strategically selected examples outperform random ones

Random Examples

Prompt

Task: Classify this tech support ticket.

Examples:
“Great pizza!” → POSITIVE
“The movie was boring” → NEGATIVE
“Nice weather today” → POSITIVE

Query: “My laptop won’t connect to WiFi after the update”

Response

NEGATIVE
Model applied sentiment labels instead of support categories — examples from the wrong domain taught the wrong task entirely.

Wrong domain, wrong label space, misleading signal

Strategically Selected

Curated Examples

Task: Classify this tech support ticket.

Examples:
“Bluetooth stopped working after restart” → CONNECTIVITY
“App crashes when I open settings” → SOFTWARE_BUG
“Can’t print to network printer” → CONNECTIVITY

Query: “My laptop won’t connect to WiFi after the update”

Response

CONNECTIVITY
Semantically similar examples from the same domain taught the correct label space and category boundaries.

Same domain, correct labels, strong pattern match

Natural Language Works Too

While embedding pipelines and selection algorithms are powerful tools, LLMs are exceptionally good at understanding natural language. As long as your prompt contains the actual contextual information needed to produce the response you're looking for (the who, what, why, and constraints) the model can deliver complete and accurate results whether you curate examples with a formal pipeline or describe the task in plain conversational language. Even with the best-selected examples, verifying AI output remains a necessary step.

Example Selection in Action

Three strategies for choosing optimal demonstrations

Strategy 1: Semantic Similarity

Choose examples whose inputs are most semantically similar to the test query. Compute embeddings for all candidates and the query, then select the nearest neighbors by cosine similarity.

Application

Query: “The battery drains too fast on my new tablet”

Selected examples (by similarity):
1. “My phone battery dies within 3 hours of full charge” → HARDWARE — BATTERY
2. “Laptop shuts down at 30% battery remaining” → HARDWARE — BATTERY
3. “Wireless earbuds won’t hold charge after update” → HARDWARE — BATTERY

Result: The model correctly classifies the query as HARDWARE — BATTERY with high confidence. Each example shares semantic features with the query: portable devices, battery complaints, and the same category label.

Strategy 2: Diversity Coverage

Choose examples that cover different categories, formats, and edge cases to show the model the full output space. Ensure at least one example per expected label to prevent the model from being biased toward any single category.

Application

Query: “Where is my package? I ordered it two weeks ago”

Selected examples (by diversity):
1. “I want my money back for this defective item” → REFUND
2. “How do I reset my account password?” → ACCOUNT
3. “When will my order ship?” → TRACKING

Result: The model correctly classifies as TRACKING. By seeing one example from each category, it understands the full label space and can distinguish between REFUND, ACCOUNT, and TRACKING intents rather than defaulting to the most-represented category.
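A diversity-first pass can be as simple as taking the first candidate seen for each new label. The sketch below is illustrative; the pool and labels are made up:

```python
def select_diverse(candidates, k):
    """Pick up to k examples covering as many distinct labels as possible.
    Iterates in order, taking the first candidate seen for each new label."""
    chosen, seen_labels = [], set()
    for cand in candidates:
        if cand["label"] not in seen_labels:
            chosen.append(cand)
            seen_labels.add(cand["label"])
        if len(chosen) == k:
            break
    return chosen

pool = [
    {"input": "I want my money back for this defective item", "label": "REFUND"},
    {"input": "Refund my duplicate charge", "label": "REFUND"},
    {"input": "How do I reset my account password?", "label": "ACCOUNT"},
    {"input": "When will my order ship?", "label": "TRACKING"},
]
examples = select_diverse(pool, k=3)
```

The second REFUND example is skipped so the final set spans all three labels, which is exactly the property that prevents the model from defaulting to the most-represented category.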

Strategy 3: Task Matching

Choose examples that match the specific characteristics of the task — complexity level, output format, reasoning depth, or domain conventions. For a complex legal analysis, use complex legal examples; for a simple label task, use concise, clear-cut demonstrations.

Application

Query: “Notwithstanding the foregoing, in the event of force majeure as defined in Section 12.1, neither party shall be held liable for delays in performance”

Selected examples (by task match):
1. “Subject to Section 4.2(b), the indemnifying party shall hold harmless...” → INDEMNIFICATION with CONDITIONS — References external section, establishes conditional obligation
2. “Except as otherwise provided herein, all warranties express or implied are disclaimed...” → WARRANTY DISCLAIMER with EXCEPTION — Broad exclusion with carve-out language

Result: EXCEPTION CLAUSE with LIABILITY LIMITATION — The model produces a nuanced analysis matching the complexity of the input because the examples demonstrated the expected depth of legal reasoning.
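Complexity matching can be approximated with a crude proxy such as word count; a real system might instead score clause depth, vocabulary rarity, or required reasoning steps. The sketch below is illustrative, with made-up records and labels:

```python
def complexity(text):
    """Crude complexity proxy: word count. A real system might score
    clause depth, vocabulary rarity, or reasoning steps instead."""
    return len(text.split())

def select_by_complexity(query, candidates, k=2):
    """Choose the k candidates whose complexity is closest to the query's."""
    target = complexity(query)
    return sorted(candidates, key=lambda c: abs(complexity(c["input"]) - target))[:k]

pool = [
    {"input": "Good product", "label": "SIMPLE"},
    {"input": "Subject to Section 4.2(b), the indemnifying party shall hold harmless the indemnified party from all claims", "label": "INDEMNIFICATION"},
    {"input": "Except as otherwise provided herein, all warranties express or implied are hereby disclaimed by the seller", "label": "WARRANTY_DISCLAIMER"},
]
query = ("Notwithstanding the foregoing, in the event of force majeure as defined "
         "in Section 12.1, neither party shall be held liable for delays")
matched = select_by_complexity(query, pool, k=2)
```

The two long contractual candidates sit much closer to the query's complexity than the two-word review, so they are selected, matching the depth of reasoning the output should exhibit.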

When to Use Example Selection

Best when few-shot performance depends on demonstration quality

Perfect For

Classification at Scale

When processing thousands of inputs through a few-shot pipeline, automated example retrieval ensures every query gets the most relevant demonstrations rather than a static, one-size-fits-all set.

Domain-Specific Tasks

Legal, medical, financial, or technical tasks where generic examples fail — similarity-based selection ensures demonstrations share the vocabulary, structure, and conventions of the target domain.

Multi-Label or Many-Class Problems

When the output space has dozens of possible labels, diversity-based selection ensures the model sees the full category landscape instead of a biased subset.

Token-Constrained Environments

When context window limits force you to use fewer examples, strategic selection maximizes the signal from each demonstration — three curated examples can outperform ten random ones.

Skip It When

Zero-Shot Tasks

If the model performs well without any examples, adding a selection pipeline introduces unnecessary complexity. Test zero-shot performance first before investing in example curation.

Homogeneous Input Streams

When all your queries are nearly identical in structure and domain, a single static set of well-chosen examples works just as well as dynamic retrieval — the overhead of per-query selection is not justified.

Creative or Open-Ended Generation

For brainstorming, creative writing, or exploratory tasks where there is no single correct output format, rigid example selection can constrain the model rather than help it.

Use Cases

Where strategic example selection delivers measurable impact

Customer Intent Classification

Route support tickets to the correct department by selecting examples from similar product lines, complaint types, and language patterns for each incoming query.

Document Extraction

Extract structured data from invoices, contracts, or forms by selecting template examples that match the document layout, field structure, and formatting conventions of each input.

Medical Coding

Assign diagnosis or procedure codes to clinical notes by retrieving examples with similar symptoms, specialties, and coding patterns from a curated pool of annotated records.

Security Log Triage

Classify security alerts by severity and type by matching each new event against examples from similar attack vectors, network segments, and historical incident patterns.

Content Moderation

Detect policy violations in user-generated content by selecting boundary-case examples that teach the model to distinguish between acceptable and prohibited content in context-specific scenarios.

Financial Sentiment

Analyze earnings calls, analyst reports, and market commentary by selecting examples from the same sector, time period, and financial instrument type to capture domain-specific language nuances.

Where Example Selection Fits

From random demonstrations to intelligent retrieval-augmented selection

Few-Shot Learning (Random Examples): static demonstrations chosen by hand
Example Selection (Strategic Retrieval): similarity- and diversity-based curation
KNN Prompting (Nearest-Neighbor Lookup): embedding-based retrieval at inference
Vote-k Selection (Consensus Filtering): model-guided example quality scoring

Combine Similarity with Diversity

The most effective selection strategies blend both approaches. Start by retrieving a larger set of similar candidates (for example, the top 10 by embedding distance), then apply a diversity filter to ensure the final set of 3–5 examples covers different labels, formats, or edge cases. This hybrid approach captures the benefits of semantic relevance while avoiding the tunnel vision that pure similarity selection can create.
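The hybrid approach described above can be sketched in two stages: shortlist by similarity, then sweep the shortlist preferring unseen labels. The toy `embed` stands in for a real embedding model, and all names and records are illustrative:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hybrid_select(query, pool, shortlist=10, k=3):
    """Stage 1: shortlist the most similar candidates.
    Stage 2: sweep the shortlist, preferring unseen labels, so the final
    set keeps semantic relevance while covering more of the label space."""
    q = embed(query)
    ranked = sorted(pool, key=lambda c: cosine(q, embed(c["input"])), reverse=True)[:shortlist]
    chosen, seen = [], set()
    for cand in ranked:                      # first pass: new labels only
        if cand["label"] not in seen and len(chosen) < k:
            chosen.append(cand)
            seen.add(cand["label"])
    for cand in ranked:                      # second pass: fill remaining slots
        if cand not in chosen and len(chosen) < k:
            chosen.append(cand)
    return chosen

pool = [
    {"input": "wifi keeps dropping on my laptop", "label": "CONNECTIVITY"},
    {"input": "laptop wifi is slow after update", "label": "CONNECTIVITY"},
    {"input": "app crashes on my laptop at startup", "label": "SOFTWARE_BUG"},
    {"input": "great pizza place downtown", "label": "POSITIVE"},
]
selected = hybrid_select("my laptop wifi stopped working after the update", pool, shortlist=3, k=3)
```

The irrelevant review never makes the shortlist, the most similar candidate still leads, and the diversity sweep pulls in the software-bug example before backfilling with a second connectivity case.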

Optimize Your Examples

Apply these selection strategies to your own few-shot prompts, and measure the gains against a randomly selected baseline before committing to a retrieval pipeline.