The question of when to use prompt chaining versus a single carefully constructed prompt comes up in almost every project that moves past proof-of-concept. The default answer from many practitioners is "chain when tasks are complex" — but that's not precise enough to be useful. Complex by what measure?
We ran a structured comparison across five task categories to get a clearer picture. The results are more nuanced than the default guidance suggests, and some of them surprised us.
How we tested
For each task category, we created a set of 50 test cases with known correct outputs (or human-rated quality scores where "correct" isn't binary). We ran each test case with a single-shot prompt and with a two-to-four-step chain, then compared output quality, consistency across runs, and failure rate. Models tested: GPT-4o and Claude 3.5 Sonnet. Both showed similar patterns, so results are reported as averages across the two models.
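To make the three measures concrete, here is a minimal sketch of how they could be computed from repeated runs. It is not the harness used for this comparison: the function name `summarize_runs`, the 0-to-1 score scale, the 0.5 failure threshold, and the use of per-case standard deviation as the consistency measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def summarize_runs(scores_by_case, fail_threshold=0.5):
    """Summarize repeated runs of one prompting strategy.

    scores_by_case: list of lists; each inner list holds the quality
    scores (assumed to be on a 0..1 scale) from repeated runs of one test case.
    fail_threshold: hypothetical cutoff below which a run counts as a failure.
    """
    all_scores = [s for runs in scores_by_case for s in runs]
    # Output quality: mean score over every run of every case.
    quality = mean(all_scores)
    # Consistency: average per-case standard deviation (lower is more consistent).
    spread = mean([pstdev(runs) for runs in scores_by_case])
    # Failure rate: share of runs scoring below the threshold.
    failure_rate = sum(s < fail_threshold for s in all_scores) / len(all_scores)
    return {"quality": quality, "spread": spread, "failure_rate": failure_rate}
```

Running this once on the single-shot scores and once on the chain scores gives two summaries that can be compared side by side for each task category.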
Task category results
Document summarization (long inputs)
Single-shot won here, with meaningfully better quality scores and lower variance. For summarization tasks, the model's ability to maintain awareness of the full document in a single pass outweighed any benefits from decomposition. Chains tended to lose contextual connections that single-pass summaries preserved.
Exception: documents over approximately 80,000 tokens where context limits force decomposition regardless.
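When a document is too long for a single pass, the decomposition usually starts with splitting it into pieces that fit the budget. The sketch below shows one simple way to do that on paragraph boundaries. The function name `split_into_chunks` is hypothetical, and the default token counter (a whitespace word count) is only a crude stand-in for a real tokenizer.

```python
def split_into_chunks(text, max_tokens=80_000, count_tokens=lambda s: len(s.split())):
    """Group paragraphs into chunks that stay within max_tokens.

    count_tokens is an assumption: a whitespace word count, which only
    approximates what a model tokenizer would report.
    A single paragraph larger than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        n = count_tokens(paragraph)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk would then be summarized separately and the partial summaries combined in a final step, which is where the lost contextual connections described above tend to show up.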
Structured data extraction
Chains won by a meaningful margin. Breaking extraction into sub-tasks — first identify relevant sections, then extract specific fields, then validate against expected formats — produced substantially lower error rates than asking the model to do all three simultaneously. The validation step in particular caught errors that single-shot consistently missed.
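The three sub-tasks can be sketched as a small chain. This is an illustrative outline, not the prompts or code from our tests: `call_model` stands for whatever function sends a prompt to your model and returns text, the invoice fields and their formats are made up, and the validation step here is a deterministic regex check, which is only one way to validate.

```python
import json
import re
from typing import Callable

# Hypothetical fields and formats, chosen only for illustration.
EXPECTED_FORMATS = {
    "invoice_id": re.compile(r"INV-\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def extract_with_chain(document: str, call_model: Callable[[str], str]) -> dict:
    # Step 1: narrow the input to the relevant sections.
    sections = call_model(
        "Quote only the passages that mention an invoice ID or a date:\n\n" + document
    )
    # Step 2: extract the specific fields from those sections as JSON.
    raw = call_model(
        "Return JSON with keys invoice_id and date, extracted from:\n\n" + sections
    )
    fields = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    # Step 3: validate each field against its expected format.
    errors = [
        name
        for name, pattern in EXPECTED_FORMATS.items()
        if not pattern.fullmatch(str(fields.get(name, "")))
    ]
    return {"fields": fields, "errors": errors}
```

A non-empty `errors` list is the signal to retry, escalate to a human, or reject the record, rather than passing a malformed value downstream.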
Multi-criteria evaluation and scoring
Chains won again. When evaluation requires applying multiple independent criteria, having the model focus on one criterion at a time produced more consistent scoring than evaluating all criteria simultaneously. Single-shot evaluation showed a systematic bias in which certain criteria dominated others.
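One-criterion-at-a-time scoring can be sketched as a loop with one model call per criterion. The criteria names, the 1 to 5 scale, the prompt wording, and the unweighted average are all assumptions for illustration; `call_model` is a placeholder for your own model client.

```python
from typing import Callable

# Hypothetical criteria; substitute the ones from your own rubric.
CRITERIA = ["accuracy", "clarity", "completeness"]

def score_per_criterion(text: str, call_model: Callable[[str], str]):
    scores = {}
    for criterion in CRITERIA:
        # Each call sees exactly one criterion, so no criterion can crowd out another.
        reply = call_model(
            f"Rate the following text for {criterion} only, on a 1-5 scale. "
            f"Reply with a single integer.\n\n{text}"
        )
        scores[criterion] = int(reply.strip())  # raises ValueError on a non-integer reply
    # Unweighted mean as the overall score (an assumption; weights may suit your rubric better).
    overall = sum(scores.values()) / len(scores)
    return scores, overall
```

Keeping the per-criterion scores alongside the overall number also makes it easier to spot when one criterion is drifting between runs.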
Creative generation with constraints
Mixed results, task-dependent. For tasks with many hard constraints (format, tone, required elements, exclusions), chains helped by separating constraint compliance from content generation. For tasks with few constraints, single-shot was generally sufficient and faster.
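Separating constraint compliance from content generation might look like the sketch below: generate a draft, check the hard constraints in code, and ask for a revision only if something is violated. The specific constraints (word limit, required phrases, excluded terms), the function names, and the prompt wording are hypothetical, and `call_model` is a placeholder for your own model client.

```python
from typing import Callable, List, Tuple

def check_constraints(draft: str, max_words: int, required: List[str], banned: List[str]) -> List[str]:
    """Return a list of human-readable constraint violations (empty if compliant)."""
    violations = []
    if len(draft.split()) > max_words:
        violations.append(f"longer than {max_words} words")
    lowered = draft.lower()
    violations += [f"missing required element: {r}" for r in required if r.lower() not in lowered]
    violations += [f"contains excluded term: {b}" for b in banned if b.lower() in lowered]
    return violations

def generate_with_constraints(
    brief: str,
    call_model: Callable[[str], str],
    max_words: int,
    required: List[str],
    banned: List[str],
    max_revisions: int = 2,
) -> Tuple[str, List[str]]:
    # Step 1: content generation, with no constraint checking yet.
    draft = call_model(f"Write: {brief}")
    # Step 2: compliance pass, repeated up to max_revisions times.
    for _ in range(max_revisions):
        violations = check_constraints(draft, max_words, required, banned)
        if not violations:
            break
        draft = call_model(
            "Revise the draft to fix these problems without changing anything else:\n"
            + "\n".join(violations)
            + f"\n\nDraft:\n{draft}"
        )
    # Return the final draft plus any violations that remain.
    return draft, check_constraints(draft, max_words, required, banned)
```

With few constraints the loop usually exits on the first check, which matches the observation that single-shot is often sufficient in that case.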
Code generation and review
Single-shot performed better for standard generation tasks. For greenfield generation in particular, decomposition tended to produce inconsistent interfaces between chain steps. Chains added value for complex refactoring tasks, where separating "identify issues" from "generate fixes" produced more reliable results.
The cost dimension
Chains cost more — generally 2x to 4x the token count of single-shot equivalents. For most production applications at volume, that's a meaningful consideration. Chains are worth the cost when they produce materially better outputs in your specific use case. They're not worth it when the improvement is marginal or when the failure modes they introduce (context loss between steps, error accumulation) outweigh the benefits.
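A back-of-the-envelope calculation shows how the multiplier plays out at volume. Every number in the usage lines is hypothetical, including the price, and the function assumes a single blended per-token rate rather than separate input and output pricing.

```python
def monthly_cost(requests_per_month, tokens_per_request, price_per_million_tokens, chain_multiplier=1.0):
    """Estimate monthly spend in the same currency as price_per_million_tokens.

    chain_multiplier scales the single-shot token count (1.0 means single-shot).
    """
    tokens = requests_per_month * tokens_per_request * chain_multiplier
    return tokens / 1_000_000 * price_per_million_tokens

# Hypothetical workload: 1M requests/month, 2,000 tokens each, 5.00 per million tokens.
single_shot = monthly_cost(1_000_000, 2_000, 5.0)
chained = monthly_cost(1_000_000, 2_000, 5.0, chain_multiplier=3.0)
print(single_shot, chained)  # 10000.0 30000.0
```

Under these assumed numbers, a 3x chain adds 20,000 per month, which is the figure the quality improvement has to justify.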
A chain where each step validates the previous one's output is more expensive than single-shot but substantially more reliable. A chain that just breaks a task into pieces without validation gains less than it costs.
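The difference between the two kinds of chain can be sketched as a runner that checks every step's output before passing it on. The function name `run_validated_chain` and the step structure are assumptions for illustration; in practice each `run` would wrap a model call and each `validate` could be a format check, a schema check, or a critique prompt.

```python
def run_validated_chain(initial_input, steps):
    """Run steps in order, validating each output before the next step sees it.

    steps: list of (name, run, validate) tuples, where run(value) returns the
    step's output and validate(output) returns True if it is acceptable.
    """
    value = initial_input
    for name, run, validate in steps:
        value = run(value)
        # Stop at the first bad output instead of letting the error accumulate.
        if not validate(value):
            raise ValueError(f"step {name!r} produced invalid output")
    return value
```

Failing fast like this costs extra checks on every run, but it turns silent error accumulation into a visible, attributable failure at a specific step.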
Practical recommendations
Start single-shot. Measure output quality carefully. Introduce chaining when you have evidence that single-shot is producing specific, identifiable failures that decomposition would address. Don't chain by default because "the task is complex."
When you do chain, include a validation or critique step before accepting final output. Chains without validation layers can accumulate errors in ways that are harder to detect than single-shot failures.
The evaluation framework we used for this comparison is documented separately — ask us if you want the test cases or scoring rubrics.