
Eat Your Own Dog Food

Daniel Lopes

Someone dropped this in our engineering Slack: "Does anyone have a good guide for which models to use when? Like, I'm scanning a web page and want to get a summary, classify it against our taxonomy—how do you actually choose? Or should I just ask ChatGPT to help me pick?"
If you're asking the same question, you're not alone. We're all making model choices based on gut feelings, outdated benchmarks, or whatever we used last time.
This guide is based on a mix of current benchmark data, official model documentation, and our own experience at GrowthX where we process ~10 billion tokens per month across Anthropic and OpenAI combined.
Our take: At GrowthX, we constantly switch models as the landscape shifts. Benchmarks don't tell the full story—real-world performance varies dramatically by use case. Claude 4.5 Sonnet dominates our SVG generation workflows and excels in all our agents that use LLM-as-a-judge, interpreting ambiguous feedback and exiting loops faster. It also has better "taste" for writing quality. GPT-5 with Codex, on the flip side, excels at generating polished designs with Shadcn/Tailwind—it has real "taste" for UI design. With Claude 4.5 Haiku's launch, we're testing the switch to Anthropic for more workflows where speed and cost matter.
Why Haiku 4.5 might change the game: What was frontier performance five months ago (Claude Sonnet 4) is now available at 1/3 the cost and 2x the speed. With 73.3% on SWE-bench at $1/$5 per million tokens, it beats GPT-4o (30.8%) and approaches Claude 4.5 Sonnet (77.2%) and GPT-5 (74.9%) at a fraction of the latency and price.
Skip the details? Here's the TL;DR:
| Your Use Case | Recommended Model | Why |
|---|---|---|
| High-volume, cost-sensitive tasks | GPT-4o-mini | $0.15/$0.60 per 1M tokens crushes competition on price |
| API code generation (simple) | Claude 4.5 Haiku | $1/$5 per 1M tokens, 73.3% SWE-bench, blazing fast |
| API code generation (complex) | Claude 4.5 Sonnet | 77.2% SWE-bench (82% with test-time compute)—highest score available |
| Agentic workflows & automation | Claude 4.5 Sonnet | #1 for command-line, GUI automation, multi-tool orchestration |
| Real-time chatbots | Claude 4.5 Haiku | Near-frontier performance, 2-4x faster than Sonnet |
| Structured output/JSON | GPT-5 | 100% schema compliance guaranteed |
| Large document processing | Claude 4.5 Sonnet | 200K standard, 1M beta context window |
| Batch processing | GPT-4o-mini or Claude 4.5 Sonnet | Both offer 50% batch discounts |
Critical gotchas:
Here's what you actually need to know as of October 2025:
Late 2025 gives you a handful of production-ready models through APIs. Each has distinct strengths and costs. You're choosing between proven options with established benchmarks.
What's available:
The tradeoff that matters:
Every model selection boils down to cost versus performance versus speed.
But here's what changed in 2025: according to Stanford HAI, top-tier models now perform within 0.7% of each other on standard benchmarks. Down from 4.9% in 2023.
This convergence means implementation factors matter more than marginal performance differences.
Stop obsessing over which model scores 2% higher on standardized tests. Focus on pricing, rate limits, API reliability, and how well each model integrates with your existing infrastructure.
Benchmarks tell you one story. Production tells you another.
Researchers test coding ability by having models fix real bugs and add features from actual GitHub projects using SWE-bench Verified. Here's what the numbers say:
But here's where it gets interesting.
A controlled study by METR research found that experienced developers saw a 19% productivity slowdown when using early-2025 AI tools. This study has been widely shared, but we strongly disagree with its conclusions.
Our experience: Developer productivity with AI is tightly related to how well engineers understand LLMs and their tooling of choice (Claude Code, Cursor, Codex, etc). Within our own organization, the delta between engineers ramping up on the LLM stack (prompt engineering, deep understanding of tooling like creating sub-agents in Claude Code, etc.) versus fully ramped engineers is multiples of productivity—not marginal gains. This is another example where papers and benchmarks don't paint the full picture at all.
The reality: Benchmarks tell one story. Your workflow and expertise tell another.
You're choosing a model to power your IDE assistant—the thing that's writing code alongside you, managing files, running commands.
Winner: Claude 3.5 Sonnet
Why? It's the most admired model among developers (51.2% preference) despite not leading on benchmarks. It excels at agentic coding and tool orchestration—the stuff that matters when you're actually building features.
Why not GPT-5? It wins on raw benchmark scores but struggles with variable latency ("extremely slow compared to 4.1 or 4o") and prompt sensitivity issues (simple prompts flagged as policy violations).
For an IDE assistant, responsiveness and reliability beat raw performance.
You're calling an LLM API to generate code programmatically—autocomplete, code review, test generation, refactoring suggestions.
For routine completion and debugging:
For complex refactoring or architectural work:
For multi-step feature building:
Language-specific considerations:
Headline benchmark numbers: modern models achieve impressive scores on mathematical reasoning tasks.
The practical story: more nuanced.
Multi-step problem solving depends on how you structure your requests:
That trade-off between consistency and peak capability? It matters more than absolute performance numbers in production.
For summarization, classification, and extraction—the bread-and-butter tasks most applications need—the model landscape has commoditized. Performance differences for typical document processing are minimal across major providers.
But implementation factors determine your actual costs and performance. Here's what matters:
For batch document processing:
For large context processing:
For structured output reliability:
For speed:
Bottom line for text processing: Use GPT-4o-mini for cost-sensitive volume work. Use Claude 4.5 Sonnet when you need large context windows or prompt caching benefits.
Structured output: getting the model to return data in a specific format like JSON that your application can reliably process.
Function calling: having the model automatically use tools or APIs based on conversation context.
Newer models promise improved capabilities. Reality depends on your specific schema requirements.
Simple schemas work reliably across all models:
```json
{
  "sentiment": "positive",
  "confidence": 0.95,
  "categories": ["product", "feedback"]
}
```

Complex schemas expose limitations fast:
```json
{
  "items": [
    {
      "id": "string",
      "metadata": {
        "tags": ["required", "array"],
        "conditionalField": "only if type=premium"
      }
    }
  ],
  "additionalProperties": false // Strict validation
}
```

Issues with complex schemas:
`additionalProperties: false` is often violated.

Error handling: Some models fail gracefully with validation errors. Others produce malformed JSON that crashes your parser.
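Whichever model you pick, don't let its output reach downstream code unvalidated. Here's a minimal defensive-parsing sketch in Python, using the `jsonschema` package and a schema matching the simple example above; malformed JSON and schema violations both become a `None` the caller can retry on or escalate:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string"},
        "confidence": {"type": "number"},
        "categories": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}

def parse_model_output(raw: str) -> dict | None:
    """Return the parsed object, or None so the caller can retry or escalate."""
    try:
        data = json.loads(raw)    # malformed JSON is caught here, not in your pipeline
        validate(data, SCHEMA)    # extra keys, wrong types, missing fields are caught here
        return data
    except (json.JSONDecodeError, ValidationError):
        return None
```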
Automated tool usage reliability varies dramatically between providers. Some excel at using multiple tools simultaneously but struggle with error handling. Others provide rock-solid single tool usage but can't coordinate multiple systems.
Test with your specific use cases. Don't rely on general claims.
Building agents that can use tools, navigate interfaces, and complete multi-step tasks autonomously? This is where model differences matter most.
Claude 4.5 Sonnet dominates agentic use cases:
Why Claude wins for agents:
The research shows Claude maintains consistent performance across different problem-solving approaches without requiring special configurations. GPT-5 has variable latency depending on reasoning depth—unpredictable for autonomous workflows.
When to use what:
Critical consideration: Developer reports note that "Claude Sonnet 4.5 is sensitive to prompt structure" for agent mode. Set the high-level contract once, then keep individual task prompts short.
Performance capabilities matter. But speed determines user experience.
Time-to-First-Token (TTFT)—the delay before a model starts generating output:
When speed matters, the choice is clear-cut. Interactive applications with humans in the loop need sub-second response times. Background batch processing can tolerate higher latency for better quality or lower costs.
Real-time applications benefit from cascading: start with fast, cheap models for initial filtering. Use expensive models only when necessary.
Infrastructure optimization varies by provider. Some models benefit significantly from caching. Others show minimal improvement. Streaming responses can dramatically improve perceived performance—but implementation complexity varies across APIs.
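As one illustration, here's a minimal streaming sketch with the OpenAI Python SDK (the model name and prompt are placeholders); the Anthropic SDK offers an equivalent streaming mode with a different event shape:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream tokens as they arrive so the user sees output immediately,
# even though total generation time is unchanged.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you selected
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```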
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-5 | $10.00 | $30.00 | 128K |
| Claude 4.5 Sonnet | $3.00 | $15.00 | 200K |
| GPT-4o | $2.50 | $10.00 | 128K |
| Claude 4.5 Haiku | $1.00 | $5.00 | 200K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
Optimization features can dramatically reduce costs.
Prompt caching: Claude gives you 90% cost reduction on cached content at $0.30 per million tokens. Highly effective for applications that reuse system prompts or reference materials.
Batch processing: Non-real-time use cases achieve 50% cost reductions across all major providers. Essential for high-volume background tasks like document analysis or data processing pipelines.
Efficient context management: Smart use of extended context is often more cost-effective than making multiple API calls—especially when you need the model to synthesize information across large documents.
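To make the prompt-caching point above concrete, here's a minimal sketch with the Anthropic Python SDK; the model ID and `LONG_STYLE_GUIDE` are placeholders, and the cache-control syntax should be checked against Anthropic's current docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STYLE_GUIDE = "..."  # large, rarely-changing reference material reused on every call

# Marking the big system block as cacheable lets repeated calls read it from
# cache at the discounted rate instead of paying full input price each time.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID; check the current docs
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STYLE_GUIDE,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Classify this page against our taxonomy: ..."}],
)
print(response.content[0].text)
```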
Processing 10,000 support tickets monthly (500 input tokens, 200 output tokens):
Daily chatbot with 5,000 users (1,000 input tokens, 300 output tokens):
These assume no optimization. With prompt caching and batch processing, costs drop 50-90%.
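Back-of-the-envelope estimates like these are easy to script. Here's a minimal sketch using the per-1M-token rates from the pricing table above, with no caching or batch discounts applied:

```python
# (input $/1M tokens, output $/1M tokens) taken from the pricing table in this guide
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-4.5-haiku": (1.00, 5.00),
    "claude-4.5-sonnet": (3.00, 15.00),
}

def monthly_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated monthly spend in dollars for one workload on one model."""
    input_rate, output_rate = PRICES[model]
    return requests * (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# The support-ticket scenario: 10,000 requests at 500 input / 200 output tokens each.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10_000, 500, 200):,.2f} per month")
```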
Sticker price tells part of the story. Factor in:
A context window determines how much text a model can process at once.
Maximum capabilities:
Real-world considerations matter more than theoretical limits:
For most applications, structured approaches beat massive context windows.
Retrieval Augmented Generation (RAG)—having a research assistant find relevant information first, then using that to generate responses—proves more reliable than stuffing everything into a massive context window.
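A bare-bones sketch of that retrieval step, with `embed` and `ask` as placeholders for your own embedding and chat wrappers (in production you would pre-compute chunk embeddings and use a vector store rather than embedding per query):

```python
import numpy as np

def top_k_chunks(question: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity to the question and keep the top k."""
    q = np.asarray(embed(question), dtype=float)
    scored = []
    for chunk in chunks:
        v = np.asarray(embed(chunk), dtype=float)  # pre-compute and store these in practice
        score = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((score, chunk))
    return [chunk for _, chunk in sorted(scored, reverse=True)[:k]]

def answer_with_rag(question: str, chunks: list[str], embed, ask) -> str:
    """`ask(prompt)` is a placeholder for your chat-completion wrapper."""
    context = "\n\n".join(top_k_chunks(question, chunks, embed))
    return ask(f"Answer using only this context:\n\n{context}\n\nQuestion: {question}")
```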
When to use what:
This architectural choice impacts both performance and cost at scale.
Theoretical capabilities diverge from practical reliability. Models claim support for complex automated tool usage and structured output. Real-world performance varies significantly.
JSON Schema Mode support is inconsistent. GPT-5 introduces native JSON Schema Mode with guaranteed 100% schema compliance. Claude 4.5 models demonstrate high reliability for structured output with 0% error rate on internal benchmarks.
Test thoroughly with your specific schema requirements.
Using multiple tools simultaneously adds complexity that introduces failures. Many applications achieve better reliability with sequential tool usage—even if it increases latency.
Rate limiting matters for production. The tier-based systems mean spending more money literally buys you higher limits.
OpenAI Tiers (based on cumulative spending):
Anthropic limits are more conservative but offer batch processing discounts that offset restrictions for non-real-time applications.
How to actually choose:
Processing thousands of requests daily with tight budget constraints? GPT-4o-mini is the practical choice. The performance differences don't justify the dramatic cost increase for high-volume scenarios.
Optimization strategies:
Simple completion and debugging? GPT-4o-mini provides excellent value.
Complex architectural decisions or need the highest coding performance? Claude 4.5 Sonnet leads with 77.2% SWE-bench score (82% with test-time compute), beating GPT-5's 74.9%—and without the latency issues.
By task:
For most teams, Claude 4.5 Sonnet provides the best balance of performance, speed, and cost for complex coding tasks.
Running multi-step analytical workflows where errors compound? Premium models like GPT-5 or Claude 4.5 Sonnet typically justify their cost.
But don't default to expensive models automatically. Test whether cheaper models with better prompting achieve similar results.
When premium models are essential:
Critical trade-off for GPT-5: Superior reasoning capability comes with variable latency depending on reasoning depth. For batch analytical jobs this is acceptable. For interactive analysis, Claude 4.5 Sonnet's predictable response times (~1.98s TTFT) may be more practical despite lower peak performance.
"Real-time" means different things depending on your use case. Here's what actually matters:
Users expect natural conversation flow. 1-2 second delays are acceptable.
Recommended:
Users notice delays above half a second. Every 100ms matters.
Recommended:
Code editors, document collaboration, design tools. Users tolerate brief waits for intelligent suggestions.
Recommended:
Real-time gaming, live event processing, high-frequency trading.
Reality check: Current LLMs don't meet these requirements reliably. Consider:
Sometimes the best model choice is no model at all. LLMs aren't the right solution when:
Deterministic logic required:
Latency is critical (<100ms):
Compliance and auditability:
Extreme volume with tight budgets:
Data privacy constraints:
When simple solutions work:
Standard benchmarks test isolated capabilities under ideal conditions. Production faces different challenges.
Prompt sensitivity varies dramatically between models. Some require extensive prompt engineering to achieve consistent results. Others work reliably with simple, direct instructions.
Consistency versus peak performance: Some models deliver exceptional results on their best attempts but struggle with reliability. Others provide "good enough" results consistently.
Community consensus:
Model comparison pipelines require methodical approaches:
Evaluate models across accuracy, cost per request, average latency, and error rates. Don't rely solely on published benchmarks.
Implement routing logic: simple requests to cost-effective models, complex requests to premium models. This cascading approach works well for content moderation pipelines, customer service escalation, and document processing workflows.
Implementation patterns: Try cheaper models first, escalate to premium models when confidence scores fall below acceptable thresholds.
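A minimal sketch of that escalation pattern; the `classify` wrapper and its confidence field are assumptions about your own stack:

```python
from typing import Callable

CHEAP_MODEL = "gpt-4o-mini"
PREMIUM_MODEL = "claude-4.5-sonnet"
CONFIDENCE_THRESHOLD = 0.8  # tune against your own evaluation set

def classify_with_cascade(text: str, classify: Callable[[str, str], dict]) -> dict:
    """Try the cheap model first; escalate only when its confidence is low.

    `classify(model, text)` is a placeholder for your own API wrapper and is
    assumed to return a dict like {"label": ..., "confidence": float}.
    """
    result = classify(CHEAP_MODEL, text)
    if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
        return result
    return classify(PREMIUM_MODEL, text)  # pay for the premium model only when needed
```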
Essential metrics go beyond simple success rates:
Detecting degradation: Establish baselines and alert thresholds. Model performance can change with provider updates. Monitor continuously:
The model selection decisions in this guide assume you have proper observability. Without tracing and evaluation systems, you can't know if your model choice is actually working.
What proper observability looks like:
This is a deep topic that deserves its own detailed guide.
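Even without a full observability stack, a simple windowed check against a baseline catches the worst regressions. A minimal sketch, assuming you already log per-request success and latency (the baseline numbers are placeholders):

```python
from statistics import mean

# Placeholders; derive your baselines from a historical window of production traffic.
BASELINE_ERROR_RATE = 0.02
BASELINE_MEAN_LATENCY_S = 2.0

def check_window(results: list[dict]) -> list[str]:
    """Return alert messages for a window of {"ok": bool, "latency_s": float} records."""
    alerts = []
    error_rate = sum(1 for r in results if not r["ok"]) / len(results)
    if error_rate > 2 * BASELINE_ERROR_RATE:
        alerts.append(f"error rate {error_rate:.1%} vs baseline {BASELINE_ERROR_RATE:.1%}")
    mean_latency = mean(r["latency_s"] for r in results)
    if mean_latency > 1.5 * BASELINE_MEAN_LATENCY_S:
        alerts.append(f"mean latency {mean_latency:.2f}s vs baseline {BASELINE_MEAN_LATENCY_S:.2f}s")
    return alerts
```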
Premature optimization: Many teams choose expensive models before understanding their actual requirements. Start with baseline models. Upgrade based on measured needs.
Ignoring compound latency: Multi-step workflows can make fast models seem slow when chained together.
Underestimating context needs: Often leads to model changes mid-development when extended context becomes necessary.
API reliability varies between providers. Some handle rate limiting gracefully. Others fail hard. Build appropriate retry logic and fallback strategies.
Version management: Critical as models update. Pin specific versions for production stability.
Error handling: Include exponential backoff for rate limits and fallback model calls for persistent failures.
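A minimal, provider-agnostic retry sketch; in real code, narrow the exception handling to your SDK's rate-limit and transient error types:

```python
import random
import time

def call_with_backoff(call, max_retries: int = 5):
    """Retry `call()` with exponential backoff plus jitter.

    `call` is any zero-argument function that hits your provider's API.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # persistent failure: surface it, or fall back to another model
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter
```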
Context window traps: Processing large documents inefficiently can dramatically increase costs.
Development versus production: Usage patterns differ significantly. Budget for testing, experimentation, and gradual rollout phases.
The AI model landscape evolves rapidly.
Version management should balance stability with improvement opportunities:
Track updates across multiple sources:
1. Should I start with the most powerful/expensive model?
No. Start with cost-effective baseline models like GPT-4o-mini. Upgrade only when specific limitations impact your application. Performance convergence means expensive models often provide minimal benefit for typical tasks.
2. How do I benchmark for my specific use case?
Create evaluation datasets from your actual data. Generic benchmarks rarely predict performance on your specific problem domain. Test with representative prompts, edge cases, and failure scenarios.
3. When is a cheaper model "good enough"?
When cost savings outweigh quality improvements. When downstream processes can handle occasional errors. Many applications benefit more from processing 10x more data with a cheaper model than perfect results on limited data.
4. Should I use different models for different tasks?
Yes—when complexity and volume justify the implementation overhead. Simple routing logic can optimize cost-performance across different request types.
5. How do I handle model inconsistency?
Implement retry logic with different prompts. Use majority voting for critical decisions. Cascade to more reliable models for inconsistent results.
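A minimal sketch of the majority-voting idea, with `ask` standing in for your own API wrapper:

```python
from collections import Counter

def majority_vote(ask, prompt: str, n: int = 3) -> str:
    """Ask the model n times and keep the most common answer.

    `ask(prompt)` is a placeholder for your own API call returning a string;
    only worth the extra cost when a single wrong answer is expensive.
    """
    answers = [ask(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```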
6. Can I mix providers in one application?
Yes, but consider the operational complexity. Different APIs, rate limits, and pricing structures increase implementation and monitoring overhead.
The explosion of AI model options creates both opportunities and complexity.
The most successful implementations focus on practical engineering principles. Not chasing benchmark leaderboards.
Three principles:
1) Start with baselines, not optimization. Choose cost-effective models that meet your basic requirements. Optimize based on real usage patterns and measured limitations.
2) Let real data drive decisions. Production constraints often matter more than benchmarks. Rate limits, latency requirements, and integration complexity determine success more than marginal accuracy improvements.
3) Build adaptable systems. Don't bet everything on today's leader. The AI model landscape will continue evolving rapidly. Build systems that can adapt to new models and providers.
The best model choice delivers value consistently while maintaining flexibility for tomorrow's opportunities.