Pick the wrong AI model and you will spend three months rebuilding something that should have worked on day one. That is not a scare tactic. A 2022 Gartner survey found that 85% of AI projects fail to reach production, and the leading cause is choosing a model that does not fit the problem. The second leading cause is underestimating the infrastructure to run it.
Most founders treat model selection like picking a SaaS tool: read a few comparison articles, go with the name they recognize, and hope it works. That approach fails because AI models are not interchangeable. A model built for generating text cannot detect fraud. A model that classifies images cannot summarize contracts. And a model that works beautifully in a demo notebook can collapse under real traffic.
This article is the decision framework Timespade uses with clients who want to add AI to their products without wasting months on the wrong approach.
What categories of AI models exist and what are they good at?
AI models fall into a handful of families, and each family solves a fundamentally different type of problem. Mixing them up is the most common mistake founders make.
Language models process text. They power chatbots, summarization tools, translation features, and search systems that understand meaning rather than just matching words. OpenAI's GPT-3 and GPT-3.5 are the most recognized examples as of late 2022, but open alternatives like BLOOM (176 billion parameters, released by BigScience in mid-2022) and Meta's OPT models exist as well.
Computer vision models process images and video. They handle everything from scanning receipts to detecting defective products on a factory line to reading license plates. According to Grand View Research (2022), the computer vision market hit $14.1 billion in 2022 and is growing at 19.6% annually.
Predictive models work with structured, tabular data: spreadsheets, databases, transaction logs. They forecast demand, flag fraudulent transactions, predict which customers will cancel, and recommend products. These are the workhorses of business AI and have been in production far longer than language or vision models.
Speech models convert between audio and text. They power voice assistants, call transcription systems, and accessibility features.
| Model family | What it processes | Common product uses | Example models (late 2022) |
|---|---|---|---|
| Language models | Text | Chatbots, summarization, search, content generation | GPT-3.5, BLOOM, Cohere, AI21 Jurassic-1 |
| Vision models | Images, video | Receipt scanning, defect detection, medical imaging | ResNet, YOLO, Vision Transformer (ViT) |
| Predictive models | Structured/tabular data | Churn prediction, demand forecasting, fraud detection | XGBoost, LightGBM, CatBoost, custom neural nets |
| Speech models | Audio | Voice assistants, transcription, accessibility | Whisper (OpenAI), DeepSpeech |
| Multimodal models | Text + images together | Visual Q&A, image captioning | CLIP, Flamingo |
A 2022 McKinsey survey found that the most adopted AI capability across industries was still predictive analytics at 41% adoption, followed by natural language processing at 35% and computer vision at 32%. Generative models like GPT-3.5 are newer and getting attention, but the majority of revenue-generating AI in production today runs on prediction.
How does a model's architecture determine what tasks it can perform?
Think of architecture as the internal wiring of a model. Two models can have the same number of parameters and completely different capabilities because they are wired differently.
Transformer-based models, the architecture behind GPT-3 and BERT, are built to process sequences where the order of items matters. That makes them strong at language and increasingly at images (through Vision Transformers). Convolutional architectures excel at spatial patterns, which is why they dominated image processing for a decade. Recurrent architectures process sequences one step at a time, which makes them useful for time-series data like stock prices or sensor readings, though transformers are replacing them here too.
What matters to you as a founder is not the architecture name but the practical consequence. If your product needs to understand a paragraph of text, you need a transformer-based language model. If your product needs to identify objects in photographs, you likely need a convolutional model or a Vision Transformer. If your product needs to predict next month's sales from two years of historical data, a gradient-boosted tree model (like XGBoost) will probably outperform a neural network while being cheaper and faster.
Stanford's 2022 AI Index report measured that transformer models consumed 6x more compute than convolutional models for equivalent image classification accuracy. That compute gap translates directly to your server bill. Architecture is not an abstract engineering decision. It is a cost decision.
What tradeoffs come with larger vs. smaller models?
Bigger is not always better. This is counterintuitive because the AI headlines in late 2022 are all about scale: 175 billion parameters, 540 billion parameters, trillion-token training sets. But size creates real tradeoffs that hit your budget and your users.
Larger models are slower. GPT-3 with 175 billion parameters takes roughly 350 milliseconds per API call for a short completion (OpenAI benchmarks, 2022). A fine-tuned DistilBERT model with 66 million parameters answers the same classification question in under 10 milliseconds. That 35x speed difference determines whether your user interface feels instant or sluggish.
Larger models are more expensive to run. Estimates from AI infrastructure provider Anyscale (2022) put the inference cost of a 175-billion-parameter model at roughly $0.02-$0.06 per 1,000 tokens. A small fine-tuned model running on a single GPU costs a fraction of a cent. At 10 million API calls per month, that difference is the gap between a $5,000 bill and a $200 bill.
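The gap in the bill is simple arithmetic. Here is a minimal sketch of that cost comparison, assuming roughly 25 tokens per short request at $0.02 per 1,000 tokens for the large hosted model and a flat $200 GPU rental for the small one; all numbers are illustrative assumptions, not vendor quotes.

```python
# Back-of-envelope inference cost comparison.
# All numbers are illustrative assumptions, not vendor quotes.

def api_monthly_cost(calls_per_month: int, tokens_per_call: int,
                     price_per_1k_tokens: float) -> float:
    """Usage-based cost of a hosted large-model API."""
    total_tokens = calls_per_month * tokens_per_call
    return total_tokens / 1000 * price_per_1k_tokens

# 10M short requests per month at ~25 tokens each, $0.02 per 1k tokens
large_model_bill = api_monthly_cost(10_000_000, 25, 0.02)  # -> 5000.0
small_model_bill = 200.0  # flat monthly rent for one mid-range GPU

print(f"large model API: ${large_model_bill:,.0f}/month")
print(f"self-hosted small model: ${small_model_bill:,.0f}/month")
```

Run the same function with your own traffic estimates before committing to either path; token counts per request vary widely by feature.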
Larger models are more general. A 175-billion-parameter model can write poetry, answer trivia, translate languages, and summarize legal documents. A small model fine-tuned on your specific task often outperforms the large model on that one task while failing at everything else.
| Factor | Large model (100B+ params) | Small fine-tuned model (<1B params) |
|---|---|---|
| Latency per request | 200-500ms | 5-20ms |
| Monthly inference cost at scale | $3,000-$8,000 | $100-$500 |
| Accuracy on general tasks | High | Low |
| Accuracy on your specific task | Good | Often equal or better |
| Setup effort | Low (API call) | Medium (requires training data) |
| Data privacy | Data leaves your servers | Runs on your own infrastructure |
Scaling research has repeatedly found that smaller models trained on high-quality, domain-specific data can match models many times their size on narrow tasks. The practical lesson: if your product does one thing with AI, you probably do not need the biggest model available.
How do I benchmark models against my specific use case?
Public benchmark scores are almost useless for product decisions. A model that tops the SuperGLUE leaderboard might perform poorly on your customer support tickets because your data looks nothing like the benchmark dataset. The only benchmark that matters is how the model performs on your data, with your users, under your latency constraints.
Here is the evaluation process Timespade runs for clients building AI features.
Collect 200-500 real examples from your domain. These are actual inputs your product will receive: customer messages, product images, transaction records, whatever the model will process. Label them with the correct output. If you are building a support ticket classifier, that means 200-500 real tickets labeled with the right category. This dataset is your ground truth.
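One lightweight way to store that ground truth is a JSON Lines file, one labeled example per line. The ticket categories below are hypothetical placeholders for whatever labels your product needs.

```python
import json

# A ground-truth dataset is just real inputs paired with correct labels.
# The categories here are hypothetical examples for a support-ticket classifier.
examples = [
    {"input": "I was charged twice for my subscription", "label": "billing"},
    {"input": "The app crashes when I upload a photo",   "label": "bug"},
    {"input": "How do I invite a teammate?",             "label": "how_to"},
]

# One JSON object per line: easy to append to, diff, and stream.
with open("ground_truth.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reload the file the same way your evaluation script would.
with open("ground_truth.jsonl") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded), "labeled examples")
```

The JSONL format is convenient because every candidate model gets evaluated against exactly the same file, and new labeled examples can be appended without rewriting anything.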
Run each candidate model against those examples and measure three things. Accuracy: what percentage of outputs are correct? Latency: how long does each response take? Cost: what will this cost per 1,000 requests at production volume?
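A minimal harness for those three measurements might look like the sketch below. The `model_fn` callable and its toy stand-in are placeholders for whatever API or local model you are testing; the cost figure is whatever your provider charges per call.

```python
import time

def evaluate(model_fn, examples, cost_per_call: float):
    """Measure accuracy, latency, and projected cost for one candidate model.

    model_fn: callable input -> predicted label (stand-in for a real model/API)
    examples: list of {"input": ..., "label": ...} ground-truth pairs
    """
    correct, latencies = 0, []
    for ex in examples:
        start = time.perf_counter()
        prediction = model_fn(ex["input"])
        latencies.append(time.perf_counter() - start)
        correct += (prediction == ex["label"])

    latencies.sort()
    p95 = latencies[int(len(latencies) * 0.95) - 1]  # rough p95
    return {
        "accuracy": correct / len(examples),
        "p95_latency_s": p95,
        "cost_per_1k_requests": cost_per_call * 1000,
    }

# Toy stand-in model: flags anything mentioning "refund" as a billing ticket.
toy_model = lambda text: "billing" if "refund" in text else "other"
dataset = [
    {"input": "please refund my order", "label": "billing"},
    {"input": "app will not start", "label": "other"},
]
report = evaluate(toy_model, dataset, cost_per_call=0.002)
print(report)
```

Run the same `evaluate` call once per candidate model and compare the three numbers side by side; with 200-500 examples the whole comparison takes minutes.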
A 2022 survey by Weights & Biases found that 67% of ML teams that skipped real-data evaluation regretted it within three months. The teams that built a proper evaluation dataset before choosing a model reported 40% fewer production failures.
| Evaluation criteria | What to measure | Acceptable threshold (typical) |
|---|---|---|
| Accuracy | % correct on your labeled dataset | 85%+ for classification, 70%+ for generation |
| Latency (p95) | 95th percentile response time | Under 200ms for user-facing, under 2s for background |
| Cost per 1,000 requests | Total inference spend | Depends on margin, but track it from day one |
| Failure modes | What does the model get wrong? | Errors must be non-catastrophic for your domain |
| Data format fit | Does the model handle your input format natively? | Should not require heavy preprocessing |
Do not skip the failure mode analysis. A model that is 92% accurate but confidently wrong on the other 8% in ways that embarrass your brand is worse than a model that is 88% accurate and says "I'm not sure" when it does not know. Timespade has audited AI features where a 4% error rate on a medical triage tool would have generated dangerous recommendations. The model scored great on benchmarks. It was still the wrong choice.
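One common guard against confidently-wrong failures is a confidence floor: the model abstains instead of answering when it is unsure. A minimal sketch, where the `(label, confidence)` interface and the 0.80 threshold are illustrative assumptions to tune for your domain:

```python
# A simple abstain guard: prefer "I'm not sure" over a confident wrong answer.
# The (label, confidence) model interface is a hypothetical stand-in.

CONFIDENCE_FLOOR = 0.80  # tune per domain; raise it for high-stakes products

def guarded_predict(model_fn, text: str):
    label, confidence = model_fn(text)
    if confidence < CONFIDENCE_FLOOR:
        return "unsure"  # route to a human or a fallback flow instead
    return label

# Stand-in model that reports its own confidence alongside its label.
fake_model = lambda text: ("urgent", 0.95) if "outage" in text else ("general", 0.40)

print(guarded_predict(fake_model, "full outage in production"))
print(guarded_predict(fake_model, "question about fonts"))
```

During failure-mode analysis, measure how often the guard fires and what the model would have answered in those cases; that tells you whether the threshold is buying safety or just throwing away correct answers.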
What licensing restrictions apply to open-source vs. commercial models?
Licensing is where many AI projects hit an invisible wall six months in. A founder picks an open-source model because it is free, builds the product around it, and then discovers the license prohibits commercial use or requires releasing proprietary improvements.
OpenAI's models (GPT-3, GPT-3.5) are commercial APIs with usage-based pricing. You pay per token, and OpenAI owns the model. You cannot host it yourself, modify it, or use it offline. If OpenAI changes pricing or terms, you absorb the change. As of December 2022, pricing for the most capable GPT-3.5 model (text-davinci-003) runs about $0.02 per 1,000 tokens.
Open-source models come with wildly different licenses. Meta's OPT-175B was released under a noncommercial license, meaning you can experiment with it but cannot ship a product on top of it without separate permission. BLOOM uses the Responsible AI License (RAIL), which allows commercial use but restricts certain applications. Stability AI's Stable Diffusion uses the permissive CreativeML OpenRAIL-M license, which explicitly permits commercial use.
Hugging Face's 2022 analysis of its model hub found that only 38% of open-source models with more than 1 billion parameters carried licenses that unambiguously permit commercial use. The rest had noncommercial clauses, unclear terms, or no license at all.
| License type | Can you sell a product with it? | Can you modify it? | Can you keep modifications private? | Example models |
|---|---|---|---|---|
| Commercial API (OpenAI, Cohere) | Yes, under terms of service | No access to weights | N/A | GPT-3.5, Cohere Generate |
| Permissive open-source (Apache 2.0, MIT) | Yes | Yes | Yes | Some Hugging Face models |
| Copyleft (GPL-family) | Yes, but modifications must also be open-sourced | Yes | No | Rare for large models |
| Noncommercial research only | No | Yes | Yes (for research) | OPT-175B, some academic models |
| Responsible AI License (RAIL) | Yes, with use restrictions | Yes | Varies | BLOOM, Stable Diffusion |
Before building anything on a model, have someone with legal expertise read the license. Timespade's engineering team flags licensing constraints during the discovery phase so founders do not get three months into a build only to find out the model cannot be used commercially.
How does model selection affect latency and infrastructure needs?
The model you choose determines your server bill. This is where the gap between a demo and a production product becomes expensive.
A large language model with 175 billion parameters requires specialized GPU servers to run. As of late 2022, renting a single NVIDIA A100 GPU from AWS costs approximately $3.06 per hour, or about $2,200 per month. A model of that size typically needs multiple A100s running in parallel. Running your own instance of a GPT-3-scale model on AWS would cost roughly $8,000-$15,000 per month in GPU rental alone, before any engineering time.
Using an API instead of hosting yourself is simpler but creates a dependency. OpenAI's API had four significant outages in Q3 2022 alone (tracked by Downdetector and OpenAI's own status page). If your product's core feature relies on that API, every outage is your outage. Your users do not care that the problem is at OpenAI. They see your product as broken.
Smaller models change the math entirely. A fine-tuned BERT-base model (110 million parameters) runs on a single mid-range GPU that costs about $350 per month. Response times drop below 20 milliseconds. You host it yourself, so uptime is under your control.
| Deployment approach | Monthly infra cost | Latency | Uptime control | Best for |
|---|---|---|---|---|
| Commercial API (OpenAI, Cohere) | $500-$5,000 (usage-based) | 200-800ms | Low, depends on provider | Prototyping, moderate-scale products |
| Self-hosted large model (100B+) | $8,000-$15,000 | 150-400ms | High | High-volume or privacy-sensitive products |
| Self-hosted small model (<1B) | $200-$500 | 5-30ms | High | Narrow tasks, low-latency requirements |
| Edge deployment (on-device) | Hardware cost only | 1-10ms | Full | Offline-capable or ultra-low-latency |
Timespade helps clients model these infrastructure costs during the planning phase, before a single line of code is written. Picking the wrong hosting approach can mean the difference between a $500 monthly server bill and a $15,000 one for the same feature.
Should I use a pre-trained model or train one from scratch?
Training a model from scratch is almost never the right choice for a startup. The cost is prohibitive and the results are rarely better than fine-tuning an existing model.
Training GPT-3 from scratch cost an estimated $4.6 million in compute alone (Lambda Labs, 2020). Even training a modest model with 1 billion parameters from scratch runs $100,000-$300,000, not counting the ML engineering team to manage the process. A 2022 Epoch AI study found that training compute for state-of-the-art models had been doubling every 6-10 months since 2018.
Fine-tuning takes a pre-trained model and adjusts it with your specific data. The cost drops by orders of magnitude. Fine-tuning GPT-3 on a custom dataset through OpenAI's API costs a few hundred dollars. Fine-tuning an open-source model like BERT on your own GPU takes a few hours and costs under $50 in compute. The resulting model often outperforms the base model on your specific task because it has learned the patterns in your data.
There are only two scenarios where training from scratch makes sense: when your data is in a language or domain so specialized that no existing model has seen anything like it (rare), or when you need complete control over the model's behavior for regulatory reasons and cannot accept the black-box nature of a pre-trained foundation model.
For everyone else, the path is clear: pick a pre-trained model, fine-tune it on your data, evaluate it on your test set, and ship it. A 2022 Stanford HAI report found that 92% of commercial AI deployments used pre-trained models with some degree of fine-tuning rather than training from scratch.
What happens when a model provider deprecates the version I rely on?
This is not hypothetical. Google deprecated several AutoML Vision model versions in 2022, forcing teams to migrate on short timelines. OpenAI has already reshuffled its GPT-3 engine lineup (the ada, babbage, curie, and davinci variants and their text-* successors) with limited advance notice. Hugging Face community models get abandoned by their maintainers regularly.
When a model you depend on disappears, you face a forced migration. Your team has to find a replacement model, re-run your evaluation benchmarks, adjust your integration code, and test everything before the old version goes offline. If you built your product tightly coupled to one model's specific behavior, quirks included, the migration can take weeks.
A 2022 O'Reilly survey of ML practitioners found that 29% had experienced a model deprecation that required unplanned engineering work. The median time to migrate was 3-6 weeks. For startups with small teams, that is a quarter of an engineer's time for a month and a half, doing work that adds zero new features.
The financial impact is real too. Unplanned migrations at a Western agency run $15,000-$30,000 in engineering time. Timespade handles the same migration for $4,000-$8,000 because the team structure and global talent economics make engineering hours less expensive without sacrificing quality. The developers running your migration have the same experience and credentials. They just live in cities where senior engineers earn $25,000-$50,000 per year instead of $160,000-$200,000.
How do I future-proof against model changes?
You cannot predict which models will exist in two years. You can build your product so that swapping models is a configuration change instead of a rewrite.
The principle is called abstraction, and in plain terms it means: build a layer between your product and the AI model so they never touch each other directly. Your product asks the layer "classify this support ticket." The layer talks to whatever model is currently active. When you need to swap models, you update the layer. Your product does not change.
Timespade builds this abstraction into every AI feature. The cost to include it is minimal during initial development, roughly 5-10% more engineering time. The cost to add it later, after your product is tightly wired to a specific model, is 3-5x higher. A 2022 ThoughtWorks Technology Radar report listed "model abstraction layers" as a technique that had moved from experimental to mature practice.
There are four concrete steps. Build an adapter between your product code and the model so swapping requires changing one configuration, not rewriting business logic. Keep your evaluation dataset current so you can benchmark a new model in hours, not weeks. Monitor model performance in production because accuracy can drift over time as your users' inputs change. Set up alerts that flag when the model's error rate crosses a threshold you define.
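The monitoring and alerting steps above can be sketched as a sliding-window error-rate check; the window size and threshold below are assumptions to tune for your product, and in production the outcomes would come from user feedback or spot-check labels rather than a synthetic loop.

```python
from collections import deque

# Sliding-window error-rate monitor: flag when recent accuracy drifts
# below a threshold you define. Window and threshold are assumptions.

class ErrorRateMonitor:
    def __init__(self, window: int = 500, max_error_rate: float = 0.10):
        self.outcomes = deque(maxlen=window)  # True = model was wrong
        self.max_error_rate = max_error_rate

    def record(self, was_wrong: bool) -> bool:
        """Record one prediction outcome; return True if the alert should fire."""
        self.outcomes.append(was_wrong)
        error_rate = sum(self.outcomes) / len(self.outcomes)
        return error_rate > self.max_error_rate

monitor = ErrorRateMonitor(window=100, max_error_rate=0.10)
# Simulate traffic where 1 in 5 recent predictions is wrong (20% error rate).
alerts = [monitor.record(was_wrong=(i % 5 == 0)) for i in range(100)]
print(alerts[-1])
```

A window keeps the alert sensitive to recent drift rather than diluting it across all-time history; pair it with your current evaluation dataset so a firing alert immediately triggers a re-benchmark.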
Teams that skip the abstraction layer save 2-3 days during the first build and lose 3-6 weeks during the first model swap. Timespade has seen this pattern across multiple client projects where the original developer chose speed over architecture. The rebuild cost is always higher than the upfront investment would have been.
How do I narrow down to two or three models?
The evaluation process above can feel overwhelming if you are looking at dozens of models on Hugging Face or comparing five different API providers. Here is how to cut the list fast.
Start with your constraint, not your wish list. If your data cannot leave your servers (healthcare, finance, legal), eliminate all API-only models immediately. That single filter removes GPT-3.5, Cohere, and most commercial offerings. If you need responses under 50 milliseconds, eliminate every model above 1 billion parameters. If your budget for AI infrastructure is under $1,000 per month, eliminate self-hosted large models.
Once you have applied constraints, you typically have 3-5 candidates. Run your 200-500 example evaluation against each one. Rank them on accuracy first, latency second, cost third. If two models score within 2% accuracy of each other, pick the cheaper or faster one. A 2% accuracy difference is rarely noticeable to users. A 10x latency difference always is.
| Step | Action | What it eliminates |
|---|---|---|
| 1 | Define data residency requirements | API-only models (if data cannot leave your servers) |
| 2 | Set latency ceiling | All models too large to meet your response time target |
| 3 | Set monthly infrastructure budget | Self-hosted large models if budget is under $2,000/mo |
| 4 | Check licensing for commercial use | Noncommercial and ambiguous-license models |
| 5 | Run evaluation on your real data | Models that underperform on your specific task |
| 6 | Compare finalists on cost and latency | The more expensive or slower model when accuracy is tied |
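The elimination funnel in the table above can be expressed as a few ordered filters over candidate models. The candidates, their numbers, and the constraint values here are made up for illustration:

```python
# Sketch of the elimination funnel: encode each candidate's properties,
# then apply constraints in order. All candidate data is illustrative.

candidates = [
    {"name": "big_api_model",  "hosting": "api",  "p95_ms": 400,
     "monthly_cost": 3000, "commercial_ok": True},
    {"name": "mid_oss_model",  "hosting": "self", "p95_ms": 120,
     "monthly_cost": 1200, "commercial_ok": False},
    {"name": "small_ft_model", "hosting": "self", "p95_ms": 15,
     "monthly_cost": 300,  "commercial_ok": True},
]

constraints = {
    "data_must_stay_on_prem": True,  # eliminates API-only models
    "latency_ceiling_ms": 200,
    "monthly_budget": 2000,
}

def passes(model: dict) -> bool:
    if constraints["data_must_stay_on_prem"] and model["hosting"] == "api":
        return False
    if model["p95_ms"] > constraints["latency_ceiling_ms"]:
        return False
    if model["monthly_cost"] > constraints["monthly_budget"]:
        return False
    return model["commercial_ok"]  # licensing check last

finalists = [m["name"] for m in candidates if passes(m)]
print(finalists)
```

The survivors of this pass are the models worth spending evaluation effort on; run the 200-500 example benchmark only against them.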
After this process, most teams land on two realistic options: a commercial API for speed-to-market, and an open-source fine-tuned model for long-term cost control. Many founders start with the API to validate the feature with real users, then migrate to a self-hosted model once they have proven the feature is worth the investment. That migration, handled by an experienced team, typically costs $5,000-$8,000 and takes 2-3 weeks. At a Western agency, the same work runs $15,000-$25,000 and takes 4-6 weeks.
Timespade has shipped AI features across language models, computer vision, predictive analytics, and recommendation engines. The team does not default to one model or one vendor. The recommendation comes from your data, your constraints, and your budget.
If you have an AI feature in mind but are not sure which model fits, the fastest path is a 30-minute conversation with someone who has done this across dozens of products. Book a free discovery call.
