ChatGPT launched in late November 2022, and by January 2023 every founder with a document library was asking the same question: can I point this thing at my own data and get a chatbot that actually knows my business?
The answer is yes. But "training on company data" turns out to mean two very different things in practice, and picking the wrong one will cost you months of work and a meaningful slice of your budget. This article explains both approaches, tells you which data to prepare and how, and gives you a realistic benchmark for how much is enough.
What does training on company data mean now that ChatGPT exists?
Before 2022, training a chatbot meant building a model from scratch on labeled conversational data. That required tens of millions of examples, GPU clusters, and a machine learning team. Since ChatGPT, the phrase has taken on a completely different meaning.
What founders actually want today is usually one of two things. They want the chatbot to know things the base model does not know, like their internal processes, product catalog, or support history. Or they want it to behave a certain way, like answering in a specific tone or staying strictly within a defined scope.
Neither of those goals requires training a model from scratch. Models like the one powering ChatGPT already understand language. What they lack is your specific context. Getting them that context is the real problem, and the method you choose has enormous practical consequences.
Gartner estimated in late 2022 that over 80% of enterprise AI projects fail before reaching production. The most common reason is teams selecting an approach that fits a theoretical ideal rather than the data they actually have on hand.
How does fine-tuning differ from feeding documents at query time?
Imagine the base model as a well-read generalist. Fine-tuning sends that generalist back to school on your specific material, adjusting the model's internal parameters so the new knowledge becomes part of how it responds. Retrieval does something different: it leaves the generalist exactly as they are but hands them a set of relevant documents to read every time they answer a question.
Fine-tuning is the right choice when you want to change how the model behaves, not just what it knows. If you need it to always respond in a particular format, follow strict rules about what it will and will not answer, or consistently use terminology from your industry, fine-tuning trains that behavior in. OpenAI documented in 2022 that fine-tuning on domain-specific data can improve accuracy on targeted tasks by 30 to 40% compared to the base model.
Retrieval, often called RAG in technical circles, suits situations where your information changes frequently or where you have a large body of documents that need to be searchable. You store your documents in a search index. When a user asks a question, the system finds the two or three most relevant passages and sends them to the model alongside the question. The model reads those passages and answers from them.
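The retrieval flow described above can be sketched in a few lines. This is a toy illustration, not a production implementation: real systems rank passages by embedding similarity in a vector index, while this sketch uses simple keyword overlap as a stand-in scorer. The function names and sample documents are invented for the example.

```python
import re

# Toy retrieval sketch: score each stored passage by word overlap with the
# question, pick the best matches, and bundle them into the prompt sent to
# the model. Production systems replace the overlap scorer with embedding
# similarity, but the overall flow is the same.

def words(text: str) -> set[str]:
    """Lowercase the text and extract its set of words."""
    return set(re.findall(r"\w+", text.lower()))

def score(question: str, passage: str) -> int:
    """Count how many of the question's words appear in the passage."""
    return len(words(question) & words(passage))

def build_prompt(question: str, passages: list[str], top_k: int = 2) -> str:
    """Select the top_k most relevant passages and attach them to the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The enterprise plan includes priority support and a dedicated manager.",
    "Shipping to Europe takes 5 to 7 business days.",
]
prompt = build_prompt("How many days do I have to return a purchase for a refund?", docs)
```

The model never sees the whole document library, only the handful of passages the scorer selects, which is why index quality matters so much in the sections that follow.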
| Approach | Best for | Updating knowledge | Data needed | Cost to build |
|---|---|---|---|---|
| Fine-tuning | Behavior change, tone, format control | Requires retraining | 1,000–100,000 labeled Q&A pairs | $8,000–$20,000 |
| Retrieval | Knowledge access, large document libraries | Re-index only changed documents | Any volume of clean text | $5,000–$12,000 |
| Both combined | Behavior plus knowledge at scale | Retrain plus re-index | Large labeled dataset plus documents | $18,000–$35,000 |
Western AI agencies typically quote $40,000–$80,000 for a retrieval-based chatbot on company documents, and $80,000–$150,000 for a fine-tuned solution. An AI-native team with experienced global engineers delivers the same output at roughly one-fifth those figures.
One thing retrieval cannot do: make the model smarter. It makes the model better informed. The model still does the reasoning. If your documents contain conflicting information, the model will sometimes pick the wrong passage and answer confidently from it. The quality of the source material sets a ceiling on the quality of the answers.
Do I need to clean and structure my data first?
Yes, and most founders underestimate how much this step matters.
For retrieval, the quality of the search index determines the quality of the answers. A document that covers seven unrelated topics on a single page produces poor search results because the system cannot identify which section is relevant to a given question. Short, focused documents with clear titles outperform long, sprawling ones. A 2022 study from Pinecone found that splitting documents into sections of 200 to 400 words improved retrieval accuracy by roughly 25% compared to indexing whole pages.
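The splitting step can be sketched as follows. This is a minimal word-count splitter, assuming a 300-word target inside the 200-to-400-word range mentioned above; a real pipeline would also try to break at paragraph or heading boundaries rather than mid-sentence.

```python
# Sketch of the chunking step: split a long document into consecutive
# sections of roughly 300 words each, so the search index can match a
# question to one focused passage instead of a sprawling page.

def chunk_words(text: str, target_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most target_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + target_words])
        for i in range(0, len(words), target_words)
    ]

doc = ("word " * 750).strip()   # a 750-word document
chunks = chunk_words(doc)       # three chunks of 300, 300, and 150 words
```

Splitting on raw word counts is the crudest workable approach; it exists here only to show where chunking sits in the pipeline.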
For fine-tuning, the bar is higher. The model learns patterns from your examples, so if your training data contains inconsistent answers to similar questions, the model will be inconsistent too. A dataset skewed toward one type of question will produce a model that handles that type well and struggles with everything else.
Before writing any code, run through this check on your documents:

- Remove content that is no longer accurate. A chatbot trained on outdated pricing or discontinued products will confidently give wrong answers.
- Split long documents into focused sections so each chunk addresses one clear topic.
- Check for contradictions. Where two documents give different answers to the same question, decide which is correct and update or remove the other.
- Identify gaps. Map the ten questions your users will most commonly ask and confirm that your document set actually contains clear answers to all ten.
For fine-tuning specifically, you need labeled pairs where each example is a question paired with the correct answer. If your data exists as raw text rather than structured Q&A pairs, converting it is a manual process. Budget one to two weeks of preparation time for every 10,000 examples you plan to use.
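The labeled pairs typically end up in a JSONL file: one JSON object per line, each holding a question and its correct answer. The sketch below follows OpenAI's original prompt/completion fine-tuning format from that era; the exact field names vary by provider, and the sample pairs are invented for illustration.

```python
import json

# Sketch of converting labeled Q&A pairs into a JSONL training file.
# Each line is one JSON object with a prompt (the question) and a
# completion (the answer). The leading space on the completion follows
# OpenAI's old formatting recommendation for its completion models.

pairs = [
    ("What is your refund window?",
     "Refunds are available within 30 days of purchase."),
    ("Do you ship to Europe?",
     "Yes, European shipping takes 5 to 7 business days."),
]

def to_jsonl(pairs: list[tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": q, "completion": " " + a})
        for q, a in pairs
    )

jsonl = to_jsonl(pairs)
```

The conversion itself is trivial; the one-to-two-weeks estimate above is spent writing and reviewing the answers, not serializing them.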
An IBM study from 2022 found that data quality issues account for 30 to 80% of AI project failures, with poor data costing companies an average of $12.9 million per year. The projects that fail are usually not failing because the model is wrong. They are failing because the documents the model is reading are wrong.
How much data is enough for a useful chatbot?
The answer depends on the approach, but the number is smaller than most founders expect.
For retrieval, there is no minimum threshold. A functional chatbot can be built on 20 clean, well-structured documents. What matters more than volume is coverage: do your documents contain direct answers to the questions your users will actually ask? A well-structured library of 50 documents will outperform a poorly organized one of 5,000.
For fine-tuning, the practical floor is around 1,000 labeled examples. Below that, the model does not see enough variation to generalize, and you risk the model becoming very good at your specific training examples while performing poorly on anything slightly different. At 1,000 examples you can reliably shift specific behaviors. At 10,000 to 50,000 examples you can shift overall style and domain knowledge in a meaningful way.
OpenAI's own 2022 documentation recommends starting fine-tuning experiments with as few as 50 to 100 high-quality examples to test whether the approach works before investing in a larger dataset. That is sensible: run a small experiment, measure quality, then scale data collection if the results justify it.
| Data volume | What you can achieve | Recommended approach |
|---|---|---|
| Under 500 documents or labeled examples | Basic Q&A on a narrow topic | Retrieval only |
| 500–5,000 labeled Q&A pairs | Consistent tone, format, narrow domain | Fine-tuning for behavior, retrieval for knowledge |
| 5,000–50,000 labeled Q&A pairs | Strong domain specialization | Fine-tuning plus retrieval |
| 50,000+ labeled Q&A pairs | Near-expert performance in a specific area | Full fine-tuning pipeline |
The mistake most teams make is waiting until they have a large, clean dataset before building anything. A retrieval chatbot on well-organized documents is a production-quality product. It can go live in weeks, generate real conversations with real users, and those conversations become the labeled data you eventually use for fine-tuning. Start with retrieval, collect user feedback, and upgrade when the data supports it.
Timespade builds both retrieval and fine-tuning pipelines as part of its Generative AI practice. If your chatbot also needs to connect to a data system, surface predictions, or live inside a product you are building, that is one team and one contract, not three vendors trying to coordinate across Slack channels.
The first conversation is free. Walk through your use case on a discovery call and get a clear assessment of which approach fits the data you already have. Book a free discovery call.
