Most companies sit on enough data to start a predictive maintenance program today. They just do not know which data matters.
The common assumption is that you need years of sensor telemetry, a purpose-built data warehouse, and a dedicated data science team before the first model runs. None of that is true. A 2024 report from Deloitte found that 68% of manufacturers who launched predictive maintenance programs started with data they already had in their ERP or maintenance management system, not purpose-built sensor networks.
Here is what you actually need, how long it takes, and what to collect first.
What are the minimum data requirements for a first model?
Three categories of data cover the minimum viable input for a working predictive maintenance model.
Failure history is the foundation. You need records of when equipment broke down, what failed, and how long the repair took. Twelve months is the practical floor. Less than that and the model cannot distinguish between normal variation and a genuine warning signal. The records do not need to start out clean or digital; even spreadsheets or scanned paper logs can be digitized and fed into a model, though that adds preprocessing time.
Operational context tells the model what the equipment was doing when it failed. This includes runtime hours, load levels, throughput, and any operator-recorded observations. A pump that fails after 900 hours of high-load operation carries a completely different signal than one that fails after 200 hours of normal use. Without operational context, failure history alone produces a model that predicts when equipment is old, not when it is actually at risk.
Equipment specifications anchor the model to physical reality. Manufacturer-rated service intervals, maximum load thresholds, and known failure modes let the model flag anomalies rather than just patterns. A compressor running at 94% of its rated capacity is not the same risk as one running at 94% of its actual observed maximum. Specs make that distinction possible.
With those three inputs, a team can train a model that reduces unplanned downtime by 20–30% compared to calendar-based maintenance schedules (McKinsey, 2024). That is the floor. Sensor data, which we will cover shortly, can push that figure to 40–50%.
| Data Category | Minimum Requirement | Why It Matters |
|---|---|---|
| Failure history | 12+ months of breakdown records | Trains the model to recognize pre-failure conditions |
| Operational logs | Runtime hours, load levels, throughput | Adds context that separates genuine risk from normal age |
| Equipment specs | Manufacturer thresholds, known failure modes | Grounds predictions in physical limits, not just patterns |
| Sensor data | Optional at start | Adds real-time signals; lifts downtime reduction from roughly 20–30% to 40–50% |
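As a rough sketch of what those inputs look like once structured, here is a minimal schema with hypothetical field names; your ERP or CMMS export will use its own.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical minimal schema for the three core inputs.
# Field names are illustrative; map them to your own ERP/CMMS export.

@dataclass
class FailureRecord:
    equipment_id: str
    failed_at: datetime
    component: str            # e.g. "shaft seal"
    symptom: Optional[str]    # e.g. "bearing noise"; None if not logged
    repair_hours: float

@dataclass
class OperationalLog:
    equipment_id: str
    logged_at: datetime
    runtime_hours: float      # cumulative hours since last overhaul
    load_pct: float           # observed load as a share of rated capacity
    throughput: float         # units per hour, where applicable

@dataclass
class EquipmentSpec:
    equipment_id: str
    rated_max_load: float
    service_interval_hours: float
    known_failure_modes: list[str]
```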
How does the system use historical maintenance records?
This is where most founders underestimate what they already have.
A maintenance record does not need to be formatted or clean to be useful. A technician's handwritten note saying "bearing noise before failure, replaced shaft seal" contains two signals: a symptom that preceded a failure and the component that actually gave out. When hundreds of those notes are digitized and structured, the model learns to associate early symptom types with specific failure modes.
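To make that concrete, here is a minimal sketch of the kind of keyword tagging that turns a free-text note into structured signals. The keyword lists are illustrative assumptions, not a fixed taxonomy; in practice they come from your technicians' own vocabulary and the equipment's known failure modes.

```python
import re

# Illustrative keyword maps; a real program would build these from
# your own maintenance logs, not hard-code them.
SYMPTOM_KEYWORDS = {
    "bearing noise": r"bearing (noise|whine|grind)",
    "vibration": r"vibrat",
    "overheating": r"overheat|running hot|high temp",
}
COMPONENT_KEYWORDS = {
    "shaft seal": r"shaft seal",
    "bearing": r"bearing",
    "impeller": r"impeller",
}

def tag_note(note: str) -> dict:
    """Extract symptom and component mentions from a free-text maintenance note."""
    note_lower = note.lower()
    symptoms = [name for name, pat in SYMPTOM_KEYWORDS.items() if re.search(pat, note_lower)]
    components = [name for name, pat in COMPONENT_KEYWORDS.items() if re.search(pat, note_lower)]
    return {"symptoms": symptoms, "components": components}

print(tag_note("bearing noise before failure, replaced shaft seal"))
# {'symptoms': ['bearing noise'], 'components': ['shaft seal', 'bearing']}
```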
The process works in two steps. First, the data is cleaned and labeled. Each record gets tagged with the equipment ID, the date, the symptom (if any), the failure type, and the repair action. Second, the model learns the sequence. It looks for records where symptoms appeared before failures and builds a probabilistic map: when equipment shows characteristic X followed by characteristic Y, failure of type Z typically occurs within a certain time window.
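A simplified sketch of that second step, assuming the records have already been tagged as above: count how often each symptom precedes each failure type on the same equipment within a fixed window, then turn the counts into rough conditional probabilities. The window length and data shape are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each record: (equipment_id, date, symptom_or_None, failure_type_or_None).
# Illustrative data shape; real records come from the cleaned history.
records = [
    ("pump-07", datetime(2024, 3, 1), "bearing noise", None),
    ("pump-07", datetime(2024, 3, 18), None, "shaft seal failure"),
    ("pump-12", datetime(2024, 5, 2), "bearing noise", None),
    ("pump-12", datetime(2024, 5, 30), None, "shaft seal failure"),
]

WINDOW = timedelta(days=45)  # assumed lookahead window

# Count symptom -> failure co-occurrences on the same equipment within the window.
pair_counts = defaultdict(int)
symptom_counts = defaultdict(int)

for eq, date, symptom, _ in records:
    if symptom is None:
        continue
    symptom_counts[symptom] += 1
    for eq2, date2, _, failure in records:
        if failure and eq2 == eq and date < date2 <= date + WINDOW:
            pair_counts[(symptom, failure)] += 1

# Empirical P(failure within window | symptom observed)
for (symptom, failure), n in pair_counts.items():
    print(f"P({failure} within {WINDOW.days}d | {symptom}) ≈ {n / symptom_counts[symptom]:.2f}")
```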
IBM's 2023 analysis of industrial maintenance programs found that structured historical records alone, without any real-time sensor input, can predict equipment failures with 70–80% accuracy when at least 18 months of records are available. That is precise enough to shift from reactive replacement to scheduled intervention, which is where most of the cost savings come from.
The practical limit is record quality. If your maintenance team logged failures under generic categories like "mechanical issue" without recording symptoms or components, the model has less to learn from. The fix is not to wait for better records; it is to start capturing better data now while the model trains on what exists. Within six months of improved logging, model accuracy improves noticeably.
Should I collect sensor data before choosing a platform?
No. Choosing a platform is the right first move, and here is why.
Sensor selection depends on what failure modes you are trying to predict. A vibration sensor on a rotating shaft is the right choice for bearing wear. A temperature sensor is the right choice for thermal overload. An acoustic sensor detects cavitation in pumps that vibration sensors miss entirely. Without a model telling you which failure modes are most likely and most costly, there is no principled way to choose which sensors to install.
The sequence that produces the least wasted spend: build the initial model on historical data, identify the two or three failure modes responsible for the majority of your unplanned downtime, then install sensors targeting exactly those failure modes on the equipment with the highest repair and downtime costs.
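A rough sketch of that prioritization step: rank failure modes by their combined repair and downtime cost, then instrument the top two or three. The column names and the per-hour cost figure are hypothetical placeholders.

```python
import pandas as pd

# Illustrative failure history; columns are hypothetical.
history = pd.DataFrame({
    "failure_mode": ["bearing wear", "seal leak", "motor overload",
                     "bearing wear", "seal leak", "bearing wear"],
    "downtime_hours": [6, 3, 10, 8, 2, 5],
    "repair_cost": [4200, 900, 7500, 5100, 700, 3800],
})

DOWNTIME_COST_PER_HOUR = 12_000  # assumed site-specific figure

summary = (
    history.assign(total_cost=lambda df: df["repair_cost"]
                   + df["downtime_hours"] * DOWNTIME_COST_PER_HOUR)
    .groupby("failure_mode")["total_cost"].sum()
    .sort_values(ascending=False)
)
print(summary.head(3))  # the two or three modes worth instrumenting first
```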
A 2023 study in the Journal of Manufacturing Systems found that targeted sensor deployment, meaning sensors selected based on a prior failure analysis, reduced data collection costs by 40% compared to broad sensor networks installed before any analysis was done. Broad sensor networks also generate data volumes that require significantly more storage and processing, adding infrastructure cost without proportional accuracy gains until the model matures.
If you already have sensors installed, start with those readings. But do not delay the program waiting for sensor coverage to feel complete. The historical data gets the model running. Sensors sharpen it over time.
How long does it take to gather enough training data?
For most industrial operations, you already have it.
If your equipment has been maintained for more than 18 months and you have been recording failures in any system, you have enough to begin. The data collection phase for a first predictive maintenance model is not about gathering new data. It is about extracting, cleaning, and structuring data that already exists.
That extraction and structuring process typically takes four to eight weeks, depending on how fragmented the records are. ERP systems like SAP or Oracle export maintenance history in formats that are reasonably structured. Work order systems from vendors like IBM Maximo or Infor usually do the same. Paper-based logs take longer because they require manual entry or optical character recognition to digitize.
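For a reasonably structured export, the cleaning step can start as simply as this sketch; the file name and column names are hypothetical stand-ins for whatever your system produces.

```python
import pandas as pd

# Hypothetical export from an ERP or work-order system; adjust column
# names and date formats to match your own system's export.
orders = pd.read_csv("work_order_export.csv")

orders = (
    orders.rename(columns={"EQUIP_NO": "equipment_id",
                           "WO_DATE": "date",
                           "PROBLEM_DESC": "note"})
          .assign(date=lambda df: pd.to_datetime(df["date"], errors="coerce"))
          .dropna(subset=["equipment_id", "date"])
          .drop_duplicates()
)
print(f"{len(orders)} usable work orders after cleaning")
```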
If your records genuinely do not go back 12 months, either because the operation is new or historical records were lost, there are two options. Some teams run a six-month data collection period before model training, accepting a less accurate first model. Others supplement limited history with simulated failure data based on manufacturer specifications and industry failure rate benchmarks, then refine the model as real data accumulates. The second approach produces a deployable model faster.
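One common way to implement the second option is to sample synthetic times-to-failure from a Weibull distribution whose mean matches the manufacturer's stated MTBF. A minimal sketch; the figures below are assumptions for illustration, not benchmarks.

```python
from math import gamma
import numpy as np

rng = np.random.default_rng(42)

# Assumed manufacturer figures: mean time between failures (hours) and a
# Weibull shape parameter above 1, i.e. wear-out failures dominate.
MTBF_HOURS = 9_000
SHAPE = 1.8

# Scale the Weibull so its mean matches the stated MTBF.
scale = MTBF_HOURS / gamma(1 + 1 / SHAPE)

# Draw synthetic times-to-failure to pad a thin failure history.
synthetic_failures = rng.weibull(SHAPE, size=200) * scale
print(f"median simulated time to failure: {np.median(synthetic_failures):.0f} hours")
```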
| Starting Condition | Time to First Model | Expected Accuracy |
|---|---|---|
| 18+ months of structured digital records | 4–6 weeks | High (70–80% failure prediction rate) |
| 12–18 months of mixed digital/paper records | 6–10 weeks | Moderate (55–70%) |
| Under 12 months of records | 3–6 months data collection + 4 weeks modeling | Lower initially; improves with each new data point |
| No records, starting fresh | 6–12 months before meaningful model | Starts as rule-based, transitions to ML-based over time |
On cost: building a first predictive maintenance model with an AI-native team costs around $8,000–$12,000, covering data extraction, model training, a dashboard showing equipment risk scores, and integration with your existing maintenance workflow. A traditional analytics firm charges $40,000–$60,000 for the same scope. The gap comes from the same place as in software development: AI handles the repetitive parts of data processing and model scaffolding, which used to consume weeks of consultant hours on every engagement.
The ROI closes fast. Unplanned equipment downtime costs manufacturers an average of $260,000 per hour (Siemens, 2023). A model that prevents even one major unplanned failure per quarter pays for itself in the first month of operation.
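The arithmetic behind that claim, with one assumed input (the length of the outage a prevented failure would have caused):

```python
# Back-of-envelope ROI; the outage length is an assumption, plug in your own.
DOWNTIME_COST_PER_HOUR = 260_000   # Siemens, 2023 (average across manufacturers)
MODEL_COST = 12_000                # upper end of the build cost above
ASSUMED_OUTAGE_HOURS = 2           # hypothetical duration of one prevented failure

avoided_cost = ASSUMED_OUTAGE_HOURS * DOWNTIME_COST_PER_HOUR
print(f"one prevented failure ≈ ${avoided_cost:,} avoided vs ${MODEL_COST:,} model cost")
```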
If you want to know whether your current data is enough to start, book a free discovery call. We will review what you have and tell you exactly what is missing, if anything.
