Most companies sit on enough data to start a predictive maintenance program today. They just do not know which data matters.
The common assumption is that you need years of sensor telemetry, a purpose-built data warehouse, and a dedicated data science team before the first model runs. None of that is true. A 2024 report from Deloitte found that 68% of manufacturers who launched predictive maintenance programs started with data they already had in their ERP or maintenance management system, not purpose-built sensor networks.
Here is what you actually need, how long it takes, and what to collect first.
What are the minimum data requirements for a first model?
Three categories of data cover the minimum viable input for a working predictive maintenance model.
Failure history is the foundation. You need records of when equipment broke down, what failed, and how long the repair took. Twelve months is the practical floor. Less than that and the model cannot distinguish between normal variation and a genuine warning signal. The records do not need to start out clean or digital; even spreadsheets or scanned paper logs can be digitized and fed into a model, though that adds preprocessing time.
Operational context tells the model what the equipment was doing when it failed. This includes runtime hours, load levels, throughput, and any operator-recorded observations. A pump that fails after 900 hours of high-load operation carries a completely different signal than one that fails after 200 hours of normal use. Without operational context, failure history alone produces a model that predicts when equipment is old, not when it is actually at risk.
Equipment specifications anchor the model to physical reality. Manufacturer-rated service intervals, maximum load thresholds, and known failure modes let the model flag anomalies rather than just patterns. A compressor running at 94% of its rated capacity is not the same risk as one running at 94% of its actual observed maximum. Specs make that distinction possible.
With those three inputs, a team can train a model that reduces unplanned downtime by 20–30% compared to calendar-based maintenance schedules (McKinsey, 2024). That is the floor. Sensor data, which we will cover shortly, can push that figure to 40–50%.
| Data Category | Minimum Requirement | Why It Matters |
|---|---|---|
| Failure history | 12+ months of breakdown records | Trains the model to recognize pre-failure conditions |
| Operational logs | Runtime hours, load levels, throughput | Adds context that separates genuine risk from normal age |
| Equipment specs | Manufacturer thresholds, known failure modes | Grounds predictions in physical limits, not just patterns |
| Sensor data | Optional at start | Adds real-time signals; lifts downtime reduction from roughly 20–30% to 40–50% |
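As a rough sketch of what those inputs look like once structured, here is a minimal schema with hypothetical field names; your ERP or CMMS export will use its own.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical minimal schema for the three core inputs.
# Field names are illustrative; map them to your own ERP/CMMS export.

@dataclass
class FailureRecord:
    equipment_id: str
    failed_at: datetime
    component: str            # e.g. "shaft seal"
    symptom: Optional[str]    # e.g. "bearing noise"; None if not logged
    repair_hours: float

@dataclass
class OperationalLog:
    equipment_id: str
    logged_at: datetime
    runtime_hours: float      # cumulative hours since last overhaul
    load_pct: float           # observed load as a share of rated capacity
    throughput: float         # units per hour, where applicable

@dataclass
class EquipmentSpec:
    equipment_id: str
    rated_max_load: float
    service_interval_hours: float
    known_failure_modes: list[str]
```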
How does the system use historical maintenance records?
This is where most founders underestimate what they already have.
A maintenance record does not need to be formatted or clean to be useful. A technician's handwritten note saying "bearing noise before failure, replaced shaft seal" contains two signals: a symptom that preceded a failure and the component that actually gave out. When hundreds of those notes are digitized and structured, the model learns to associate early symptom types with specific failure modes.
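To make that concrete, here is a minimal sketch of the kind of keyword tagging that turns a free-text note into structured signals. The keyword lists are illustrative assumptions, not a fixed taxonomy; in practice they come from your technicians' own vocabulary and the equipment's known failure modes.

```python
import re

# Illustrative keyword maps; a real program would build these from
# your own maintenance logs, not hard-code them.
SYMPTOM_KEYWORDS = {
    "bearing noise": r"bearing (noise|whine|grind)",
    "vibration": r"vibrat",
    "overheating": r"overheat|running hot|high temp",
}
COMPONENT_KEYWORDS = {
    "shaft seal": r"shaft seal",
    "bearing": r"bearing",
    "impeller": r"impeller",
}

def tag_note(note: str) -> dict:
    """Extract symptom and component mentions from a free-text maintenance note."""
    note_lower = note.lower()
    symptoms = [name for name, pat in SYMPTOM_KEYWORDS.items() if re.search(pat, note_lower)]
    components = [name for name, pat in COMPONENT_KEYWORDS.items() if re.search(pat, note_lower)]
    return {"symptoms": symptoms, "components": components}

print(tag_note("bearing noise before failure, replaced shaft seal"))
# {'symptoms': ['bearing noise'], 'components': ['shaft seal', 'bearing']}
```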
The process works in two steps. First, the data is cleaned and labeled. Each record gets tagged with the equipment ID, the date, the symptom (if any), the failure type, and the repair action. Second, the model learns the sequence. It looks for records where symptoms appeared before failures and builds a probabilistic map: when equipment shows characteristic X followed by characteristic Y, failure of type Z typically occurs within a certain time window.
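A simplified sketch of that second step, assuming the records have already been tagged as above: count how often each symptom precedes each failure type on the same equipment within a fixed window, then turn the counts into rough conditional probabilities. The window length and data shape are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each record: (equipment_id, date, symptom_or_None, failure_type_or_None).
# Illustrative data shape; real records come from the cleaned history.
records = [
    ("pump-07", datetime(2024, 3, 1), "bearing noise", None),
    ("pump-07", datetime(2024, 3, 18), None, "shaft seal failure"),
    ("pump-12", datetime(2024, 5, 2), "bearing noise", None),
    ("pump-12", datetime(2024, 5, 30), None, "shaft seal failure"),
]

WINDOW = timedelta(days=45)  # assumed lookahead window

# Count symptom -> failure co-occurrences on the same equipment within the window.
pair_counts = defaultdict(int)
symptom_counts = defaultdict(int)

for eq, date, symptom, _ in records:
    if symptom is None:
        continue
    symptom_counts[symptom] += 1
    for eq2, date2, _, failure in records:
        if failure and eq2 == eq and date < date2 <= date + WINDOW:
            pair_counts[(symptom, failure)] += 1

# Empirical P(failure within window | symptom observed)
for (symptom, failure), n in pair_counts.items():
    print(f"P({failure} within {WINDOW.days}d | {symptom}) ≈ {n / symptom_counts[symptom]:.2f}")
```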
IBM's 2023 analysis of industrial maintenance programs found that structured historical records alone, without any real-time sensor input, can predict equipment failures with 70–80% accuracy when at least 18 months of records are available. That is precise enough to shift from reactive replacement to scheduled intervention, which is where most of the cost savings come from.
The practical limit is record quality. If your maintenance team logged failures under generic categories like "mechanical issue" without recording symptoms or components, the model has less to learn from. The fix is not to wait for better records; it is to start capturing better data now while the model trains on what exists. Within six months of improved logging, model accuracy improves noticeably.
Should I collect sensor data before choosing a platform?
No. Choosing a platform is the right first move, and here is why.
Sensor selection depends on what failure modes you are trying to predict. A vibration sensor on a rotating shaft is the right choice for bearing wear. A temperature sensor is the right choice for thermal overload. An acoustic sensor detects cavitation in pumps that vibration sensors miss entirely. Without a model telling you which failure modes are most likely and most costly, there is no principled way to choose which sensors to install.
The sequence that produces the least wasted spend: build the initial model on historical data, identify the two or three failure modes responsible for the majority of your unplanned downtime, then install sensors targeting exactly those failure modes on the equipment with the highest repair and downtime costs.
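A rough sketch of that prioritization step: rank failure modes by their combined repair and downtime cost, then instrument the top two or three. The column names and the per-hour cost figure are hypothetical placeholders.

```python
import pandas as pd

# Illustrative failure history; columns are hypothetical.
history = pd.DataFrame({
    "failure_mode": ["bearing wear", "seal leak", "motor overload",
                     "bearing wear", "seal leak", "bearing wear"],
    "downtime_hours": [6, 3, 10, 8, 2, 5],
    "repair_cost": [4200, 900, 7500, 5100, 700, 3800],
})

DOWNTIME_COST_PER_HOUR = 12_000  # assumed site-specific figure

summary = (
    history.assign(total_cost=lambda df: df["repair_cost"]
                   + df["downtime_hours"] * DOWNTIME_COST_PER_HOUR)
    .groupby("failure_mode")["total_cost"].sum()
    .sort_values(ascending=False)
)
print(summary.head(3))  # the two or three modes worth instrumenting first
```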
A 2023 study in the Journal of Manufacturing Systems found that targeted sensor deployment, meaning sensors selected based on a prior failure analysis, reduced data collection costs by 40% compared to broad sensor networks installed before any analysis was done. Broad sensor networks also generate data volumes that require significantly more storage and processing, adding infrastructure cost without proportional accuracy gains until the model matures.
If you already have sensors installed, start with those readings. But do not delay the program waiting for sensor coverage to feel complete. The historical data gets the model running. Sensors sharpen it over time.
How long does it take to gather enough training data?
For most industrial operations, you already have it.
If your equipment has been maintained for more than 18 months and you have been recording failures in any system, you have enough to begin. The data collection phase for a first predictive maintenance model is not about gathering new data. It is about extracting, cleaning, and structuring data that already exists.
That extraction and structuring process typically takes four to eight weeks, depending on how fragmented the records are. ERP systems like SAP or Oracle export maintenance history in formats that are reasonably structured. Work order systems from vendors like IBM Maximo or Infor usually do the same. Paper-based logs take longer because they require manual entry or optical character recognition to digitize.
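For a reasonably structured export, the cleaning step can start as simply as this sketch; the file name and column names are hypothetical stand-ins for whatever your system produces.

```python
import pandas as pd

# Hypothetical export from an ERP or work-order system; adjust column
# names and date formats to match your own system's export.
orders = pd.read_csv("work_order_export.csv")

orders = (
    orders.rename(columns={"EQUIP_NO": "equipment_id",
                           "WO_DATE": "date",
                           "PROBLEM_DESC": "note"})
          .assign(date=lambda df: pd.to_datetime(df["date"], errors="coerce"))
          .dropna(subset=["equipment_id", "date"])
          .drop_duplicates()
)
print(f"{len(orders)} usable work orders after cleaning")
```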
If your records genuinely do not go back 12 months, either because the operation is new or historical records were lost, there are two options. Some teams run a six-month data collection period before model training, accepting a less accurate first model. Others supplement limited history with simulated failure data based on manufacturer specifications and industry failure rate benchmarks, then refine the model as real data accumulates. The second approach produces a deployable model faster.
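One common way to implement the second option is to sample synthetic times-to-failure from a Weibull distribution whose mean matches the manufacturer's stated MTBF. A minimal sketch; the figures below are assumptions for illustration, not benchmarks.

```python
from math import gamma
import numpy as np

rng = np.random.default_rng(42)

# Assumed manufacturer figures: mean time between failures (hours) and a
# Weibull shape parameter above 1, i.e. wear-out failures dominate.
MTBF_HOURS = 9_000
SHAPE = 1.8

# Scale the Weibull so its mean matches the stated MTBF.
scale = MTBF_HOURS / gamma(1 + 1 / SHAPE)

# Draw synthetic times-to-failure to pad a thin failure history.
synthetic_failures = rng.weibull(SHAPE, size=200) * scale
print(f"median simulated time to failure: {np.median(synthetic_failures):.0f} hours")
```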
| Starting Condition | Time to First Model | Expected Accuracy |
|---|---|---|
| 18+ months of structured digital records | 4–6 weeks | High (70–80% failure prediction rate) |
| 12–18 months of mixed digital/paper records | 6–10 weeks | Moderate (55–70%) |
| Under 12 months of records | 3–6 months data collection + 4 weeks modeling | Lower initially; improves with each new data point |
| No records, starting fresh | 6–12 months before meaningful model | Starts as rule-based, transitions to ML-based over time |
On cost: building a first predictive maintenance model with an AI-native team costs around $8,000–$12,000, covering data extraction, model training, a dashboard showing equipment risk scores, and integration with your existing maintenance workflow. A traditional analytics firm charges $40,000–$60,000 for the same scope. The gap comes from the same place as in software development: AI handles the repetitive parts of data processing and model scaffolding, which used to consume weeks of consultant hours on every engagement.
The ROI closes fast. Unplanned equipment downtime costs manufacturers an average of $260,000 per hour (Siemens, 2023). A model that prevents even one major unplanned failure per quarter pays for itself in the first month of operation.
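The arithmetic behind that claim, with one assumed input (the length of the outage a prevented failure would have caused):

```python
# Back-of-envelope ROI; the outage length is an assumption, plug in your own.
DOWNTIME_COST_PER_HOUR = 260_000   # Siemens, 2023 (average across manufacturers)
MODEL_COST = 12_000                # upper end of the build cost above
ASSUMED_OUTAGE_HOURS = 2           # hypothetical duration of one prevented failure

avoided_cost = ASSUMED_OUTAGE_HOURS * DOWNTIME_COST_PER_HOUR
print(f"one prevented failure ≈ ${avoided_cost:,} avoided vs ${MODEL_COST:,} model cost")
```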
If you want to know whether your current data is enough to start, book a free discovery call. We will review what you have and tell you exactly what is missing, if anything.
