A payment gets submitted. Before the buyer finishes reading the confirmation screen, your system has already decided whether to approve or block it. That decision takes less than 300 milliseconds, and it is not a rule. It is a score produced by a statistical model that has seen millions of transactions and learned which patterns precede fraud.
This is real-time fraud scoring: a machine learning system that assigns every transaction a risk number between 0 and 1, where 0 means almost certainly legitimate and 1 means almost certainly fraudulent. The business acts on that number by approving, blocking, or sending the transaction to a human reviewer.
Understanding how that score gets produced is not an academic exercise. It tells you where false positives come from, why good customers get declined, and what data you need before a fraud model is worth building.
What happens during the milliseconds between transaction and score?
The moment a transaction arrives, it triggers a chain of lookups that runs faster than any human could track. The system is not waiting for all the data to assemble before it thinks. It queries multiple data sources simultaneously and assembles the score from whatever returns in time.
Here is the sequence in plain terms. A customer submits a payment. The system immediately reads the transaction details: amount, merchant category, currency, and the time of day. At the same time, it pulls the customer's history from a fast-access database, checks whether the device fingerprint has been seen before, and compares the geographic location of the current request against where this customer usually transacts.
All of that happens in parallel, in roughly 50–100 milliseconds. The model then takes those assembled signals and runs them through a calculation that produces the risk score. The entire end-to-end window, from payment submission to score returned, has to complete in under 300 milliseconds for the user to experience no perceptible delay. Visa's internal benchmarks put their authorization decisions at around 130 milliseconds on average.
Missing the window does not mean the transaction is blocked. It means the system falls back to a simpler rule set, which is faster but less accurate. Latency is a first-class concern in fraud architecture, not an afterthought.
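The parallel-assembly-with-a-deadline pattern can be sketched in a few lines. Everything here is illustrative: the lookup functions, field names, and simulated latencies are stand-ins, and `asyncio` stands in for whatever concurrency machinery the real service uses.

```python
import asyncio

# Hypothetical lookup stubs -- in production these would hit a feature
# store, a device-fingerprint service, and a geo-IP database.
async def fetch_customer_history(customer_id):
    await asyncio.sleep(0.03)          # simulate a 30 ms store read
    return {"avg_amount_30d": 42.50, "txn_count_24h": 2}

async def fetch_device_signals(device_id):
    await asyncio.sleep(0.02)
    return {"device_seen_before": True}

async def fetch_geo_signals(ip_address):
    await asyncio.sleep(0.25)          # simulate a slow lookup that misses budget
    return {"km_from_home": 12.0}

async def assemble_features(txn, budget_s=0.1):
    """Run all lookups in parallel; keep whatever returns inside the budget."""
    tasks = {
        "history": asyncio.create_task(fetch_customer_history(txn["customer_id"])),
        "device": asyncio.create_task(fetch_device_signals(txn["device_id"])),
        "geo": asyncio.create_task(fetch_geo_signals(txn["ip"])),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_s)
    for t in pending:                  # missed the window: cancel; the model
        t.cancel()                     # scores on the signals that did arrive
    features = dict(txn)
    for name, task in tasks.items():
        if task in done:
            features.update(task.result())
    return features

txn = {"customer_id": "c1", "device_id": "d1", "ip": "203.0.113.7", "amount": 40.0}
features = asyncio.run(assemble_features(txn))
print(sorted(features))   # geo lookup timed out, so km_from_home is absent
```

The key design choice is that the score is assembled from whatever arrived in time, rather than blocking on the slowest source.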
How does the scoring model weigh different risk signals?
Not all signals carry equal weight, and the weights are learned from data rather than set by hand. The model trains on historical transactions that are labeled as fraudulent or legitimate, and it learns which combinations of signals were most predictive of each outcome.
The strongest signals in most fraud models fall into a few categories.
Velocity is usually the most informative. A card that has made three purchases in four countries in the past two hours is far more suspicious than the same card making a single purchase in its home city. The model does not just look at the current transaction in isolation. It looks at the rate and pattern of recent activity.
Geographic deviation matters similarly. The model computes how far the current transaction's location is from the customer's typical locations. A $40 grocery purchase from the same zip code as always scores very differently from a $40 purchase from a country the customer has never transacted in.
Device and behavioral signals are the third major category: whether the browser fingerprint is new, whether the typing speed on the checkout form was unusually fast (a signal of automated input), and whether the session came from a known proxy or VPN. McKinsey research from 2021 found that behavioral signals like typing cadence and mouse movement patterns improved fraud detection rates by 15–20% over models using transaction data alone.
Amount is a signal, but a weaker one than most people expect. The model cares more about whether the amount is unusual for this specific customer than whether it is large in absolute terms. A $5,000 transaction from a wholesale buyer who makes $5,000 purchases weekly is low risk. The same amount from an account that has never exceeded $200 is high risk.
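The signal families above can be turned into concrete features with simple arithmetic. This is a hedged sketch: the field names (`ts`, `country`, `home_lat`, and so on) are invented for illustration, and a production system would read these aggregates from a feature store rather than recompute them from raw history.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def risk_signals(txn, history):
    """Derive velocity, geographic deviation, and amount-vs-history features."""
    # Velocity: distinct countries touched in the past two hours.
    recent = [t for t in history["transactions"] if txn["ts"] - t["ts"] <= 2 * 3600]
    countries_2h = len({t["country"] for t in recent} | {txn["country"]})

    # Geographic deviation from the customer's usual location.
    km_from_home = haversine_km(txn["lat"], txn["lon"],
                                history["home_lat"], history["home_lon"])

    # Amount relative to this customer's own history, not in absolute terms.
    amounts = [t["amount"] for t in history["transactions"]]
    mean = sum(amounts) / len(amounts)
    std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5 or 1.0
    amount_zscore = (txn["amount"] - mean) / std

    return {"countries_2h": countries_2h,
            "km_from_home": round(km_from_home, 1),
            "amount_zscore": round(amount_zscore, 2)}

history = {"home_lat": 40.71, "home_lon": -74.01,
           "transactions": [{"ts": 0, "country": "US", "amount": 30.0},
                            {"ts": 1800, "country": "GB", "amount": 25.0}]}
txn = {"ts": 3600, "country": "FR", "amount": 800.0, "lat": 48.86, "lon": 2.35}
signals = risk_signals(txn, history)
print(signals)
```

Note how the amount feature is a z-score against the customer's own spending, which captures the "$800 on an account that averages $30" pattern directly.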
| Signal Category | Example | Why It Matters |
|---|---|---|
| Velocity | 3 transactions in 4 countries in 2 hours | Fraud rings exploit cards quickly before they are cancelled |
| Geographic deviation | Purchase from a country the customer has never used | Card data stolen and sold is often used abroad first |
| Device fingerprint | New device for an established account | Account takeover often starts with a new device login |
| Behavioral biometrics | Unusually fast form completion | Bots and scripts do not type like humans |
| Amount vs. history | $800 purchase on an account that averages $30 | Fraudsters test cards with small amounts first, then escalate |
| Merchant category | First-time purchase at a high-risk merchant type | Certain categories attract card-not-present fraud |
The model combines all of these through an ensemble of algorithms, typically gradient-boosted trees, where multiple weak predictions are combined into one strong one. The output is a score, not a rule. The threshold at which that score triggers a block or a review is set by the business based on its tolerance for false positives versus missed fraud.
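To make "many weak predictions combined into one strong one" concrete, here is a toy ensemble of hand-written decision stumps summed through a sigmoid. A real system learns thousands of stumps and their weights from labeled data (with a library such as XGBoost or LightGBM); the weights and thresholds below are purely illustrative.

```python
import math

# Each "stump" is a weak predictor on one feature; the ensemble sums
# their weighted votes. Weights here are made up for illustration.
STUMPS = [
    (lambda f: f["countries_2h"] >= 3, 1.8),    # velocity
    (lambda f: f["km_from_home"] > 500, 1.2),   # geographic deviation
    (lambda f: f["amount_zscore"] > 3, 1.5),    # unusual amount for this customer
    (lambda f: not f["device_seen_before"], 0.9),
]

def score(features, bias=-2.5):
    """Sum the weak votes and squash the margin into a (0, 1) risk score."""
    margin = bias + sum(w for test, w in STUMPS if test(features))
    return 1.0 / (1.0 + math.exp(-margin))

def decide(s, block_at=0.8, review_at=0.5):
    """The threshold is business policy, separate from the model itself."""
    if s >= block_at:
        return "block"
    return "review" if s >= review_at else "approve"

risky = {"countries_2h": 4, "km_from_home": 5800.0,
         "amount_zscore": 9.0, "device_seen_before": False}
safe = {"countries_2h": 1, "km_from_home": 3.0,
        "amount_zscore": 0.2, "device_seen_before": True}
print(decide(score(risky)), decide(score(safe)))
```

Separating `score` from `decide` mirrors the point in the text: the model outputs a probability, and the block/review thresholds are tuned by the business, not baked into the model.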
What infrastructure does real-time scoring require?
A fraud model that scores transactions accurately in testing but cannot return results in under 300 milliseconds in production is not a fraud system. It is an experiment. The infrastructure constraints are as defining as the model itself.
The most critical requirement is a feature store: a database that holds precomputed customer and account attributes, updated continuously as new transactions arrive, and readable in under 10 milliseconds. Without it, the scoring service would have to compute every signal from scratch for each transaction, which is far too slow. The feature store trades storage for speed. It keeps the answer to "what is this customer's average transaction amount over the past 30 days" ready to read at any moment, rather than computing it on demand.
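A minimal sketch of the trade the feature store makes: fold each new transaction into running aggregates at write time, so reads are constant-time at scoring time. An in-memory dict stands in for the real store (Redis, DynamoDB, or a dedicated system like Feast), and the field names are invented for illustration.

```python
# Illustrative in-memory stand-in for a feature store.
class FeatureStore:
    def __init__(self):
        self._rows = {}

    def get(self, customer_id):
        """O(1) read of precomputed features -- the hot path at scoring time."""
        return self._rows.get(customer_id, {"avg_amount_30d": 0.0, "txn_count_30d": 0})

    def update_on_transaction(self, customer_id, amount):
        """Incrementally fold a new transaction into the running aggregates,
        so the answer is always ready to read rather than computed on demand."""
        row = self.get(customer_id)
        n = row["txn_count_30d"] + 1
        # Running-mean update: avoids rescanning the customer's history.
        row["avg_amount_30d"] += (amount - row["avg_amount_30d"]) / n
        row["txn_count_30d"] = n
        self._rows[customer_id] = row

store = FeatureStore()
for amount in (20.0, 30.0, 40.0):
    store.update_on_transaction("c1", amount)
print(store.get("c1"))   # {'avg_amount_30d': 30.0, 'txn_count_30d': 3}
```

The running-mean update is the storage-for-speed trade in miniature: a little extra work on every write keeps every read instant.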
The scoring model itself is served as a separate microservice, meaning it runs independently and can be called by any other part of the system. This matters because fraud scoring is not the only thing that needs risk signals. Checkout flows, account creation, and login attempts all benefit from the same model, and a standalone service can power all of them without duplicating the logic.
To deliver scores in under 300 milliseconds globally, the scoring service needs to run in multiple regions so that a payment originating in Singapore is not scored by a server sitting in Virginia. Latency from geographic distance alone would blow the time budget.
According to a 2021 report from Juniper Research, the global cost of payment fraud was $32.39 billion, with card-not-present fraud accounting for the majority of losses. Companies investing in real-time scoring infrastructure at this scale are protecting against a measurable revenue leak, not a theoretical risk.
How do I tell whether my fraud scores are well-calibrated?
A score of 0.7 should mean that roughly 70% of transactions with that score turn out to be fraudulent. If it does not, the scores are not calibrated, and the thresholds you set based on them will not behave as expected. Calibration is distinct from accuracy, and it is the piece most teams check last.
The practical test is straightforward. Group all your scored transactions into buckets by score range: 0.0–0.1, 0.1–0.2, and so on. Within each bucket, look at what percentage actually turned out to be fraud. If your 0.7–0.8 bucket contains transactions that were only 20% fraudulent in reality, your model is overconfident. It is assigning high scores to transactions that are not actually that risky, which means your block threshold will catch too many legitimate purchases.
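The bucketing test described above is only a few lines of code. The data here is synthetic; in practice `scored` would come from joining historical scores with confirmed fraud labels.

```python
def calibration_table(scored, n_buckets=10):
    """scored: list of (score, was_fraud) pairs.
    Returns (bucket midpoint, actual fraud rate, count) per non-empty bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for score, was_fraud in scored:
        idx = min(int(score * n_buckets), n_buckets - 1)
        buckets[idx].append(was_fraud)
    table = []
    for i, outcomes in enumerate(buckets):
        if not outcomes:
            continue
        actual = sum(outcomes) / len(outcomes)
        table.append(((i + 0.5) / n_buckets, actual, len(outcomes)))
    return table

# Synthetic example: the 0.7-0.8 bucket is well calibrated (70% fraud),
# and the low bucket contains no fraud at all.
scored = [(0.75, True)] * 7 + [(0.75, False)] * 3 + [(0.05, False)] * 90
for midpoint, actual, n in calibration_table(scored):
    print(f"bucket ~{midpoint:.2f}: actual fraud rate {actual:.0%} over {n} txns")
```

An overconfident model shows up immediately in this table: high-score buckets whose actual fraud rate sits well below the bucket's score range.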
Two metrics tell most of the story. The false positive rate is the percentage of legitimate transactions your model blocks. Industry benchmarks put an acceptable false positive rate at under 1%. Carta's 2020 e-commerce fraud report found that for every dollar of fraud blocked, merchants lose between $13 and $25 in declined legitimate revenue, which means a poorly tuned model causes more financial damage than the fraud it prevents.
The detection rate, also called recall, is the percentage of actual fraud your model catches. A model catching 80% of fraud is a reasonable baseline. Models at mature fintech companies, operating on years of labeled transaction data, typically reach 90–95%.
| Metric | What It Measures | Acceptable Range | Failure Mode |
|---|---|---|---|
| False positive rate | Legitimate transactions blocked | Under 1% | Good customers abandon and churn |
| Detection rate (recall) | Actual fraud caught | 80–95% | Losses from uncaught fraud |
| Score calibration error | Gap between predicted and actual fraud rates | Under 5% | Thresholds behave unpredictably |
| Model drift rate | Score accuracy decline over time | Retrain if AUC drops >3% | Old model misses new fraud patterns |
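Both headline metrics fall out of a simple confusion-count calculation. The decision data below is synthetic, chosen so the numbers land inside the acceptable ranges in the table.

```python
def fraud_metrics(decisions):
    """decisions: list of (blocked: bool, was_fraud: bool) pairs."""
    fp = sum(1 for blocked, fraud in decisions if blocked and not fraud)
    tp = sum(1 for blocked, fraud in decisions if blocked and fraud)
    fn = sum(1 for blocked, fraud in decisions if not blocked and fraud)
    tn = sum(1 for blocked, fraud in decisions if not blocked and not fraud)
    return {
        "false_positive_rate": fp / (fp + tn),   # legitimate txns blocked
        "detection_rate": tp / (tp + fn),        # actual fraud caught (recall)
    }

# 1,000 legitimate transactions with 5 wrongly blocked;
# 20 fraud cases with 17 caught.
decisions = ([(True, False)] * 5 + [(False, False)] * 995 +
             [(True, True)] * 17 + [(False, True)] * 3)
m = fraud_metrics(decisions)
print(m)   # FPR 0.5%, detection rate 85%
```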
Models drift because fraud patterns change. A model trained on 2020 transaction data will miss fraud tactics that emerged in 2021. The standard practice is to monitor the model's accuracy metric (called AUC, which measures how well the model separates fraud from legitimate transactions) on a rolling basis and retrain when it drops more than a few percentage points from its peak.
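A pairwise-comparison AUC and a retrain trigger can be sketched directly. The scores below are synthetic; real monitoring would compute this over a rolling window of recently labeled transactions.

```python
def auc(scores_and_labels):
    """Probability that a random fraud case outranks a random legitimate one
    (equivalent to ROC AUC), computed by brute-force pairwise comparison."""
    pos = [s for s, y in scores_and_labels if y]
    neg = [s for s, y in scores_and_labels if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def should_retrain(peak_auc, current_auc, tolerance=0.03):
    """Flag retraining when rolling AUC drops more than `tolerance` from peak."""
    return (peak_auc - current_auc) > tolerance

# Synthetic example: the model separated classes perfectly at launch...
launch = [(0.9, True), (0.8, True), (0.2, False), (0.1, False)]
# ...but a new fraud pattern now scores low, eroding the separation.
drifted = [(0.9, True), (0.15, True), (0.2, False), (0.1, False)]

peak, current = auc(launch), auc(drifted)
print(peak, current, should_retrain(peak, current))   # 1.0 0.75 True
```

The brute-force version is O(n²) and fine for a sketch; production monitoring would use a rank-based formula or a library routine over much larger windows.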
If you are evaluating whether a fraud scoring system is worth building, the calibration question is the first one to answer. A score that is not calibrated cannot be reliably thresholded, which means the business is making block-or-approve decisions without a consistent understanding of what the score actually means.
Building and maintaining this kind of system requires a team that has done it before: data engineers who understand the feature store requirements, machine learning engineers who know how to retrain and monitor models in production, and enough labeled historical transaction data to get the model to a usable detection rate. Timespade's predictive AI practice works with companies at the stage where the fraud problem is real but the internal team to solve it does not yet exist. A full predictive AI team, including engineers, data infrastructure, and ongoing model monitoring, costs a fraction of what a single dedicated in-house machine learning hire costs in the US.
