The AML model your examiner just flagged was built in 2017. Here is what to do this quarter.

It is a particular kind of Friday afternoon. The exit conference is over, the examiners have left, and the CRO is forwarding the draft Matter Requiring Attention to a small circle. The finding reads, more or less, as it has read for a decade: the institution’s transaction monitoring system is not calibrated to its current risk profile. The thresholds are stale. The tuning methodology is not documented in a form the examiner can replicate. The below-the-line testing program has lapsed. The alert-to-SAR conversion rate is “inconsistent with the institution’s risk profile” — which is examiner code for: your alerts are theater, and your SARs do not come from the alerts you are generating.

I have written six versions of the report that closes this MRA in the past three years. The shape of the work is always the same. The shape of the failure is usually the same too. This piece is what I wish I could hand to a BSA officer on the Monday after.

What the finding usually says

The wording varies — MRA, MRIA, supervisory letter, consent order — and the regulator varies — OCC, FDIC, NCUA, the Federal Reserve, FinCEN through one of them — but the substance is almost always one or more of:

The transaction monitoring (TM) thresholds were set at implementation and have not been formally revisited.
The institution cannot produce a tuning methodology that another quantitative reader could replicate.
Below-the-line testing has not run on a defined cadence, or has not run at all.
The rule library is out of step with current typologies — no rules for fintech-rail layering, no rules for crypto on-ramps, no refresh of the layering typologies introduced in the last three years of FinCEN advisories.
Sanctions screening false-positive rates are high enough that meaningful hits are statistically diluted.

Every one of these is a calibration problem dressed up as a controls problem. They are real findings — the regulator is right — but they are fixable inside one examination cycle if the work is sequenced correctly.

Why the 2017 calibration broke

Most TM systems were calibrated by the vendor at implementation, in a two-week study, against three months of data — usually a sanitized subset because the institution did not have a comfortable way to extract production data at that point. The thresholds that came out of that study were appropriate for the institution that existed at the moment of implementation. Then the bank grew. The product mix changed. A new BaaS channel came online. The CFO restructured the corporate-treasury accounts to centralize liquidity. The wire room added a new correspondent. Each of those was a calibration event that did not trigger a recalibration.

The model that was right in 2017 is not the model that is right in 2026, and the institution’s failure was not the original calibration. The failure was that no one re-ran the study.

This matters because the remediation is not “rebuild the model.” It is “establish the discipline that the original calibration was supposed to have set up.” That distinction is the difference between a six-month engagement and a sixteen-month one.

Step 1 · Pull the data

The first deliverable is not a report. It is a clean dataset. Pull eighteen months of alert history with disposition codes, twelve months of SARs with the rule(s) that generated them, the production rule library with current thresholds, and a sample of transactions across the population — typically two to three percent depending on volume.

If the institution cannot produce this data inside two weeks, that is itself a finding. The TM system that cannot be queried at scale is a system that cannot defend itself.
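A minimal sketch of how the completeness check and the transaction sample might be made mechanical, assuming each extract arrives with a known start and end date. The function names, the `REQUIRED_MONTHS` mapping, and the seed are hypothetical and not part of any TM vendor's interface; the 2% default is the low end of the range above.

```python
import random
from datetime import date

# Lookback windows taken from the text: 18 months of alerts, 12 months of SARs.
REQUIRED_MONTHS = {"alerts": 18, "sars": 12}

def months_covered(start: date, end: date) -> int:
    """Calendar months from start to end (end exclusive).

    Assumes both dates fall on the first of a month; days are ignored.
    """
    return (end.year - start.year) * 12 + (end.month - start.month)

def coverage_gaps(extracts):
    """Return {extract_name: months_short} for every extract that falls short.

    `extracts` maps a name to a (start, end) date pair. A missing extract is
    reported as short by its full required window.
    """
    gaps = {}
    for name, required in REQUIRED_MONTHS.items():
        if name not in extracts:
            gaps[name] = required
            continue
        start, end = extracts[name]
        short = required - months_covered(start, end)
        if short > 0:
            gaps[name] = short
    return gaps

def sample_transaction_ids(txn_ids, rate=0.02, seed=20260101):
    """Draw a reproducible simple random sample of transaction IDs."""
    ids = sorted(txn_ids)  # fixed order, so the seed fully determines the draw
    if not ids:
        return []
    k = max(1, round(rate * len(ids)))
    return random.Random(seed).sample(ids, k)
```

Recording the seed and the sort order in the methodology memo is what lets a second reader redraw exactly the same sample.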

Step 2 · Below the line

Below-the-line testing means: pick a sample of transactions that fell below your current threshold and would not have generated an alert. Investigate them. If a meaningful share would have been worth investigating, the threshold is set too high.
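As a rough illustration, the selection step could look like the sketch below. It assumes a simple amount-based rule, transactions held as dictionaries with hypothetical `id`, `amount`, and `segment` fields, and a review band of 20% below the threshold. The band width and the per-stratum count are placeholders to be justified in the memo, and rules driven by counts or velocity would need a different definition of "just below the line."

```python
import random
from collections import defaultdict

def btl_sample(transactions, threshold, band=0.20, per_stratum=5, seed=2026):
    """Draw a stratified random sample of transactions just below a threshold.

    Only transactions with floor <= amount < threshold are eligible, where
    floor = threshold * (1 - band). Sampling is done separately within each
    customer segment so that small segments are not crowded out.
    """
    floor = threshold * (1 - band)
    strata = defaultdict(list)
    for txn in transactions:
        if floor <= txn["amount"] < threshold:
            strata[txn["segment"]].append(txn)

    rng = random.Random(seed)  # seeded so the draw can be replicated
    sample = {}
    for segment in sorted(strata):
        pool = sorted(strata[segment], key=lambda t: t["id"])
        sample[segment] = rng.sample(pool, min(per_stratum, len(pool)))
    return sample
```

Each sampled transaction then goes to an analyst for a full investigation, and the share judged worth investigating is the number the threshold decision rests on.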

This is the test the examiner is most likely to ask you to demonstrate. It is also the test most institutions cannot show they have run on a defined cadence. The expected output is a methodology memo that another quantitative reader could replicate, a sample design, the disposition of every sampled transaction, and a quantitative justification, with confidence intervals, for the threshold change you intend to make.

A defensible BTL study uses stratified random sampling, a sample size that produces a 90% confidence interval of ±10 percentage points around the suspicious-rate estimate, and disposition by an analyst who did not write the rule. Document each of those decisions; the methodology memo is half the value of the study.
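The sample-size arithmetic can be sketched as follows. Under the usual normal approximation with the worst-case proportion p = 0.5, the required size is n = z² · p(1 − p) / e², and with z ≈ 1.645 for 90% confidence and e = 0.10 that comes to 68 dispositioned transactions per estimate. The function names are hypothetical, and the Wilson score interval used for reporting is one reasonable choice for small samples rather than anything the standard itself prescribes. The sketch also treats each estimate as a simple random sample; a stratified design would apply it per stratum or adjust for the weights.

```python
import math
from statistics import NormalDist

def z_for(confidence: float) -> float:
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def btl_sample_size(confidence: float = 0.90, margin: float = 0.10, p: float = 0.5) -> int:
    """Smallest n with z^2 * p * (1 - p) / n <= margin^2 (normal approximation)."""
    z = z_for(confidence)
    return math.ceil(z * z * p * (1 - p) / margin ** 2)

def wilson_interval(hits: int, n: int, confidence: float = 0.90):
    """Wilson score interval for the share of sampled items judged suspicious."""
    z = z_for(confidence)
    p_hat = hits / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

print(btl_sample_size())            # 68
print(btl_sample_size(0.95, 0.05))  # 385, for comparison with a tighter standard
```

Calling `wilson_interval(hits, n)` on the dispositioned sample gives the interval to quote alongside the point estimate when proposing a threshold change.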

Step 3 · Above the line

Above-the-line testing is the inverse: pick a sample of alerts the system did generate and investigate whether the disposition was correct. A bad alert-to-SAR conversion rate is usually caused by alerts that should never have fired — false positives — not by analysts who failed to file SARs they should have.

Above-the-line testing produces three deliverables: the disposition of the sample, a categorization of false positives by root cause (overbroad rule, threshold too low, incorrect customer segmentation, duplicate-alert generation), and a proposed set of rule changes that address the most common causes. The last is what the BSA committee acts on.
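A small sketch of the two tallies behind those deliverables, assuming reviewed alerts are available as dictionaries. The field names (`rule`, `disposition`, `root_cause`) and the disposition labels are hypothetical and would need to be mapped to the institution's own disposition codes.

```python
from collections import Counter

def conversion_by_rule(alerts):
    """Alert-to-SAR conversion rate per rule.

    Each alert is a dict with a 'rule' and a 'disposition'; a disposition of
    'SAR' counts as a conversion, anything else does not.
    """
    alerts = list(alerts)  # allow generators, since the data is scanned twice
    fired = Counter(a["rule"] for a in alerts)
    sars = Counter(a["rule"] for a in alerts if a["disposition"] == "SAR")
    return {rule: sars[rule] / fired[rule] for rule in fired}

def root_cause_ranking(reviewed):
    """False positives counted by root cause, most common first."""
    counts = Counter(
        r["root_cause"] for r in reviewed if r["disposition"] == "false_positive"
    )
    return counts.most_common()
```

Rules with a conversion rate near zero and a single dominant root cause are the natural first candidates for the proposed rule changes.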

Step 4 · Rule refresh

The current rule library is almost certainly missing typologies the regulator now expects you to cover. The recent FinCEN advisories on convertible-virtual-currency layering, fentanyl-related funds movement, and elder-financial-exploitation patterns have updated the typology landscape several times in the past three years. New BaaS and fintech-rail typologies — funds moved through a sponsor-bank-fronted neobank into a peer-to-peer payment system into a crypto on-ramp — are not captured by rule libraries built before 2021.

For each typology you add, document: the typology description, the rule logic that detects it, the threshold proposed, the sample of transactions used to calibrate the threshold, and the expected alert volume. The institutions that add five new typologies without that documentation will produce a flurry of alerts the analysts cannot work and a tuning study the examiner cannot validate.
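One way to keep that discipline is to refuse to promote a rule until its record is complete. The sketch below is illustrative only: the `TypologyRecord` class and its field names are hypothetical and simply mirror the five items listed above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class TypologyRecord:
    """Documentation required before a new typology rule goes to production."""
    description: str = ""
    rule_logic: str = ""
    proposed_threshold: Optional[float] = None
    calibration_sample_ref: str = ""   # pointer to the sample used for calibration
    expected_alert_volume: Optional[int] = None

    def missing(self) -> List[str]:
        """Names of fields that are still empty (blank string or None)."""
        return [f.name for f in fields(self) if getattr(self, f.name) in ("", None)]

draft = TypologyRecord(description="Funds layered through a P2P app into a crypto on-ramp")
print(draft.missing())
# ['rule_logic', 'proposed_threshold', 'calibration_sample_ref', 'expected_alert_volume']
```

A threshold or volume of zero is treated as documented; only blank and unset values are flagged.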

Step 5 · Document

The tuning study has to live in a document that a quantitative reviewer at the FRB or the OCC could replicate. That is the standard. The minimum content is: scope of the study, data extracted and how, sampling methodology, statistical assumptions and confidence intervals, the disposition methodology, the analyst training and quality-control protocol, the findings, the threshold and rule changes proposed, the expected impact on alert volume and SAR yield, and the governance — who signed off on the changes and when.
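The "expected impact on alert volume and SAR yield" line is the one most often left as a narrative estimate, so a sketch of the arithmetic may help. It assumes a single amount-based threshold and a set of dispositioned records stored as `(amount, was_suspicious)` pairs; both the function name and that layout are hypothetical. In practice only sampled records carry a disposition, so real figures would need to be weighted back up to the full population.

```python
def threshold_impact(rows, old_threshold, new_threshold):
    """Compare alert count and suspicious yield under two thresholds.

    `rows` is a list of (amount, was_suspicious) pairs. A record alerts when
    its amount is at or above the threshold being evaluated.
    """
    rows = list(rows)

    def summarize(threshold):
        outcomes = [suspicious for amount, suspicious in rows if amount >= threshold]
        alerts = len(outcomes)
        hits = sum(1 for s in outcomes if s)
        return {
            "alerts": alerts,
            "suspicious": hits,
            "yield": hits / alerts if alerts else 0.0,
        }

    return {"old": summarize(old_threshold), "new": summarize(new_threshold)}
```

Reporting both rows side by side lets the committee see what a proposed change costs in analyst workload and what it buys in suspicious activity surfaced.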

This document is also the deliverable that goes to the model risk management committee under SR 11-7 if the institution has classified TM as a model — which most institutions now do.

Step 6 · Defend it in the room

The most important moment of the remediation is the one where the BSA officer, the CRO, and the validator sit across the table from the examiner and walk them through the methodology. The document is the artifact; the conversation is the test. Examiners are trained to find the place where the methodology is non-replicable, where the threshold has no quantitative justification, and where the documentation papers over something the institution does not want to discuss.

If you cannot defend it in the room, you have not finished the work — even if the document is two hundred pages long.

A note on partner-led validation

The reason this work fails in practice is rarely capacity. The institution can pay a consulting firm to produce the document. The reason it fails is that the document is written by people who are not quantitative practitioners, validated by people whose review consisted of reading the executive summary, and signed off by a partner who never opened the data file. Examiners read documents written by people who have done the work, and they can tell the difference. The validator who signs the report has to be one of those people. That is the line we draw.