AI underwriting on first-touch traffic

How to risk-score a user before their first transaction. Two-stage models, the signals that matter, and where LLMs change fraud rule-writing.

Hitpixel·Apr 18, 2026·8 min

AI underwriting on first-touch traffic is the discipline of forming a defensible risk view of a user before they have given you any transactional history. You have a session, an IP, a device, a few behavioral signals, and the BIN they are about to submit. From that, you have to decide whether to approve, challenge, or decline. Get it wrong in one direction and chargebacks eat the unit economics. Get it wrong in the other and you decline good customers who never come back.

This is the hardest underwriting problem in consumer payments because the ground truth label arrives weeks late, in small numbers, and from an adversarial counterparty. It is also the problem where modern ML methods have the largest payoff over the rule-based systems most operators still run.

The signal stack on a cold user

You have more than you think. The signals that consistently carry weight in production models, in roughly descending order of value:

Device fingerprinting

Not the 2014 canvas-fingerprint version. The 2026 version is a vector of a few dozen weakly identifying browser, OS, and hardware features, hashed and compared probabilistically. A first-time user on a device that has been seen 400 times in the last week behind a residential proxy is not a first-time user. Vendors here include Fingerprint, whose commercial Pro tier sits on top of the open-source FingerprintJS library, and the proprietary engines inside the larger fraud platforms.
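A minimal sketch of the probabilistic comparison, assuming each device is reduced to a small dictionary of hashed attributes. The attribute names and weights are hypothetical and only illustrate the idea that rarer, more stable attributes should count for more; a production engine would learn these weights rather than hard-code them.

```python
from typing import Mapping

# Hypothetical weights: attributes that are rarer and more stable count for more.
WEIGHTS = {
    "user_agent": 1.0,
    "screen": 0.5,
    "timezone": 0.5,
    "fonts_hash": 2.0,
    "gpu_renderer": 2.0,
}


def match_score(a: Mapping[str, str], b: Mapping[str, str]) -> float:
    """Weighted share of attributes that agree, over attributes present in both."""
    total = 0.0
    agree = 0.0
    for name, weight in WEIGHTS.items():
        if name in a and name in b:
            total += weight
            if a[name] == b[name]:
                agree += weight
    return agree / total if total else 0.0


seen_device = {
    "user_agent": "ua-1", "screen": "1920x1080", "timezone": "UTC+0",
    "fonts_hash": "f9a2", "gpu_renderer": "g77",
}
# Same hardware, but the user agent has been rotated between sessions.
new_session = dict(seen_device, user_agent="ua-2")
score = match_score(seen_device, new_session)
```

A score near 1.0 against a device already seen hundreds of times is the signal described above: the session claims to be new, the hardware says otherwise.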

IP reputation

Datacenter ranges, residential proxy networks, mobile carrier ranges, Tor exits, and known abuse pools each carry different default risk. The right operating posture is not to block by category but to weight the prior. A datacenter IP from a Hetzner range submitting a checkout for a $9 trial is one thing. The same range submitting a $400 order is another.
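One way to weight the prior instead of blocking by category is to shift a per-category base rate in log-odds space by order size. The sketch below is only that: the priors, the $10 baseline, and the `amount_weight` parameter are illustrative assumptions, not measured values.

```python
import math

# Hypothetical base rates per IP category (illustrative, not measured).
CATEGORY_PRIOR = {
    "mobile_carrier": 0.01,
    "residential": 0.02,
    "datacenter": 0.10,
    "residential_proxy": 0.20,
    "tor_exit": 0.40,
}


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ip_risk(category: str, order_amount: float, amount_weight: float = 0.5) -> float:
    """Shift the category prior in log-odds space by order size.

    Assumption: risk moves with log10 of the amount relative to a $10 baseline.
    """
    prior = CATEGORY_PRIOR[category]
    shift = amount_weight * math.log10(max(order_amount, 1.0) / 10.0)
    return sigmoid(logit(prior) + shift)


trial_risk = ip_risk("datacenter", 9.0)    # $9 trial from a datacenter range
order_risk = ip_risk("datacenter", 400.0)  # $400 order from the same range
```

No category is rejected outright. The same datacenter range simply produces a higher score on the $400 order than on the $9 trial, and the downstream model decides what to do with it.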

Behavioral biometrics

How fast the user types. Whether the cursor moves with human jitter or in straight lines. Time on the checkout page. Field-fill order. NeuroID and Castle have built businesses on this signal class. The single most predictive subfeature in our experience is paste-vs-type on the card number field, because card-testing scripts almost universally paste.
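A sketch of how the paste-vs-type subfeature could be derived server-side, assuming the checkout page forwards a log of input events for the card number field. The event shape (`input_type`, `length`) is hypothetical; the `insertText` and `insertFromPaste` values are the ones browsers report in `InputEvent.inputType`.

```python
from typing import Iterable, Mapping


def card_entry_mode(events: Iterable[Mapping[str, object]]) -> str:
    """Classify card-number entry as 'paste', 'type', or 'unknown'.

    Each event is assumed to carry the browser's inputType and the number
    of characters it inserted.
    """
    pasted = 0
    typed = 0
    for event in events:
        if event["input_type"] == "insertFromPaste":
            pasted += int(event["length"])
        elif event["input_type"] == "insertText":
            typed += int(event["length"])
    if pasted == 0 and typed == 0:
        return "unknown"
    return "paste" if pasted >= typed else "type"


scripted = [{"input_type": "insertFromPaste", "length": 16}]
human = [{"input_type": "insertText", "length": 1} for _ in range(16)]
```

The output is a feature, not a verdict. Password managers and autofill also avoid keystrokes, so the value belongs in the model alongside the other behavioral signals rather than in a hard rule.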

Declared-vs-detected geography

The user says they are in Manchester. The IP geolocates to São Paulo. The browser timezone is UTC+8. The shipping address is in Lagos. None of these alone is decisive. The disagreement pattern across them is.
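The disagreement pattern can be turned into features by comparing every pair of geographic signals. This sketch assumes each signal has already been resolved to a country code upstream (the timezone-to-country mapping in particular is a hypothetical lookup), and the signal names are placeholders.

```python
from itertools import combinations
from typing import Dict, Optional


def geo_disagreement(countries: Dict[str, Optional[str]]) -> dict:
    """Summarize how much the available geographic signals disagree."""
    known = {name: code for name, code in countries.items() if code}
    pairs = list(combinations(sorted(known), 2))
    mismatched = [(a, b) for a, b in pairs if known[a] != known[b]]
    return {
        "distinct_countries": len(set(known.values())),
        "mismatch_ratio": len(mismatched) / len(pairs) if pairs else 0.0,
        "mismatched_pairs": mismatched,
    }


session = {
    "declared": "GB",   # user says Manchester
    "ip": "BR",         # IP geolocates to Brazil
    "timezone": "SG",   # assumed country for a UTC+8 browser timezone
    "shipping": "NG",   # shipping address in Lagos
}
features = geo_disagreement(session)
```

Keeping the list of mismatched pairs, not just the ratio, lets the model learn that some disagreements (declared vs. shipping for a gift order) are benign while others are not.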

BIN intelligence

The first six (now eight) digits of the card identify the issuer, country, card type, and product tier. Prepaid BINs from certain neobanks correlate strongly with first-party fraud. Corporate BINs in B2C funnels often signal expense-policy abuse rather than fraud. Stripe Radar's documentation is the clearest public reference on how a modern processor uses BIN data alongside network signals.

The two-stage model

The architecture that wins in production is two-stage. Most teams skip the first stage and pay for it.

Stage one is a cheap, fast, deterministic pre-filter: hard rules and a small linear model that run in under 50 ms on every session. Its job is to kill the obvious: bot signatures, known-bad device hashes, IPs on a current attack list, BINs in active card-testing campaigns. Stage one should reject around 5 to 15 percent of first-touch traffic in a typical consumer funnel without ever touching the deep model.

Stage two is the expensive model. Gradient-boosted trees or a small transformer over the full feature set, called only on the traffic that survived stage one. This is where you spend your inference budget, because the marginal user is the one worth thinking about carefully. Stage two outputs a calibrated probability, not a binary label, and the threshold is set per product based on the cost of a false approve versus a false decline.
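The two stages and the per-product threshold can be sketched as follows. This is a simplified illustration, not a reference implementation: stage one is reduced to blocklist lookups (the small linear model is omitted), the deep model is any callable returning a calibrated fraud probability, and all names and field keys are hypothetical. The threshold follows from expected cost: decline when p × cost of a false approve exceeds (1 − p) × cost of a false decline.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class ProductCosts:
    false_approve: float  # expected loss when a fraudulent order is approved
    false_decline: float  # expected lost value when a good customer is declined


def decline_threshold(costs: ProductCosts) -> float:
    """Probability above which declining has the lower expected cost."""
    return costs.false_decline / (costs.false_approve + costs.false_decline)


def stage_one(session: Mapping[str, object], blocklists: Mapping[str, set]) -> bool:
    """Return True when a hard rule rejects the session."""
    return (
        session.get("device_hash") in blocklists["device_hash"]
        or session.get("ip") in blocklists["ip"]
        or session.get("bin") in blocklists["bin"]
    )


def decide(
    session: Mapping[str, object],
    blocklists: Mapping[str, set],
    deep_model: Callable[[Mapping[str, object]], float],
    costs: ProductCosts,
) -> str:
    if stage_one(session, blocklists):
        return "decline"  # the deep model is never invoked
    p_fraud = deep_model(session)
    return "decline" if p_fraud > decline_threshold(costs) else "approve"
```

With a false approve costing 90 and a false decline costing 10, the threshold lands at 0.10. A product with cheap chargebacks and expensive lost customers would sit much higher, which is the reason the threshold is per product and not global.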

This is the same architecture Sift and Riskified deploy under their respective brand names. The Sift fraud index publishes aggregate trend data each quarter that is worth reading even if you never buy the platform. The Square engineering blog on risk has published useful material on threshold calibration that translates directly to consumer payments.

How LLMs change rule-writing

The historical bottleneck in fraud is not modeling. It is feature engineering. An analyst notices a pattern in a Slack thread. "We are seeing a cluster of orders with German shipping addresses, US-issued prepaid cards, and a specific user-agent string." Translating that hunch into a production feature used to take an engineer two weeks. Often it never happened, and the pattern became tribal knowledge that left the company when the analyst did.

LLMs collapse that loop. The analyst describes the pattern in English. The model proposes a feature definition, generates the SQL or Python to materialize it, drafts the unit tests, and surfaces the historical false-positive rate against the last 90 days of labeled data. The analyst reviews and approves. The feature is in production the same afternoon.
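The artifact the analyst reviews might look something like the sketch below: a candidate feature for the pattern quoted earlier, plus a backtest against labeled history. Field names, the user-agent placeholder, and the `is_fraud` label are all assumptions for illustration; in practice the labeled set would be the last 90 days of production data.

```python
from typing import Callable, Iterable, Mapping

Order = Mapping[str, object]


def german_ship_us_prepaid(order: Order) -> bool:
    """Hypothetical feature for the analyst's pattern; field names are assumed."""
    return (
        order["ship_country"] == "DE"
        and order["bin_country"] == "US"
        and order["card_type"] == "prepaid"
        and "ExampleAgent/1.0" in str(order["user_agent"])  # placeholder string
    )


def backtest(rule: Callable[[Order], bool], labeled: Iterable[Order]) -> dict:
    """Score a candidate rule against orders that already carry a fraud label."""
    tp = fp = fn = tn = 0
    for order in labeled:
        flagged = rule(order)
        if order["is_fraud"]:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

The false-positive rate here is the share of known-good orders the rule would have flagged. Reporting precision and recall next to it gives the analyst a quick read on whether the pattern is real or a coincidence in a handful of orders.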

This is not speculative. It is how the better fraud teams already operate. The implication for operators is that the cost of a useful new fraud rule has dropped roughly an order of magnitude in the last 18 months, and the teams that have rebuilt their rule-writing workflow around it are catching attack patterns days faster than the teams that have not.

The risk in this workflow is that it makes it cheap to ship bad rules. Discipline matters. Every LLM-proposed feature should ship behind a shadow-mode evaluation period and carry a documented sunset date on which it is retired if it does not pull its weight.
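A minimal sketch of what shadow mode with a sunset date can mean in code, assuming rules are plain predicates over an order dictionary. The class and function names are hypothetical. The key property is that a shadow rule records what it would have done and never changes the live decision.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, List, Mapping


@dataclass
class ShadowRule:
    name: str
    predicate: Callable[[Mapping[str, object]], bool]
    sunset: date  # after this date the rule stops evaluating
    hits: List[object] = field(default_factory=list)

    def observe(self, order: Mapping[str, object], today: date) -> None:
        """Log a would-be hit; has no effect on the live decision."""
        if today <= self.sunset and self.predicate(order):
            self.hits.append(order.get("order_id"))


def live_decision(
    order: Mapping[str, object],
    shadow_rules: List[ShadowRule],
    today: date,
    decide: Callable[[Mapping[str, object]], str],
) -> str:
    decision = decide(order)  # production path, unaffected by shadow rules
    for rule in shadow_rules:
        rule.observe(order, today)
    return decision
```

Once chargeback labels arrive for the orders in `hits`, the rule is either promoted or left to expire at its sunset date.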

What this looks like in production

Hitpixel engineers payment gateways for clients in regulated and high-trust verticals, which means we see the same first-touch underwriting problem in many flavors at once. The architecture above is what runs underneath. A new launch on a client engagement starts with a richer prior than a standalone team could build alone, because the engineering primitives are shared across our practice. The technical detail is on our technology page; the verticals we engineer for are on the portfolio page.

The operating posture

A few principles that consistently separate the teams that get this right:

Calibrate, do not just classify. A model that outputs "fraud or not" is harder to operate than one that outputs a probability you can threshold per product.
Measure the cost of a false decline as carefully as the cost of a chargeback. Most teams measure one and guess the other.
Keep human review in the loop for the middle band. Full automation at the extremes, analyst review on the contested middle 5 percent, and feed those decisions back into the training set.
Adversarial drift is the default state. A model trained six months ago is decaying right now. Retraining cadence is a first-class operational metric.
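The middle-band principle can be sketched by sizing the review band from historical scores: take the share of scores closest to the decision threshold and route anything inside that range to an analyst. The 5 percent default mirrors the figure above; the function names and the choice to center the band on the threshold are assumptions.

```python
from typing import Sequence, Tuple


def review_band(
    scores: Sequence[float], threshold: float, review_share: float = 0.05
) -> Tuple[float, float]:
    """Return (low, high) spanning the review_share of scores nearest the threshold."""
    k = max(1, round(len(scores) * review_share))
    closest = sorted(scores, key=lambda s: abs(s - threshold))[:k]
    return min(closest), max(closest)


def route(p_fraud: float, band: Tuple[float, float]) -> str:
    """Automate the extremes, send the contested middle to human review."""
    low, high = band
    if p_fraud < low:
        return "approve"
    if p_fraud > high:
        return "decline"
    return "review"


historical_scores = [i / 100 for i in range(100)]
band = review_band(historical_scores, threshold=0.5)
```

Analyst outcomes on the review band are the most informative labels available, since they come from exactly the region where the model is least certain, so they are worth feeding back into the training set.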

The closing opinion

AI underwriting on first-touch traffic is one of the few remaining areas in consumer payments where the gap between a sophisticated operator and a default vendor configuration is wide enough to matter at the unit-economics level. The two-stage architecture is well understood. The signals are documented. The LLM-assisted rule-writing workflow is available to anyone who wants to build it. What is in short supply is the operational discipline to retrain on schedule, measure both error modes honestly, and resist the temptation to treat the deep model as a black box that does not need inspection. That discipline is the actual moat.
