The Arithmetic of False Accusation

I. The question of the week

The detector arrived as an answer to a panic. When generative writing tools became something any undergraduate could open in a browser tab, the institutional reflex was not to ask what the technology meant for the assignment but to find a machine that could police the old assignment unchanged — and the plagiarism vendors, Turnitin foremost among them, shipped one within months. The promise was clean: paste in the essay, receive a percentage, know who cheated. What the promise concealed is the subject of this week’s column. A classifier that decides whether a sentence was written by a person or a model does not return the truth; it returns a guess, expressed with a confidence the underlying mathematics does not earn. And every guess of that kind comes with a second number that the marketing does not print on the box — the rate at which it calls the innocent guilty.

That second number is where the cost lives. The arc this column traces runs between two things that have moved at different speeds: a conversation that spent most of 2025 talking about AI detection as a manageable problem, a tool to be tuned and trusted, and a body of statistical reality that has not budged at all. The conversation flipped from early skepticism to a long season of solution-optimism and only lately began to sour again. The reality was true before the first detector launched and will be true after the last one is switched off: that screening for a rare offense at scale can produce far more false alarms than true catches, that detectors learn to flag surface features rather than authorship, that running every student through the same test guarantees you will accuse some of them wrongly. This is a piece about why those two stories took so long to meet, and about who pays the interest while they stay apart.

II. What we’ve been saying

The talk arrived in a recognizable shape. At the close of 2024 the conversation about AI’s failure modes was still faintly skeptical in tone: the few voices addressing detection and its costs leaned critical, treating the new tools as suspect before they had been adopted. That posture did not last. Across the first two quarters of 2025 the framing inverted hard toward optimism, and the optimism took a specific form: not “this works” so much as “this can be made to work.” The risk was real, the genre conceded, but it was a risk to be governed. “10 AI dangers and risks and how to manage them,” IBM’s contribution to the genre, is built entirely on that grammar: name the pitfall, then hand over the management plan, the danger always safely upstream of a procedure. “How to build safe, secure and trustworthy AI capabilities” makes the same move at the level of institutional process: caution is invoked precisely so that deployment can proceed. “The AI Dilemma: Powering the Future or Fueling Our Fears?” stages the fear in its title and then resolves it, the way the genre almost always does, on the side of the future.

What made detection so seductive a thing to talk about optimistically is that it flatters a particular fantasy: the rare wrongdoer, caught by a tireless machine. The fantasy is not new and it is not confined to classrooms. As Janelle Shane writes in You Look Like a Thing and I Love You (2019), “Many of the most tempting problems to solve with AI are also problems prone to issues of class imbalance. It’s handy to use AI for fraud detection, for example, a situation where it can weigh the subtleties of millions of online transactions and look for signs of suspicious activity.” The banking sector reached for exactly that promise (“AI and Your Money” catalogs fraud detection as one of the technology’s settled wins), and the academic-integrity industry borrowed both the architecture and the rhetoric. Catching cheaters is fraud detection wearing a mortarboard. The trouble, which Shane names in the same breath and which the optimistic season declined to hear, is that “tempting” and “prone to issues” are describing the same systems.

The optimism was never quite unbroken, and it is worth being precise about where the cracks ran. Even at the height of the favorable framing, a counter-current insisted that the claims were running ahead of the goods. “I’ve worked in AI for 15 years. There are a few telltale signs of AI washing.” named the practice directly: companies rebranding ordinary software as artificial intelligence, exaggerating what their systems could do, with the warning aimed squarely at “regulated industries” where the gap between claim and capacity carries consequences. A detector sold to a dean is a regulated-industry product in everything but name; its output ends up in a disciplinary file. That a longtime technologist felt the need to publish a field guide to the exaggeration is itself the tell.

By the third quarter of 2025 the critical share of the conversation was climbing again, and the skepticism had matured from “they’re overselling” into “some of this should not be built at all.” “Not Every Problem Needs AI: A Solution Architect’s View On Responsible Tech” made the heretical argument from inside the industry: that the right number of AI deployments for a given problem is sometimes zero, and that the discipline worth cultivating is the discipline of declining. It is the sentence the detection debate had been avoiding. Plagiarism is a problem; it does not follow that a probabilistic classifier is the answer to it, and the long optimistic middle of 2025 had treated that inference as automatic.

This is also a conversation the publication has been having from an adjacent direction. An earlier essay on the social aspects of AI, in our briefing of 2025-09-16, traced how systems sold as neutral arbiters tend to redistribute their errors toward people already short of standing — a pattern that returns, below, as the central fact about who a detector’s false positives actually land on. The discourse, in other words, contained its own corrective the whole time. It simply spent three quarters talking over it.

III. What’s been happening

While the framing oscillated, the mathematics held still, and the mathematics is unforgiving. Start with the structure Shane identified. Cheating, like fraud, is the rare class: most submitted essays are honestly written. A classifier evaluated on its accuracy will look superb simply by guessing “honest” almost every time, which means the numbers vendors quote are nearly meaningless until you separate them into the two rates that matter: how often the tool catches a real cheat, and how often it convicts an honest student. That second rate need not be large to be catastrophic, because it is multiplied by the size of the honest majority. A detector that flags one honest essay in a hundred sounds reassuringly precise until it is run across a university’s entire submission stream, at which point it manufactures wrongful accusations by the thousand. Once genuine cheating is rarer than roughly one essay in a hundred, those false alarms outnumber, in absolute terms, the cheaters there are to catch.
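The arithmetic is short enough to check by hand, and a few lines of Python make it concrete. What follows is a minimal sketch, not a model of any real product: the function name and all four inputs (the number of essays, the share of cheaters, the catch rate, the false-accusation rate) are illustrative assumptions rather than measured figures.

```python
def expected_flags(n_essays, cheat_rate, true_positive_rate, false_positive_rate):
    """Return (true catches, false alarms, precision) for one pass of a detector."""
    cheaters = n_essays * cheat_rate
    honest = n_essays - cheaters
    true_catches = cheaters * true_positive_rate
    false_alarms = honest * false_positive_rate
    flagged = true_catches + false_alarms
    # Precision: of everyone flagged, what share actually cheated.
    precision = true_catches / flagged if flagged else 0.0
    return true_catches, false_alarms, precision


# Every number here is a hypothetical input chosen for illustration.
catches, alarms, precision = expected_flags(
    n_essays=200_000,          # assumed essays submitted in a year
    cheat_rate=0.005,          # assumed: 1 essay in 200 is machine-written
    true_positive_rate=0.90,   # assumed: detector catches 90% of real cases
    false_positive_rate=0.01,  # the "one honest essay in a hundred" above
)
print(f"true catches: {catches:.0f}")
print(f"false alarms: {alarms:.0f}")
print(f"share of flags that are correct: {precision:.1%}")
```

Under these assumptions the detector produces roughly 900 true catches against roughly 1,990 false alarms, so fewer than one flag in three points at a real cheat. Raise the assumed share of cheaters and the balance improves; lower it and the balance gets worse. The detector itself has not changed in either case.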

That structural problem is compounded by a second one, which is that these systems do not detect authorship at all. They detect correlates of authorship — statistical fingerprints like the evenness or predictability of a text — and treat those proxies as the thing itself. Shane’s most useful parable is about an image classifier built to recognize a fish called a tench. When its makers looked under the hood, “it showed them that it was looking for human fingers against a green background,” because the training photos were mostly trophy shots of anglers holding their catch. The detector had learned the human hands, not the fish. AI writing detectors learn their own version of human hands: clean, even, low-surprise prose. The students who write that way without any machine’s help — non-native English speakers drilled in formulaic construction, neurodivergent writers, anyone whose style runs to the plain and the regular — are the anglers’ fingers in the frame. They get flagged not because they cheated but because they resemble the proxy. The error is not a bug to be patched in the next release; it is what the system is.

A third multiplier finishes the job, and it is the one institutions understand least. Michael Kearns and Aaron Roth, in The Ethical Algorithm (2019), explain why testing many things at once corrupts the meaning of any single positive result. Their illustration is the multiple-comparisons problem and the correction it requires: “if you are reporting some event, then (absent fraud) it must have happened at least once during your k tries.” That is why the probability of getting a result by chance has to be multiplied by the number of times you went looking, an adjustment “called the Bonferroni correction, after Italian mathematician Carlo Emilio Bonferroni.” A detector run across every paper in every section of every course is performing that experiment tens of thousands of times a term. Improbable misfires become not just possible but expected; somewhere in that volume, the machine will produce a near-certain-looking accusation against someone who did nothing, purely as a function of how many times it rolled the dice. The institution reads the high-confidence flag as evidence. It is, statistically, sampling error wearing a uniform.
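The same point can be put in numbers. This is a rough sketch under stated assumptions: the two function names are hypothetical, the misfire rate and essay count are invented for illustration, and the checks are treated as independent of one another, which real submissions will not strictly be.

```python
def prob_at_least_one_false_flag(per_essay_false_rate, n_honest_essays):
    """Chance that at least one honest essay trips the flag, assuming independent checks."""
    return 1.0 - (1.0 - per_essay_false_rate) ** n_honest_essays


def bonferroni_threshold(family_error_rate, n_tests):
    """Per-test threshold that holds the family-wise error rate at or below the target."""
    return family_error_rate / n_tests


# Illustrative assumptions: a "near-certain" misfire that happens once in
# ten thousand honest essays, and 50,000 honest essays checked in a term.
rate, n = 1e-4, 50_000
print(f"expected misfires per term: {rate * n:.1f}")
print(f"chance of at least one: {prob_at_least_one_false_flag(rate, n):.3f}")
print(f"per-essay threshold for a 5% family-wise rate: {bonferroni_threshold(0.05, n):.0e}")
```

With those inputs the institution should expect about five high-confidence misfires a term, and the chance of getting through the term with none is under one percent. The Bonferroni line shows what taking the correction seriously would demand: a per-essay error threshold of about one in a million, far stricter than the one-in-ten-thousand misfire rate assumed here.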

None of this is exotic knowledge, and the broader record of AI’s reliability problems has been accumulating in plain sight. Stanford’s own running audit, the HAI AI Index Report 2024, documents both the speed of the field’s advance and the lag in our capacity to measure what it does, and the 2026 follow-up sharpens the point. “Inside the AI Index: 12 Takeaways from the 2026 Report” reports that “AI’s capabilities are advancing quickly; less so, our ability to measure and manage them,” which is the whole detection problem stated as an index finding. Reliability remains the soft spot everywhere it is examined: “AI could save billions but healthcare adoption is slow” attributes the caution precisely to “bias, algorithm drift, and unclear regulations,” and “AI Ethics and Regulatory Risk” records how readily AI’s confident-sounding outputs (hallucinations and misinformation) erode the trust that confident-sounding outputs depend on. The pattern is consistent across domains: these systems are most dangerous exactly where they are most fluent, because fluency is mistaken for accuracy.

And still the money flows toward them. “Your AI Budget Is Growing. Your Returns Aren’t. Here’s Why.” reports from Bain’s survey that nearly 40 percent of companies measuring AI cost savings landed below 10 percent, well short of their targets, “yet 90% are increasing their budgets again.” “HIMSS25: Navigating the cost-benefit dilemma of health AI” describes the same buyer’s predicament: a market so flooded with tools that organizations are “spoiled” for choice and starved for evidence. Detection sits inside this economy. The procurement decision precedes the proof, the budget renews regardless, and the burden of the unmeasured error rate is exported downstream to the people with the least power to contest it.

IV. Where they meet, where they miss

They miss, most fundamentally, on the meaning of a number. The optimistic conversation of 2025 spoke of detection accuracy as though a percentage were a verdict — as though “94 percent confident” described the student rather than the model’s uncertainty about the student. The statistics say something the marketing cannot afford to say: that a confidence score is a property of the classifier, not a fact about the accused, and that the same score which exonerates a thousand honest writers in aggregate will, by the arithmetic of base rates and repeated testing, brand a predictable handful of them as cheats. The rhetoric and the reality use the same digits and mean opposite things by them. That is the gap the whole arc has been failing to close.
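One way to see that the score belongs to the classifier and not to the student is to hold the classifier fixed and change only the room it is used in. The sketch below applies Bayes’ rule to a hypothetical detector assumed to be right 94 percent of the time in both directions; that reading of “94 percent,” the function name, and the base rates are all assumptions made for illustration.

```python
def prob_cheated_given_flag(base_rate, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged student actually cheated."""
    flagged_and_cheated = base_rate * true_positive_rate
    flagged_and_honest = (1.0 - base_rate) * false_positive_rate
    return flagged_and_cheated / (flagged_and_cheated + flagged_and_honest)


# Same hypothetical detector in every row (94% catch rate, 6% false-accusation
# rate); only the assumed share of cheaters in the population changes.
for base_rate in (0.20, 0.05, 0.01, 0.001):
    posterior = prob_cheated_given_flag(base_rate, 0.94, 0.06)
    print(f"cheaters are {base_rate:.1%} of essays -> "
          f"a flagged student cheated with probability {posterior:.2f}")
```

Under these assumptions the identical flag means about an 80 percent chance of guilt where one essay in five is machine-written, about 14 percent where it is one in a hundred, and under 2 percent where it is one in a thousand. Nothing about the detector or the student moved between those rows.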

Where they meet is in the late, grudging admission that some problems should be left unsolved by these tools, the position “Not Every Problem Needs AI” staked out and the closest the discourse has come to catching up with the math. But notice how the rest of the AI economy resists that conclusion even as it documents the case for it. The 2026 Private Equity AI Radar describes adoption “beginning to create real separation” among funds, the competitive pressure that makes declining to deploy feel like unilateral disarmament. “IBM Warns of AI Security Gaps in Latest Data Breach Report” finds adoption “outpacing security,” with the overwhelming majority of breached firms lacking basic access controls: the same shape as the detection story, capability shipped ahead of the governance that would make it safe to use. The institution that buys a detector is not behaving irrationally within this economy. It is behaving exactly as the economy rewards: adopting first, auditing never, and calling the result diligence.

This is where the column’s commitments become unavoidable. The cost of a false positive is not distributed by lottery. It concentrates, as our earlier essay of 2025-06-29 on the social aspects of AI argued, on the people whose prose already departs from the institutional default — the second-language writer, the student without the cultural fluency to mount a confident appeal, the one for whom a disciplinary flag is not an inconvenience but a visa, a scholarship, a future. The detector does not invent the inequality; it launders it, converting a stylistic difference into a presumption of guilt and handing the presumption to an authority that reads machine confidence as proof. A plagiarism case used to require a human to point at a source. The detector requires only a number, and the number, as we have seen, is a guess about the model wearing the costume of a fact about the student. The vendor sells certainty; the institution buys absolution from the labor of judgment; and the student supplies the error rate out of their own record. Anti-mystification means saying that transaction plainly: it is not a safeguard, it is a transfer of risk from the powerful to the accused.

V. The longer view

The detector will not improve its way out of this, because the problem is not a defect in any particular product but the structure of the task. You cannot drive the false-positive rate to zero without ceasing to flag anyone, and you cannot run a rare-event classifier across a whole population without generating wrong answers in proportion to how often you ask. The optimism of 2025 treated those facts as engineering challenges awaiting a better release. They are not. They are the terms of the trade, known since long before the first essay was pasted into the first box, set down plainly in the statistics the buyers chose not to read. The honest response was never a better detector. It was the harder, older institutional work the detector was purchased to avoid — designing assignments that resist substitution, and judging the rare real case with a human being’s eyes and a human being’s accountability for being wrong.