Seventy-Three Percent Is Not Proof

I. The question of the week

A high school freshman in Wake County, North Carolina, turned in an English assignment and got back an accusation. A detector had flagged her writing as machine-made; the grade and the suspicion followed. She did not accept it. She petitioned her district for clear rules on how such tools may be used against a student, and her case became, briefly, a small public argument about evidence — what counts as proof that a teenager cheated, and who gets to decide.

That argument is the topic of this week’s column, and it has a longer history than the single case suggests. When generative writing tools arrived in classrooms, detection software was sold as the answer — a way to restore the old order of authorship by mechanical means. The conversation around those tools has since traveled a long way, from confidence to unease to a curious posture the trade press now calls “making peace.” What has not traveled nearly as far is the underlying machine, which remains, by the best independent measurement, substantially less accurate than its vendors claim.

The arc of this piece runs along that gap. The rhetoric has matured — it has learned to sound reasonable, balanced, even contrite about false accusations — while the reality underneath it has barely improved. A tool that is right roughly three-quarters of the time is being used to make binary judgments about individual people, and the discourse has quietly reorganized itself around accommodating that fact rather than refusing it. The question of the week is not whether detectors work. It is what we have decided to do about the fact that they don’t.

II. What we’ve been saying

The detection story opened inside a louder one. By late 2024, the dominant register for talking about artificial intelligence was the register of awe shadowed by dread — the technology was advancing “faster than most of us can comprehend,” and the public was invited to react with fascination and fear in roughly equal measure. In that climate, detection software had an obvious rhetorical appeal. It promised to convert a vague civilizational anxiety into a manageable classroom procedure: run the essay through the checker, read the percentage, act on it. The appeal was less about the technology than about the relief it offered to institutions that did not know what else to do.

The relief did not survive contact with scrutiny. Across the first half of 2025, a steady counter-current ran through the broader AI conversation — a growing willingness to ask whether the products did what their makers said. By spring, the trade press was running explicit reliability checks against the hype, and longtime practitioners were naming the practice of “AI washing,” in which firms exaggerate their capabilities and rebrand ordinary software as intelligent. By summer, the critique had hardened into a public-interest argument: that audiences were being “bombarded with positive messages” and needed to learn to challenge the “good AI” myth that technology companies push. The skepticism that detectors would eventually face was first rehearsed on the industry as a whole.

When that skepticism reached the detectors themselves, it did not produce refusal. It produced a vocabulary of accommodation. The clearest marker of the turn is the headline that now frames the educator conversation: not “ban the detectors,” not “the detectors are broken,” but “making peace with AI detectors in schools.” The phrasing is worth sitting with. One makes peace with a difficult neighbor, a chronic condition, a fact of life, not with a faulty instrument that one is free to put down. The metaphor smuggles in a premise: that the tool is here to stay and the task is adjustment rather than judgment.

Alongside the peace-making register runs a softer, interrogative one — the genre that poses the problem as an open question rather than a settled wrong. The representative title asks whether AI detectors are really serving students, and the question mark does real work. It signals concern while deferring the verdict; it invites the reader to weigh a “double-edged sword” rather than to conclude that one edge is cutting people who did nothing wrong. This is the house style of the responsible-AI conversation generally — the posture that every harm is a tradeoff and every tradeoff is, in principle, balanceable.

The measurable shift in the conversation is one of volume and valence together. Through 2024 and early 2025 the detection topic surfaced only intermittently, and when it did the critical framings slightly outnumbered the optimistic ones. From the second quarter of 2025 onward the topic became loud — and, counterintuitively, the optimistic framings pulled ahead, with the reassuring “making peace” and “are they serving students” register outnumbering the flatly critical one. The discourse did not get more alarmed as the evidence accumulated. It got more accommodating. That inversion is the rhetorical event of this arc: the moment the conversation stopped asking whether to use these tools and started asking how to live with them.

There is precedent for distrusting that move. The most useful corrective comes from the journalism-and-academia projects that, as the journalist Meredith Broussard observed in Artificial Unintelligence (2018), have tried to put “a new, more balanced view of AI” on the horizon — institutions like the AI Now Institute, founded to interrogate exactly the systems being sold as neutral. Broussard’s “balanced” is not the trade press’s “balanced.” Hers means refusing to grant a technology the benefit of the doubt it has not earned; the trade press’s means splitting the difference between the vendor and the victim. The conversation about detectors has been using the second sense while borrowing the credibility of the first.

III. What’s been happening

Underneath the maturing rhetoric, the instrument itself has not been reformed so much as exposed. The single most important number in this arc comes from the independent studies aggregated in the same reporting that counsels peace-making: detection tools perform 15 to 33 percentage points worse than their vendors claim, with real-world accuracy averaging around 73 percent. Hold those figures against the use to which the tools are put. A 73-percent-accurate instrument is not deciding whether to recommend a film. It is being read as evidence in a disciplinary proceeding, where the output is treated as a verdict and the burden of disproving it falls on a teenager.

Worse than the average is the distribution of the error. The same body of evidence finds that non-native English writers face disproportionately high false-positive rates — their prose, often more formulaic by the ordinary effort of writing in a second language, reads to the detector as machine-made. This is not a random malfunction; it is a structured one, and it has a familiar shape. The mechanism was described, without reference to classrooms, by the researcher Janelle Shane in You Look Like a Thing and I Love You (2019): an image classifier trained to recognize a fish called a tench learned instead to look for “human fingers against a green background,” because nearly every photograph of a tench in its training data showed an angler holding the catch. The model was confident and wrong in a patterned way — it had learned a correlate of the thing instead of the thing. A detector that flags second-language phrasing as synthetic has made the tench’s mistake. It is finding fingers, not fish.

Shane names the other half of the trap as well. Many of the most tempting problems to hand to AI, she writes in the same book, “are also problems prone to issues of class imbalance” — the example she reaches for is fraud detection, where the system must weigh millions of cases to find the rare suspect ones. Cheating is exactly such a problem: most submissions are honest, and a detector tuned to catch the few dishonest ones will, at scale, generate false positives in large absolute numbers even at a modest error rate. The arithmetic is unforgiving. Run a 73-percent-accurate filter across a school district’s essays and the wrongly accused are not an edge case. They are a population.
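The arithmetic can be made concrete with a rough sketch. Everything numeric below except the 73 percent figure is an illustrative assumption, not a finding from the reporting: a hypothetical batch of 10,000 essays, a hypothetical 5 percent of them machine-written, and a detector whose 73 percent accuracy is assumed to apply equally to honest and dishonest work. Real detectors will not split their errors so evenly, so read this as the shape of the problem rather than a measurement of it.

```python
def flag_counts(total_essays, cheating_rate, sensitivity, specificity):
    """Expected (true_flags, false_flags) when a detector scans a batch of essays.

    sensitivity: share of machine-written essays the detector flags.
    specificity: share of honest essays the detector correctly clears.
    """
    dishonest = total_essays * cheating_rate
    honest = total_essays - dishonest
    true_flags = dishonest * sensitivity       # machine-written and flagged
    false_flags = honest * (1 - specificity)   # honest but flagged anyway
    return true_flags, false_flags


def wrong_share_of_flags(true_flags, false_flags):
    """Fraction of all flagged essays that were actually honest work."""
    return false_flags / (true_flags + false_flags)


# Hypothetical inputs: 10,000 essays, 5% machine-written (assumed),
# 73% accuracy assumed to hold for both honest and dishonest essays.
true_flags, false_flags = flag_counts(10_000, 0.05, 0.73, 0.73)
wrong_share = wrong_share_of_flags(true_flags, false_flags)

print(f"Correctly flagged essays: {true_flags:.0f}")
print(f"Honest essays flagged:    {false_flags:.0f}")
print(f"Share of flags that are wrong: {wrong_share:.0%}")
```

Under those assumed inputs, about 2,565 honest essays are flagged against about 365 machine-written ones, so nearly nine in ten accusations land on students who did nothing wrong. Raising the assumed cheating rate shrinks that share, but it takes a very high rate of dishonesty before the wrongly accused stop outnumbering the rightly caught.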

That population has begun to surface in the record. The Wake County case is one node; the broader reporting documents false accusations carrying real academic-career consequences, the kind that follow a student into transcripts and recommendation letters. The pattern is consistent with the wider audit of education technology conducted by the Center for Democracy and Technology, whose review of AI edtech failures found a market that has “exploded” with tools claiming capabilities the products do not reliably deliver. The detector is not an anomaly in that market. It is a representative specimen.

None of this is novel to anyone who has watched algorithmic decision-making move into high-stakes settings. The canonical warning predates the chatbot panic entirely. As the philosopher Mark Coeckelbergh recounts in AI Ethics (2020), COMPAS, the risk-assessment algorithm used by judges in Florida for parole and sentencing decisions, was found by the newsroom ProPublica to produce false positives (defendants “predicted to re-offend but who actually did not”) that fell disproportionately on Black defendants, while the false negatives fell disproportionately on white ones. The lesson there is precisely the lesson here: a system’s errors are not neutrally scattered, and the people who absorb them are rarely the people who chose to deploy the system. The detector has reproduced the structure of an older injustice in a new room.

The reliability problem also compounds across the broader information environment in which these tools sit. Generative systems themselves now fabricate at a measurable rate, with roughly 30 percent of some AI-generated outputs carrying hallucinated content. The same institutions deploying flawed detectors are, in other words, also fielding flawed generators. And the empirical literature on whether detectors can even keep pace remains unsettled; that is the question framed in the dedicated detector study catalogued in the AI Index. What has been happening, in short, is not a tool getting better while we learned to talk about it more wisely. It is a tool standing still while the talk grew more comfortable.

IV. Where they meet, where they miss

The rhetoric and the reality meet on one honest point: nearly everyone now concedes the detectors are imperfect. The “double-edged sword” framing, the “are they really serving students” question, the willingness to publish the 73-percent figure at all — these are not denial. The conversation has absorbed the fact of error. That is genuine progress over the early period, when the percentage on the screen was treated as a reading off an instrument rather than a guess with a confidence interval.

They miss on what the concession obligates. “Making peace” treats unreliability as a weather condition — something to be endured with better umbrellas, clearer district guidelines, a human in the loop to soften the machine’s verdict. But the COMPAS precedent that Coeckelbergh documents in AI Ethics (2020) is not a story about umbrellas. It is a story about due process: about whether it is permissible to let a patterned, unaccountable error rate determine a person’s standing when that person cannot inspect or contest the basis of the accusation. A 73-percent-accurate detector run on an honest student is not a tradeoff she agreed to. It is a tax on her, levied to spare the institution the harder work of judgment.

This is where the column’s sympathies are not balanced, and should not be. The reader of this piece is more likely to be — or to love — the accused freshman than the procurement officer who licensed the tool. And the structural fact is that the costs and the benefits sit with different people. The institution gets efficiency and the appearance of rigor; the falsely flagged student gets the burden of proving a negative against a number she cannot audit. The “making peace” frame is attractive precisely because it is addressed to the party holding the power, for whom peace is available. It is not on offer to the student, who did not start the war.

The equity dimension sharpens the miss. An earlier essay in our briefing of 2025-09-16 on the social aspects of these systems noted how readily tools sold as inclusion-enhancing reproduce the exclusions they claim to dissolve. The detector is a textbook case: a device that disproportionately misfires on non-native English writers, deployed in the name of fairness, producing a new unfairness with a clean technical alibi. Shane’s tench is the mechanism — the system keys on the wrong signal — but the consequence is not a misclassified fish. It is a student told that the very effort of writing in her second language looks, to the machine, like fraud.

What both the optimists and the question-askers tend to skip is the option of refusal. The market data points one way: leaders themselves are increasingly skeptical, with a substantial share describing expectations of what these systems deliver as “exaggerated.” Yet the institutional default remains adoption with mitigation rather than abstention. The honest synthesis is not that detectors need a human in the loop. It is that a 73-percent instrument has no business being in the loop of a disciplinary decision at all, and that “making peace” is the wrong verb for a tool one is free to set down.

V. The longer view

The detector’s defenders will say the alternative is chaos — a classroom with no way to tell authored work from generated work. That is the fear the technology was sold to relieve, and it is real. But the answer to an unanswerable question is not a confident wrong answer dressed as evidence. The provost who licenses a 73-percent tool to adjudicate cheating has not solved the problem of authorship; she has relocated its cost onto whichever students the machine happens to misread, and she has done it under cover of a number that looks like proof and isn’t. The maturing of the conversation — its new fluency in “tradeoffs” and “peace” — has mostly served to make that relocation sound responsible.

The longer view is that this arc is not really about detection. It is an early, legible instance of a pattern that will recur everywhere institutions face a hard human judgment and a vendor offering to automate it: the tool arrives, fails quietly, and the discourse adjusts to accommodate the failure rather than reject it. The students flagged in Wake County and elsewhere are the first to feel the bill, because the falsely accused always pay before the rest of us notice the rate.