Research Community Brief

Executive Summary

Fabricated Citations, Failing Detectors: The Empirical Holes Beneath AI-Education Research

Of 2,287 higher-education AI items surfaced this week from a 6,252-source corpus, the evidentiary base tilts heavily toward deployment announcements and courtroom filings rather than the measurement studies the field needs. The most diagnostic artifact of the week is a policy document, not a paper: South Africa’s national AI policy was found to cite fabricated, AI-generated references (South Africa’s AI policy cited fake research, created by AI). The methodological failure at the policy layer mirrors gaps the learning sciences have yet to instrument at the classroom layer.

The undertheorized problem is the detection-assessment loop. Institutions are deploying detectors whose validity is contested in print (Contra generative AI detection in higher education assessments), litigated in court (AI Detection Lawsuits: Every Student Case, Outcome, and What the Data …), and routed around by students using “humanizers” that are themselves AI (To avoid accusations of AI cheating, college students turn to AI). Meanwhile, the proposed alternative, authentic assessment redesign (Beyond Detection: Redesigning Authentic Assessment in an AI …), lacks comparative effectiveness data against the assessment regimes it would replace. Resolving the loop empirically would require multimodal process data, not product scoring; the help-seeking comparison work (Unpacking help-seeking process through multimodal learning analytics) is one of the few designs pointing in that direction.

A second gap: institution-scale interventions are being announced as accomplished facts. Surrey is embedding AI in every degree from September 2026 (Surrey embeds AI in every degree); Cal State’s OpenAI deal is already drawing refusal (Cal State struck a deal with OpenAI). Neither rollout has a registered evaluation protocol attached.

This briefing maps unstudied questions in the detection-assessment loop, identifies methodological limits in current authentic-assessment scholarship, and flags institution-scale rollouts whose absent evaluation designs are themselves a research opportunity.

Critical Tension

The Theoretical Problem

The week’s evidence converges on a contradiction the field has not yet theorized cleanly. On one side, institutions are embedding generative AI at the curricular layer as if its outputs were a stable input to learning: Surrey will require AI fluency in every degree from September 2026 (Surrey embeds AI in every degree from 2026); the Cal State system has signed an OpenAI deal that students and faculty are now refusing to use (Cal State struck a deal with OpenAI. Some students and …); ASU has rolled out an AI Course Builder over faculty objections (Faculty Concerned About ASU’s New AI Course Builder). On the other side, the same week produces evidence that the technology cannot yet do what the curricular embedding presumes: South Africa’s national AI policy was found to cite fabricated, AI-generated research (South Africa’s AI policy cited fake research, created by AI), and a French analysis reframes the problem as one in which “AI knows how to produce everything… but not yet to judge,” casting users as “operators of abundance” (L’IA sait tout produire… mais pas encore juger).

This is not a practical adoption lag. It is a theoretical gap. The dominant constructs (“AI literacy,” “augmentation,” “human-in-the-loop”) assume that judgment can be bolted onto a production system after the fact. Reporting on “persuasion bombs” suggests the opposite: generative outputs shape the evaluative frame before judgment engages (How generative AI ‘persuasion bombs’ users). What the field lacks is a model of cognition under conditions of pre-formed abundance: a theory that specifies when human judgment can recover authority from a fluent draft and when it cannot. Until that model exists, “embed AI in every degree” is a curricular bet without an articulated learning theory.

Paradigm Limitations

The tool metaphor is doing most of the analytic work in the institutional documents this week, and it is doing it badly. ChatGPT Edu is described as a deployment for campuses (ChatGPT Edu at OpenAI - OpenAI Help Center); ZotGPT at UC Irvine is positioned as helping faculty “design smarter classes” (#AnteaterIntelligence: Designing Smarter Classes with ZotGPT). The framing forecloses questions about who specifies the goal state of “smarter,” whose epistemic norms the model encodes, and what happens to disciplinary judgment when the same vendor stack mediates instruction across thousands of courses. A Canadian Public Policy paper now frames AI as an institutional risk-and-retention instrument, in which the algorithm is not a tool but a policy actor (Risk, Retention, and the Algorithmic Institution: Artificial Intelligence as a Policy Response to Higher Education in Crisis). That framing opens research questions the tool metaphor cannot: how does algorithmic mediation reshape the unit of analysis from the student to the cohort, and what counts as evidence of learning at the cohort scale?

Causal attribution in the assessment literature inherits the same limitation. The Adelphi lawsuit and the broader detection-lawsuit record locate failure in the student or the detector (Adelphi University accused a student of using AI to plagiarize. He …; AI Detection Lawsuits: Every Student Case, Outcome, and What the Data …); the redesign literature locates it in the assessment instrument (Beyond Detection: Redesigning Authentic Assessment in an AI … - MDPI). Neither locates it in the procurement decision that introduced the model into the pedagogical relation in the first place.

Whose Knowledge Is Missing?

The corpus this week (6,252 sources in total) is dominated by institutional, vendor, and faculty voices. Student accounts surface mostly in adversarial form: as defendants in detection cases (To avoid accusations of AI cheating, college students turn to AI - NBC News), as plaintiffs against institutions (An Adelphi University student was accused of using AI to … - Newsday), or as those refusing a system-level deal (Cal State struck a deal with OpenAI). What is absent is research that treats students as epistemic agents specifying their own learning goals against AI-mediated curricula: not as users whose engagement is measured, but as a population whose stated reasons for refusal are data.

Critical and community perspectives are even thinner. The entry-level-labor argument that AI is foreclosing the first rung of careers (AI won’t kill your job — it will kill the path to your first one) implies a research agenda that the higher-ed AI literature has not taken up: what does an undergraduate degree certify when the labor market it points to has been hollowed out? Linguistic-minority perspectives are largely absent: Luxembourgish speakers cannot yet talk to these systems in their own language (Parler à l’IA en luxembourgeois, un défi encore loin d’être …), and the discrimination risks in AI-mediated recruiting are documented but rarely connected to the same models now embedded in assessment (Utiliser l’IA pour recruter ? Attention aux risques de …). A research program that centered these voices would be forced to abandon the tool metaphor, and that, more than any methodological refinement, is what the field’s theoretical development now requires.

Actionable Recommendations

Research Directions: Where the Evidence Points

The week’s evidence base — vendor partnerships, detection lawsuits, policy documents citing fabricated sources, and AI-built courses — exposes research questions the field has been slow to formalize. Below are five directions that follow from documented gaps rather than from the conference circuit’s preferred framings.

1. The Accused Student as Methodological Subject

Current gap: Detection-tool research is dominated by tool-builders measuring tool accuracy. The student on the receiving end of a false accusation is a footnote. The Adelphi University lawsuit (An Adelphi University student was accused of using AI to …; Adelphi University accused a student of using AI to plagiarize. He …) joins a growing docket (AI Detection Lawsuits: Every Student Case, Outcome, and What the Data …) — yet the empirical literature still treats false-positive rates as a statistical property rather than as a lived adjudicative experience.

The field has largely approached this through performance benchmarking, which misses what the accusation does to writing behavior, help-seeking, and institutional trust. Reporting already documents that students now run their own work through “humanizers” prophylactically (To avoid accusations of AI cheating, college students turn to AI - NBC News).
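To make concrete why a false-positive rate is more than a statistical property, the minimal Python sketch below converts assumed rates into counts of students. Every number in it (10,000 submissions, 10% AI-written, 90% sensitivity, a 2% false-positive rate) is a hypothetical chosen for illustration, not a figure from any detector or from the sources cited here, and `flag_breakdown` is an illustrative helper name.

```python
def flag_breakdown(n_submissions, prevalence, sensitivity, false_positive_rate):
    """Split a detector's flags into true and false accusations.

    All inputs are assumptions supplied by the caller:
      prevalence          - share of submissions that really are AI-written
      sensitivity         - share of AI-written work the detector flags
      false_positive_rate - share of human-written work the detector flags
    """
    for rate in (prevalence, sensitivity, false_positive_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie between 0 and 1")

    ai_written = n_submissions * prevalence
    human_written = n_submissions - ai_written

    true_flags = ai_written * sensitivity               # correctly flagged
    false_flags = human_written * false_positive_rate   # wrongly accused
    total_flags = true_flags + false_flags

    # Positive predictive value: the chance that a flagged submission
    # is actually AI-written.
    ppv = true_flags / total_flags if total_flags else float("nan")
    return true_flags, false_flags, ppv


# Hypothetical scenario, for illustration only.
true_flags, false_flags, ppv = flag_breakdown(
    n_submissions=10_000,
    prevalence=0.10,
    sensitivity=0.90,
    false_positive_rate=0.02,
)
print(f"correct flags: {true_flags:.0f}")
print(f"false accusations: {false_flags:.0f}")
print(f"share of flags that are correct: {ppv:.1%}")
```

Under these assumed inputs, about 180 of the 1,080 flagged submissions would be human-written, so roughly one flag in six would land on a student who did nothing wrong. Lowering the assumed prevalence makes that share worse, which is one reason a probability score alone is weak evidence in an integrity hearing.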

Research questions:
- How do students whose work has been flagged (correctly or not) alter their drafting, citation, and revision practices in subsequent terms?
- What evidentiary standards do academic-integrity boards actually apply when the only “evidence” is a probability score?
- Do flagging rates vary by L2 status, disability accommodation, or discipline in ways that mirror prior assessment-bias literature?

Methodological considerations: Mixed-methods, with hearing-transcript analysis where institutions permit FOIA-equivalent access, paired with longitudinal interviews. The challenge is selection bias — students who sue are not representative. IRB review will be nontrivial where minors or active disciplinary cases are involved.
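For the question of whether flagging rates differ by group, a first-pass quantitative check could be a two-proportion z-test on flag counts. The sketch below uses only the Python standard library; the counts (30 of 400 L2 writers flagged, 45 of 1,600 other writers flagged) are invented for illustration, and `two_proportion_z` is a hypothetical helper, not a method drawn from the sources.

```python
import math


def two_proportion_z(flags_a, n_a, flags_b, n_b):
    """Two-sided z-test for a difference between two flagging rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")

    rate_a = flags_a / n_a
    rate_b = flags_b / n_b

    # Pooled rate under the null hypothesis that both groups are
    # flagged at the same underlying rate.
    pooled = (flags_a + flags_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in flags; the test is undefined")

    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical counts, for illustration only.
rate_l2, rate_other, z, p_value = two_proportion_z(30, 400, 45, 1600)
print(f"L2 flag rate: {rate_l2:.1%}, other flag rate: {rate_other:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

A significant difference in raw rates would not by itself establish detector bias, since discipline, assignment type, and actual AI use could all differ between groups. It would only indicate where the hearing-transcript and interview work should look more closely.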

Potential contribution: Reframes detection from a classifier-accuracy problem into a due-process and pedagogical-trust problem, where the field has more productive theoretical purchase.

2. Vendor Terms as Curriculum

Current gap: When a campus or system commits to an institution-wide AI deployment, as with Cal State’s OpenAI deal (Cal State struck a deal with OpenAI. Some students and …), Surrey’s embedding of AI across every degree (Surrey embeds AI in every degree from 2026), and ASU’s rollout of an AI Course Builder over faculty objection (Faculty Concerned About ASU’s New AI Course Builder), pedagogical decisions migrate into procurement contracts that faculty governance may never review. The product page itself frames the campus as a deployment target (ChatGPT Edu at OpenAI - OpenAI Help Center).

The dominant approach treats these deals as IT decisions. That framing overlooks how the vendor’s content policy, default model behavior, and update cadence become de facto curriculum. The recent piece on “AI as policy response to higher education in crisis” (Risk, Retention, and the Algorithmic Institution) gestures at this but does not yet treat the end-user license agreement (EULA) as a primary text.

Research questions:
- What pedagogical commitments are encoded in the default settings, system prompts, and content filters of campus-deployed LLMs, and who reviewed them?
- How do faculty senates’ charters intersect (or fail to intersect) with enterprise-software approval workflows?
- When a model updates mid-semester and changes outputs students rely on, what is the institutional remedy?

Methodological considerations: Document analysis of contracts (where obtainable through public-records law at state institutions), comparative case studies across the Cal State / Surrey / ASU spectrum, and shared-governance ethnography. Kate Crawford’s Atlas of AI is an apt theoretical anchor here for treating opacity as infrastructure.

Potential contribution: Establishes the procurement document as a genre of educational policy — a move the field has resisted but the evidence now demands.

3. Longitudinal Effects on Judgment Formation

Current gap: Nearly all empirical work on student AI use is single-semester or shorter. The hypothesis that matters — that students trained as “operators of abundance” lose the capacity to evaluate what the machine produces (L’IA sait tout produire… mais pas encore juger) — cannot be tested on a 14-week cycle.

Research questions:
- Across a four-year cohort, does sustained LLM use correlate with changes in source-evaluation behavior, hedging in argumentation, or willingness to defend a position under challenge?
- Do effects differ between disciplines where AI integration is mandated (Surrey-style) versus optional?
- What baseline measures of judgment formation are stable enough to track across years?

Methodological considerations: Cohort design with annual instruments; the obvious confound is that the technology itself will change repeatedly across the study window, so the construct under measurement must be the student’s evaluative practice, not the tool. Help-seeking research using multimodal traces (Unpacking help-seeking process through multimodal learning analytics) offers a usable methodological scaffold.

Potential contribution: Moves the conversation past short-term performance gains/losses to the durable cognitive question — which is where institutional decisions about curriculum embedding actually need to be made.

4. Sourcing Failure as an Institutional Pathology

Current gap: South Africa’s national AI policy was found to cite research that does not exist, generated by AI (South Africa’s AI policy cited fake research, created by AI). This is not a student-cheating story; it is a governance-document story. The research literature on hallucinated citations is overwhelmingly framed around classroom assessment, not policymaking.

Research questions:
- How prevalent are fabricated citations in white papers, accreditation self-studies, and grant proposals at higher-education institutions?
- What verification workflows existed before AI, and which of them are now load-bearing in ways their designers did not anticipate?
- Are review committees adjusting their methods, or relying on counter-detection?

Methodological considerations: Forensic auditing of a representative sample of recent institutional documents; partnership with libraries, which have the citation-checking infrastructure already. Risk: institutions will not volunteer to be audited.
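A sample-based audit of this kind yields a prevalence estimate with real uncertainty, and reporting an interval guards against over-reading a small sample. The sketch below computes a Wilson score interval using only the Python standard library. The audit result it uses (12 unverifiable citations among 300 sampled) is a made-up example rather than a finding, and `wilson_interval` is an illustrative helper name.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")

    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical audit: 12 of 300 sampled citations could not be verified.
low, high = wilson_interval(hits=12, n=300)
print(f"observed rate: {12 / 300:.1%}")
print(f"approximate 95% interval: {low:.1%} to {high:.1%}")
```

The Wilson interval is used here in place of the simpler normal approximation because fabricated citations are likely to be rare in any one sample, and the normal approximation behaves poorly when the observed proportion is close to zero.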

Potential contribution: Extends the integrity literature from individual-misconduct framing to institutional-process framing, which is where the systemic risk actually sits.

5. The Vanishing First Rung

Current gap: If agentic AI is hollowing out entry-level work (AI won’t kill your job — it will kill the path to your first one), the credential the institution sells loses the labor-market validation it was built on. Career-services research has not caught up.

Research questions:
- Which majors and which institutions are seeing the sharpest entry-level placement declines, and over what timeframe?
- How are students adapting their major choices and credential-stacking strategies in real time?
- What does internship structure look like when the tasks an intern would have done are now agentic outputs?

Methodological considerations: Requires sustained access to placement data that institutions consider reputationally sensitive; consortia-based research (the kind AAU or APLU could convene) is more realistic than single-institution studies. The framing must resist both vendor optimism and labor-economist fatalism.

Potential contribution: Connects AI-in-education research to AI-and-labor research — two literatures that currently barely speak — at the precise joint where both communities’ assumptions are about to fail.

Supporting Evidence

The Evidence Base, Read Skeptically

What we actually have to work with. This week’s corpus pulled 6,252 sources across all categories, with 2,287 in higher education. The HE pile is heavily weighted toward institutional announcements and trade-press commentary: vendor help docs (ChatGPT Edu at OpenAI), campus communications (UC Irvine’s #AnteaterIntelligence: Designing Smarter Classes with ZotGPT), and procurement news (Cal State struck a deal with OpenAI). Scholarly work, peer-reviewed or otherwise, is the minority strand: a multimodal learning-analytics comparison (Unpacking help-seeking process through multimodal learning analytics: A comparative study of ChatGPT vs Human expert), a systematic review (Reimagining Writing Assessment for the AI Era: A Systematic Review on Balancing AI Support and Authentic Skill Growth), a policy analysis (Risk, Retention, and the Algorithmic Institution), and a critical pre-print (Contra generative AI detection in higher education assessments). The ratio matters: the field’s center of gravity sits in announcement-and-reaction, not in measurement.

Whose voice is missing. Student perspectives appear almost exclusively either as subjects of detection regimes or as plaintiffs (An Adelphi University student was accused of using AI to … - Newsday; ‘We could have asked ChatGPT’: students fight back over course taught by AI). What is largely absent: contingent and adjunct faculty (the labor most exposed to course-builder automation at places like ASU, per Faculty Concerned About ASU’s New AI Course Builder), Global South scholarly voices outside the South Africa policy fiasco, and minority-language communities, a gap visible even in tooling (Parler à l’IA en luxembourgeois). When evidence is collected primarily through institutional and vendor channels, the questions asked tilt toward efficiency and risk-mitigation, not toward labor displacement or epistemic justice.

Failure patterns the literature rewards, and the ones it doesn’t. The most-cited failure modes are integrity failures (cheating, detection error) and procurement failures (faculty uptake and governance gaps, as in AI Leadership in Education). Far less examined are epistemic failures of the sort the South African policy demonstrated, with fabricated citations entering a national document, and labor-structural failures, like the entry-level-job collapse traced in AI won’t kill your job — it will kill the path to your first one. The field is well-resourced to study what students do; it is under-resourced to study what institutions and vendors do.

Discourse moves to watch. Two framings dominate. The first is inevitability: Surrey’s announcement that AI will be embedded in every degree from 2026 treats adoption as settled, not as a research question. The second is productivity-as-judgment: the conflation critiqued in L’IA sait tout produire… mais pas encore juger, where output volume substitutes for evaluative work. MIT Sloan’s reporting on persuasion bombs sits at an awkward angle to both: it names a mechanism most adoption studies do not measure.

Methodological thinness. The empirical work is overwhelmingly cross-sectional: single-semester, single-course, self-report-heavy. Longitudinal designs tracking the same cohort across an articulation pathway, or into hiring-market entry, are essentially absent. The redesign literature (Beyond Detection: Redesigning Authentic Assessment in an AI … - MDPI) proposes interventions faster than it measures them. Causal claims about learning outcomes routinely outrun the designs producing them, and almost no studies isolate vendor effects (ChatGPT Edu vs. Copilot vs. Claude) despite procurement decisions hinging on exactly that comparison.

What theory needs to do next. Three unresolved tensions are doing real work and deserve concept-building: the contradiction between detection regimes and the legal exposure they generate (visible in the Adelphi suits and surveyed in To avoid accusations of AI cheating, college students turn to AI); the gap between “AI literacy” as curricular goal and AI as silent infrastructure students are graded by; and the question of what authorship even denotes when the artifact is co-produced (a question sharpened by Writing with machines? Reconceptualizing student work in the age of AI). None of these will resolve through more adoption surveys.