It didn’t start with a paper. It started with a classroom.
I was teaching a unit on classic social psychology: the foundational studies that most of us in the field absorbed as canonical truth. Milgram’s obedience experiments. Zimbardo’s Stanford Prison Experiment. Baumeister’s work on ego depletion, the idea that willpower is a finite resource that tires with use, like a muscle. A student raised her hand partway through the session (not to be difficult; she was genuinely puzzled) and asked: “But how do we actually know any of this is true? Has anyone tried it again?”
I gave the kind of answer professors give when they don’t quite have one. I talked about replication in general terms, about peer review, about the weight of accumulated citation. It wasn’t a dishonest answer. But it wasn’t a complete one, either. Because the honest answer, as I knew even then, was something like: we mostly just believed it, because it had been believed for long enough.
That question has been following me since. And a new study published in 2026 — the most comprehensive effort yet to test the credibility of social science research — has given it a much sharper edge.
What the study found
The short version: roughly half of published social science findings, when subjected to careful replication attempts, do not hold up.
The longer version is more instructive. Coordinated by Brian Nosek, a psychology professor at the University of Virginia and executive director of the Center for Open Science, the project involved 865 collaborators from research institutions around the world. Across three papers published together in Nature, the project reviewed nearly 3,900 studies spanning 62 journals. From that pool, 164 papers from 54 journals were selected for active replication attempts. They covered eleven areas of social science, including health sciences, leadership, and organizational behavior, well beyond the psychology-heavy focus of earlier efforts.
Roughly half means: flip a coin. Those are approximately the odds that a finding you’ve read, cited, built a policy recommendation around, or taught in an undergraduate seminar will survive a serious attempt to reproduce it.
This isn’t the first time we’ve been here
For those of us who have been paying attention, the number isn’t a shock. In 2015, Nosek’s team conducted what was then the largest replication effort in psychology’s history — attempting to reproduce 100 psychology studies originally published in three leading journals. They succeeded only 39% of the time. The field registered a kind of collective flinch, debated whether the methodology was fair, and largely moved on.
The 2026 study is different in scale and scope. It extends across disciplines, across methodological traditions, across a decade of published literature. And the result — 49% — is in the same neighborhood as before. The wound hasn’t healed. It’s just bigger now.
“These numbers are unsurprising, in the sense that they are consistent with what we’ve seen in prior investigations. Nevertheless, they still are surprising in that our natural assumption as a reader of the published literature is that most of the findings would be replicable if we tried to do it again.”
— Brian Nosek, Center for Open Science
That tension Nosek names, unsurprising and yet somehow still surprising, is exactly where I live with this. As someone who has spent years both consuming and teaching this literature, I’ve known in the abstract that the field has problems. What I haven’t fully reckoned with is how much “knowing in the abstract” has let me avoid saying plainly.
The problem with the canonical
Here is what troubles me most: the findings that failed to replicate over the years weren’t confined to obscure papers buried in specialist journals. In many cases they were precisely the opposite. They were the elegant, counterintuitive results that made it into textbooks. The ones that became shorthand (“ego depletion,” “power posing,” “priming effects”) because they told a clean, compelling story about human behavior. The ones everyone cited, because everyone else already was.
The 2026 study includes a concrete example of what this looks like in practice. A 2012 paper published in the Journal of Organizational Behavior claimed a link between extraversion and emotional attachment to one’s organization: the idea that extraverted people bond more strongly to their workplace. UVA researchers, including Ryan Wright, tried to replicate this finding across six universities, measuring both extraversion and sense of organizational belonging. The original link did not appear.
That’s not a marginal finding being quietly corrected. That’s the kind of result that feeds into HR practices, hiring frameworks, and team-building decisions — exactly where social science findings go when they’re compelling enough to travel beyond the lab.
Nosek put it clearly when explaining why this matters: “Usually in research, we’re making a more general claim. ‘When you eat this kind of food, you have this kind of health outcome.’ And that should apply if you eat that food tomorrow.” The whole point of social science findings, at their most useful, is that they describe something durable about human behavior, not a one-time artifact of a particular sample, moment, and lab setting.
Why some findings don’t survive
It would be simpler if the problem were fraud. It mostly isn’t. The more common culprits are structural: publication bias toward positive, novel results; small samples that leave studies underpowered to detect real effects reliably; analytical flexibility that lets researchers, in good faith, make choices that inflate the apparent strength of a finding; and a career incentive structure that rewards discovery over rigor. None of these are individual failings. They are properties of the system social science built for itself over decades.
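A toy simulation makes the first two of these mechanisms concrete. What follows is an illustrative sketch, not anything drawn from the 2026 study. The true effect size (0.2 standard deviations), the 20 participants per group, and the rule that only significant results in the predicted direction get published are all assumptions chosen for clarity, and the function names are hypothetical.

```python
import random
import statistics
from math import sqrt


def simulate_study(true_effect, n_per_group, rng):
    """Simulate one two-group study with unit-variance outcomes.

    Returns (estimated effect, whether it was significant in the
    predicted direction). Uses a z-test with known variance, which is
    a simplification of what real studies do.
    """
    treatment = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    estimate = statistics.fmean(treatment) - statistics.fmean(control)
    standard_error = sqrt(2.0 / n_per_group)
    z = estimate / standard_error
    return estimate, z > 1.96


def run(true_effect=0.2, n_per_group=20, n_studies=5000, seed=0):
    """Publish only significant studies, then replicate each one once."""
    rng = random.Random(seed)
    published_estimates = []
    successful_replications = 0
    for _ in range(n_studies):
        estimate, significant = simulate_study(true_effect, n_per_group, rng)
        if not significant:
            continue  # assumed publication filter: null results go unpublished
        published_estimates.append(estimate)
        # Replication: same design, same sample size, fresh participants.
        _, replicated = simulate_study(true_effect, n_per_group, rng)
        successful_replications += replicated
    return {
        "true_effect": true_effect,
        "share_published": len(published_estimates) / n_studies,
        "mean_published_effect": statistics.fmean(published_estimates),
        "replication_rate": successful_replications / len(published_estimates),
    }


if __name__ == "__main__":
    for name, value in run().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the effect is real in every simulated study, yet the published estimates overstate it and most same-size replications come up empty. No one in the simulation behaves badly; the filter alone does the work.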
The 2026 study also attempted something novel: using artificial intelligence to predict which papers would replicate before the replication attempts were made. The AI showed some signal, but the researchers concluded the models were not yet reliable enough to serve as meaningful filters. The SCORE team described the results as leaving considerable room for improvement before such tools could be used at scale.
What to do with this
I want to resist two easy conclusions. The first is the cynical one, “science is broken, you can’t trust any of it,” which is both factually wrong and practically useless. Replication failure doesn’t mean a finding is certainly false; it means it didn’t hold under different conditions, which is itself information. The second is the dismissive one, “this is just how science works, findings get revised, it’s fine,” which uses the self-correcting nature of science as a shield against reckoning with how slowly and incompletely it actually corrects itself, and how much damage an unreplicated finding can do in the years before anyone checks.
The more honest position is something like: hold findings more lightly. Distinguish between a result and a mechanism. Ask not just whether a study found something, but how many times it’s been found, by whom, with what populations, under what conditions. Understand that replication is not a test of whether a researcher was competent or honest — it’s a test of whether a phenomenon is real and general enough to travel across contexts. Many interesting things aren’t, and that’s not a scandal; it’s a limit worth knowing.
When my student asked how we know any of this is true, she wasn’t being contrarian. She was doing exactly what the field should have been doing all along. The uncomfortable answer — the honest one — is that we often don’t know as reliably as we’ve implied. What we can do is say so, build in that uncertainty, and stop mistaking the confidence of citation for the confidence of evidence.
That’s a harder classroom to teach in. It’s also a truer one.