Personality tests are taken by an estimated 60 to 70 per cent of job applicants in the United States who reach a certain stage in the hiring process. They are administered in clinical settings, used in couples counselling, distributed at corporate retreats, and taken voluntarily by millions of people every year who want to understand themselves better. The Myers-Briggs Type Indicator alone has been administered to roughly 50 million people. And the research case for using these instruments to predict how a specific person will actually behave in a specific situation is, after decades of study, limited.
That gap between use and evidence is not a secret. It has been documented since at least the 1960s. The persistence of the tools anyway is one of the more instructive facts about how organisations make decisions, and about what people actually want when they reach for a personality framework.
What the research on predictive validity actually shows
The foundational challenge was laid out with unusual clarity by Walter Mischel in his 1968 book Personality and Assessment. Mischel reviewed the available evidence and found that the correlation between personality test scores and actual behaviour in specific situations was consistently weak, typically in the range of 0.2 to 0.3. He called this the personality coefficient. The implication was not that personality did not exist but that people’s behaviour across different situations was far more variable than trait-based models assumed. The person who is organised at home may not be organised at work. The person who is extroverted with close friends may be reserved in formal settings. The same underlying traits, whatever they are, express differently depending on context in ways personality inventories largely fail to capture.
The debate that followed, sometimes called the person-situation debate, occupied personality psychology for most of the 1970s and 1980s. It was not cleanly resolved. The eventual consensus was something like: broad traits have modest predictive power over aggregated behaviour across many situations, but their ability to predict what a specific person will do in a specific situation is low. That is a meaningful distinction if you are, say, trying to decide whether to hire someone for a particular role.
The Big Five personality model, developed through factor analysis of trait language and refined through the work of Paul Costa and Robert McCrae among others, is probably the most rigorously validated framework currently in use. Conscientiousness, in particular, has a reasonably consistent positive association with workplace performance across a wide range of studies. The correlations are real. They are also modest, typically around 0.2 to 0.3. Because the share of variance explained is the square of the correlation, a conscientiousness score accounts for somewhere between four and nine per cent of the variance in outcomes. The remaining ninety-plus per cent is attributable to other things.
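The arithmetic behind those percentages is simple enough to check by hand, and a minimal sketch may make it concrete. It assumes a single predictor and a linear relationship, in which case the proportion of variance explained is the square of the correlation. The function name variance_explained is illustrative and not drawn from any particular statistics library.

```python
def variance_explained(r: float) -> float:
    """Share of outcome variance accounted for by one predictor with correlation r.

    Assumes a simple linear relationship with a single predictor,
    where the proportion of variance explained equals r squared.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# The two ends of the range quoted in the text.
for r in (0.2, 0.3):
    explained = variance_explained(r)
    print(f"r = {r:.1f}: {explained:.0%} explained, {1 - explained:.0%} unexplained")
```

Run as written, this prints 4% explained and 96% unexplained for a correlation of 0.2, and 9% and 91% for a correlation of 0.3. Squaring is why a correlation that sounds respectable leaves so much of the outcome unaccounted for.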
The MBTI problem
The Myers-Briggs Type Indicator is worth treating separately because its popularity far exceeds its validity evidence. It classifies people into 16 types based on four binary dimensions: introversion-extroversion, sensing-intuition, thinking-feeling, and judging-perceiving. The instrument has been studied at length, and the most consistent finding is that the types are not stable. Research on retest reliability has repeatedly found that roughly 50 per cent of respondents receive a different type classification when they retake the test four to five weeks later. That figure is acknowledged even in favourable reviews of the instrument, though the Myers-Briggs Company disputes it on the grounds that it derives from a 1979 version of the instrument no longer in use.
A classification system that reassigns roughly half its users on a second sitting is measuring something, but it is not measuring a stable underlying trait. What it may be measuring is something closer to mood, immediate context, or the way a person interprets the questions on a given day. None of those are trivial things to understand about someone. They are simply not what personality is generally taken to mean, and not what the instrument’s promotional materials claim it is capturing.
The continued use of the MBTI in hiring and team development is often justified on grounds other than predictive validity: it gives people a vocabulary for discussing differences, it prompts reflection, it creates a shared frame. These are not nothing. But they are arguments for a communication tool, not a selection instrument. The distinction matters because using personality classification in hiring decisions introduces the possibility of systematic bias, removes decisions from the texture of actual interview performance, and gives people scores they didn’t earn in categories that may not reflect how they will behave in the job.
Why the tools persist anyway
The more interesting question is not whether personality tests predict behaviour well. The evidence is clear enough: they predict some things weakly, other things not at all, and the prediction degrades sharply when you move from aggregate tendencies to specific situations. The question is why organisations and individuals keep using them.
Part of the answer is that decisions have to be made, and something that produces a number or a category feels more defensible than a purely subjective judgement. Hiring managers who select or reject candidates based on test scores have an artefact they can point to. That is psychologically useful even if the artefact is not particularly informative. There is also a genuine selection pressure in large organisations toward tools that can be applied at scale, consistently, without requiring expert interpretation at each application. Personality tests are cheap to administer and score. Structured behavioural interviews, which have stronger predictive validity evidence, are expensive and require trained interviewers.
In therapeutic settings the picture is somewhat different. Clinicians use standardised instruments for genuine reasons: to track change over time, to communicate efficiently with other clinicians, to satisfy insurance and institutional requirements. Some instruments used in clinical contexts have reasonable construct validity for their stated purpose. The problem is when a tool designed to describe broad tendencies gets used as though it explains specific behaviour, or when a label derived from an inventory becomes the dominant frame for understanding a person.
What self-knowledge is actually like
One underexamined dimension of the personality test literature is the finding that people are often not particularly accurate about their own traits. Simine Vazire has done substantial work on self-other agreement in personality, and the consistent picture is that accuracy depends on the kind of trait. Other people’s assessments are often better predictors of behaviour for evaluative traits like intellectual ability. Self-reports are more accurate for traits that are more internal, like neuroticism and anxiety-adjacent states. For socially visible traits like extraversion, self- and other-ratings predict equally well. The domains in which we know ourselves best are not always the domains we tend to focus on when we seek self-understanding.
This matters because a large part of the appeal of personality testing is the promise of insight. People take the tests to learn something about themselves. What the tests reliably return is a description that the person then interprets in the light of what they already believe. The confirmation bias in self-assessment is well-documented: we tend to accept test results that match our prior self-concept and subtly discount those that don’t. The test, in this sense, is often less a mirror than a surface that reflects what we brought to it.
The gap between a useful frame and a predictive tool
None of this amounts to an argument that personality frameworks are worthless. A useful vocabulary for discussing how people tend to engage with situations, prefer to process information, or generally orient toward other people can support genuine reflection and better communication. The Big Five in particular, with its evidence base and its avoidance of the typological trap, offers something more than most. What it offers is a broad description, not a prediction.
The risk is treating the description as though it were a prediction. Hiring someone because they are high in conscientiousness is not meaningless, but it should not substitute for finding out whether they can do the specific tasks the role requires. Telling a therapy client that they are high in neuroticism describes a broad tendency; it does not explain their particular difficulties, and if the label starts doing the work that careful attention to the actual person should be doing, it becomes an obstacle rather than a tool.
The persistence of personality tests, across six decades of somewhat unflattering validity research, is itself a kind of data. It suggests that the demand being met by these instruments is not quite the one they officially claim to serve. People want to be categorised, perhaps because categories make the self feel more legible. Organisations want selection processes that look systematic, perhaps because that is more legally and culturally defensible than acknowledging the degree of uncertainty involved. The tests survive because they are doing social and psychological work that their predictive validity scores do not capture.
That is a less flattering account than the brochure provides. It is probably closer to what is actually happening.