Over the past year, dozens of studies have warned that AI is poised to upend white-collar work. Yet buried in the footnotes of nearly all of them is a detail that ought to give readers pause: the exposure scores driving the headlines are not calculated by economists. They are generated by an AI model, most often GPT-4 as used in a 2024 study by OpenAI, which reads occupation descriptions and decides how automatable each task is.
That methodology has now been stress-tested. Michelle Yin, a researcher at Northwestern University, took all 705 occupations in the US occupational coding scheme and ran the original analysis through four different models: the GPT-4 used in the OpenAI study, plus newer systems from OpenAI, Anthropic and Google. The results were jarring. Estimates of the share of jobs at risk swung from under 15 per cent when judged by Google's Gemini to 50 per cent when judged by Anthropic's Claude. On the share of jobs more than 10 per cent exposed, GPT-4 said around 50 per cent, its successor GPT-5 put the figure just above that, and Claude 4.5 — the newest model tested — landed at 80 per cent.
The disagreement doesn't just shift a number; it can flip the entire conclusion. Using GPT-4's scores, AI exposure had a weak negative effect on employment, suggesting modest job loss. But using Gemini's scores, the same regression produced a weak *positive* effect — meaning the jobs flagged as most exposed actually grew. Same data, same methodology, different judge, opposite story.
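To see how a change of judge alone can reverse a regression's sign, consider a minimal sketch in Python. Every number below is invented for illustration and none comes from the study; the names `ols_slope`, `judge_a` and `judge_b` are hypothetical, and the one-variable fit merely stands in for whatever specification the researchers actually used.

```python
def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var

# Invented employment growth (per cent) for four hypothetical occupations.
growth = [2.0, 1.0, 0.0, -1.0]

# Invented exposure scores for the same occupations from two hypothetical judges.
judge_a = [0.1, 0.3, 0.5, 0.7]
judge_b = [0.7, 0.5, 0.3, 0.1]

slope_a = ols_slope(judge_a, growth)  # exposure scored by judge A
slope_b = ols_slope(judge_b, growth)  # same jobs, scored by judge B

# Report only the sign of each fitted slope.
print("judge A slope is negative:", slope_a < 0)
print("judge B slope is positive:", slope_b > 0)
```

The employment figures never change between the two fits; only the exposure column does, and in this toy case that is enough to turn a negative slope into a positive one.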
Why do the models diverge? Yin attributes part of the gap to newer systems 'knowing' more about their own expanded abilities and about emerging AI tools that didn't exist when GPT-4 was trained. But there is also a stylistic component: newer models are simply more confident, and confidence inflates exposure scores even where real capability has not changed. Claude, in particular, rated occupations from CEOs to factory-floor supervisors as highly exposed to automation. Gemini, asked the same question, treated those same roles as relatively safe.
The authors argue that the fix is straightforward in principle: any serious analysis of real-world AI impact should run its exposure measure through multiple models and compare the results. Where the models agree, conclusions can be trusted. Where they diverge — as they often do — the honest answer is uncertainty. They also point to a broader implication. The EU's GDPR already gives individuals the right to a 'human review' of consequential automated decisions, such as a denied loan or a rejected job application. A logical extension would be requiring a second or third AI 'opinion' as well — running the same decision through a different vendor's model to see whether the outcome holds. Hiring platforms like HireVue, which feed video interviews through proprietary models to score candidates, would be obvious test cases.
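The cross-checking idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's procedure: the occupations, the scores, the model labels and the 0.25 agreement threshold are all invented, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# Invented exposure scores (0 to 1) per occupation, one per hypothetical model.
# These values are for illustration only and do not come from the study.
scores = {
    "chief executive": {"model_a": 0.20, "model_b": 0.75, "model_c": 0.10},
    "data entry clerk": {"model_a": 0.85, "model_b": 0.90, "model_c": 0.80},
}

def summarize(per_model, max_spread=0.25):
    """Average the models' scores and flag whether they roughly agree.

    max_spread is an arbitrary, assumed cut-off for 'agreement'.
    """
    values = list(per_model.values())
    spread = max(values) - min(values)
    return {
        "mean": round(mean(values), 2),
        "spread": round(spread, 2),
        "agree": spread <= max_spread,
    }

report = {occupation: summarize(s) for occupation, s in scores.items()}

for occupation, summary in report.items():
    verdict = "models agree" if summary["agree"] else "uncertain"
    print(f"{occupation}: mean={summary['mean']}, spread={summary['spread']}, {verdict}")
```

Reporting the spread next to the average keeps the disagreement visible instead of hiding it inside a single headline number.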
The deeper question the study raises is interpretive. Different AI systems are not just calibrated differently; they appear to be thinking about the labour market in fundamentally different ways. One views a CEO's job as a stack of automatable tasks; another sees it as judgement, leadership and risk-taking that a chatbot cannot replicate. Until that gap is understood, the confident percentages in newspaper headlines say less about the future of work than they do about the model that produced them.
Dozens of headlines claim AI is coming for white-collar work — but the entire field rests on a secret most readers miss: the verdicts are written by the AI models themselves, and they wildly disagree.
Economists keep publishing studies estimating how 'exposed' jobs are to AI disruption. Buried in the footnotes of most of these studies is an awkward detail: the exposure scores aren't produced by humans. They're produced by an earlier AI model, usually GPT-4 as used in a 2024 OpenAI paper, which reads job descriptions and rates how automatable each task is.
A new study by Michelle Yin at Northwestern ran the exact same methodology across four different models (the original GPT-4, plus newer ones from OpenAI, Anthropic and Google) on all 705 occupations in the US classification system. The models disagreed dramatically. Estimates of how many jobs are at risk ranged from under 15% (Gemini) to 50% (Claude). On GPT-4, the share of jobs more than 10% exposed was about 50%; on GPT-5, just over 50%; on Claude 4.5, a striking 80%.
The issue here isn't really about AI; it's about what happens when a measurement instrument is also the thing being measured.
If you're choosing a college major or thinking about which career path is 'AI-proof,' the advice you're getting is built on shaky foundations. Policymakers writing retraining programmes, companies planning layoffs, and journalists writing scary headlines are all leaning on numbers that can more than triple depending on which chatbot was asked (from under 15% to 50% in the Yin study). The honest answer to 'will AI take this job?' is closer to 'it depends who you ask, including which AI you ask.'
This is an early example of a problem that will define the next decade: AI systems are increasingly the judges, scorers and gatekeepers in decisions that affect real lives — hiring, lending, medical triage. The Yin study suggests a future where 'model disagreement' becomes its own field, and where regulations might require decisions to be cross-checked across competing AI systems before they stick. Watch for the first lawsuit where someone argues they were denied a job by one model that another model would have approved.