technology · economics · June 01, 2026

Who Decides Which Jobs AI Will Take? The Answer Will Surprise You

📰 Reading Passage

Over the past year, dozens of studies have warned that AI is poised to upend white-collar work. Yet buried in the footnotes of nearly all of them is a detail that ought to give readers pause: the exposure scores driving the headlines are not calculated by economists. They are generated by an AI model (most often GPT-4, as used in a 2024 study by OpenAI), which reads occupation descriptions and decides how automatable each task is.

That methodology has now been stress-tested. Michelle Yin, a researcher at Northwestern University, took all 705 occupations in the US occupational coding scheme and ran the original analysis through four different models: the GPT-4 used in the OpenAI study, plus newer systems from OpenAI, Anthropic and Google. The results were jarring. Estimates of the share of jobs at risk swung from under 15 per cent when judged by Google's Gemini to 50 per cent when judged by Anthropic's Claude. On the share of jobs more than 10 per cent exposed, GPT-4 said around 50 per cent, its successor GPT-5 put the figure just above that, and Claude 4.5 — the newest model tested — landed at 80 per cent.

The disagreement doesn't just shift a number; it can flip the entire conclusion. Using GPT-4's scores, AI exposure had a weak negative effect on employment, suggesting modest job loss. But using Gemini's scores, the same regression produced a weak *positive* effect — meaning the jobs flagged as most exposed actually grew. Same data, same methodology, different judge, opposite story.
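To see how a conclusion can flip with the judge, here is a minimal Python sketch. It is not the study's regression: it fits a plain one-variable least-squares slope with no controls, and every number in it is invented for illustration. The names `ols_slope`, `exposure_judge_a` and `exposure_judge_b` are hypothetical.

```python
def ols_slope(x, y):
    """Slope of a one-variable least-squares fit of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    if var == 0:
        raise ValueError("x has no variation")
    return cov / var


# Invented toy data for four occupations, NOT figures from the study.
employment_growth = [0.02, -0.01, 0.03, 0.00]

# Two hypothetical judges scoring the same four occupations.
exposure_judge_a = [0.2, 0.8, 0.1, 0.5]
exposure_judge_b = [0.6, 0.1, 0.7, 0.3]

slope_a = ols_slope(exposure_judge_a, employment_growth)
slope_b = ols_slope(exposure_judge_b, employment_growth)

print(f"judge A slope: {slope_a:+.4f}")
print(f"judge B slope: {slope_b:+.4f}")
```

The employment figures are identical in both fits. Only the exposure column changes, and in this toy case that is enough to turn a negative slope into a positive one.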

Why do the models diverge? Yin attributes part of the gap to newer systems 'knowing' more about their own expanded abilities and about emerging AI tools that didn't exist when GPT-4 was trained. But there is also a stylistic component: newer models are simply more confident, and confidence inflates exposure scores even where real capability has not changed. Claude, in particular, rated occupations from CEOs to factory-floor supervisors as highly exposed to automation. Gemini, asked the same question, treated those same roles as relatively safe.

The authors argue that the fix is straightforward in principle: any serious analysis of real-world AI impact should run its exposure measure through multiple models and compare the results. Where the models agree, conclusions can be trusted. Where they diverge — as they often do — the honest answer is uncertainty. They also point to a broader implication. The EU's GDPR already gives individuals the right to a 'human review' of consequential automated decisions, such as a denied loan or a rejected job application. A logical extension would be requiring a second or third AI 'opinion' as well — running the same decision through a different vendor's model to see whether the outcome holds. Hiring platforms like HireVue, which feed video interviews through proprietary models to score candidates, would be obvious test cases.
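The multi-model check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names, the toy scores and the 0.15 tolerance are all hypothetical, and a real analysis would need a principled way to choose the tolerance.

```python
def spread_by_occupation(scores_by_model):
    """Max minus min exposure score for each occupation scored by every model."""
    per_model = list(scores_by_model.values())
    if not per_model:
        raise ValueError("need scores from at least one model")
    shared = set(per_model[0]).intersection(*per_model[1:])
    return {
        job: max(m[job] for m in per_model) - min(m[job] for m in per_model)
        for job in sorted(shared)
    }


def classify(scores_by_model, tolerance=0.15):
    """Label each occupation 'agree' or 'uncertain' (tolerance is an assumption)."""
    spreads = spread_by_occupation(scores_by_model)
    return {
        job: "agree" if spread <= tolerance else "uncertain"
        for job, spread in spreads.items()
    }


# Invented scores from three hypothetical judges, NOT figures from the study.
toy_scores = {
    "judge_a": {"clerk": 0.50, "ceo": 0.70},
    "judge_b": {"clerk": 0.55, "ceo": 0.15},
    "judge_c": {"clerk": 0.45, "ceo": 0.40},
}

verdicts = classify(toy_scores)
for job, verdict in verdicts.items():
    print(job, verdict)
```

In this toy case the judges roughly agree on the clerk, so a conclusion about that job is on firmer ground, while the chief executive is reported as uncertain rather than given a single number.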

The deeper question the study raises is interpretive. Different AI systems are not just calibrated differently; they appear to be thinking about the labour market in fundamentally different ways. One views a CEO's job as a stack of automatable tasks; another sees it as judgement, leadership and risk-taking that a chatbot cannot replicate. Until that gap is understood, the confident percentages in newspaper headlines say less about the future of work than they do about the model that produced them.

📖 Explanation

Dozens of headlines claim AI is coming for white-collar work — but the entire field rests on a secret most readers miss: the verdicts are written by the AI models themselves, and they wildly disagree.

📖 What's Going On?

Economists keep publishing studies estimating how 'exposed' jobs are to AI disruption. Buried in the footnotes of most of these studies is an awkward detail: the exposure scores aren't produced by humans. They're produced by an earlier AI model — usually GPT-4 from a 2024 OpenAI paper — which read job descriptions and rated how automatable each task is.

A new study by Michelle Yin at Northwestern ran the exact same methodology across four different models (the original GPT-4, plus newer ones from OpenAI, Anthropic and Google) on all 705 occupations in the US classification system. The models disagreed dramatically. Estimates of how many jobs are at risk ranged from under 15% (Gemini) to 50% (Claude). On GPT-4, the share of jobs more than 10% exposed was about 50%; on GPT-5, just over 50%; on Claude 4.5, a striking 80%.
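The headline statistic, the share of jobs more than 10% exposed, is a simple threshold count once each occupation has a score. A minimal Python sketch, assuming scores are stored as fractions between 0 and 1; the function name `share_above` and the toy numbers are hypothetical and not taken from the study:

```python
def share_above(scores, threshold=0.10):
    """Fraction of occupations whose exposure score exceeds `threshold`."""
    if not scores:
        raise ValueError("scores must not be empty")
    exposed = sum(1 for score in scores.values() if score > threshold)
    return exposed / len(scores)


# Invented scores for four made-up occupations, NOT figures from the study.
toy_scores = {
    "model_a": {"clerk": 0.40, "welder": 0.05, "analyst": 0.30, "nurse": 0.08},
    "model_b": {"clerk": 0.55, "welder": 0.12, "analyst": 0.60, "nurse": 0.20},
}

shares = {model: share_above(per_job) for model, per_job in toy_scores.items()}
for model, share in shares.items():
    print(f"{model}: {share:.0%} of occupations more than 10% exposed")
```

Because the statistic is a cutoff, a judge that nudges several borderline scores just past 10% can move the headline share a long way without rating any single job very differently.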

🎯 How To Think About It

The issue here isn't really about AI — it's about what happens when a measurement instrument is also the thing being measured. A few parallels make the problem sharper:

Imagine grading the SAT by asking four different teachers to score the same essays, and one passes under 15% of them while another passes 50%. The 'score' tells you more about the grader than the student.
It's like asking restaurants to rate their own food. The newer, more confident models rate more tasks as AI-doable — partly because they genuinely can do more, but partly because they're trained to sound capable.
Or think of pre-election polls: same voters, same questions, but the methodology each pollster picks quietly determines the headline number.

💡 Key Things To Know

The foundational study most headlines trace back to is a 2024 OpenAI paper that used GPT-4 to score 705 occupations.
When Yin re-ran the analysis using Gemini's scores, AI exposure had a weak *positive* effect on employment — the most-exposed jobs actually grew. Flipping to other models flips the story.
Claude rated occupations from CEOs to factory-floor supervisors as highly exposed; Gemini rated the same jobs as low-exposure. The models are 'thinking' about the question in fundamentally different ways.
The EU's GDPR already gives people the right to a 'human review' of consequential automated decisions (loans, job applications). A natural next step is requiring a second or third AI 'opinion' — running the same case through a different vendor's model.
Most people assume one neat number ('X% of jobs at risk') reflects reality. It actually reflects which model the researcher happened to query.

🌟 Why It Matters

If you're choosing a college major or thinking about which career path is 'AI-proof,' the advice you're getting is built on shaky foundations. Policymakers writing retraining programmes, companies planning layoffs, and journalists writing scary headlines are all leaning on numbers that can more than triple depending on which chatbot was asked: the share of jobs at risk runs from under 15% to 50%. The honest answer to 'will AI take this job?' is closer to 'it depends who you ask, including which AI you ask.'

🔮 The Bigger Picture

This is an early example of a problem that will define the next decade: AI systems are increasingly the judges, scorers and gatekeepers in decisions that affect real lives — hiring, lending, medical triage. The Yin study suggests a future where 'model disagreement' becomes its own field, and where regulations might require decisions to be cross-checked across competing AI systems before they stick. Watch for the first lawsuit where someone argues they were denied a job by one model that another model would have approved.

📚 Key Terms Glossary

AI exposure

An estimate, usually expressed as a percentage of an occupation's tasks, of how much of a job could plausibly be done by AI. It's a measure of *potential* automation, not actual job losses.

Large language model (LLM)

An AI system trained on huge amounts of text to predict and generate language. Examples include GPT-4, GPT-5, Claude and Gemini. In this article they're being used as judges that score jobs.

Occupational coding scheme

A standardised government list that divides the labour market into discrete jobs — the US version contains 705 occupations and is used to track employment statistics.

Methodology

The specific recipe a study follows — what data it uses, how it measures things, what assumptions it makes. Two studies can use the 'same' methodology but reach opposite conclusions if one ingredient changes.

GDPR

The European Union's General Data Protection Regulation, which gives individuals legal rights over how their personal data is used, including the right to demand human review of important automated decisions.

Non-deterministic

A system that can give different answers to the same input on different runs. Most modern LLMs are non-deterministic under their default settings, so even a single model can return different exposure scores from one run to the next.

Devil's advocate

Someone who argues a position they don't necessarily believe in, in order to stress-test the opposite view. Used in the article to introduce a counter-argument about why newer models' higher scores might still be useful.
