
An international group of scientists tested leading artificial intelligence language models using the Stroop test, a classic psychological tool for measuring attentional focus, and the results were surprising. The study was published in the journal PNAS Nexus.

The test works as follows: subjects are shown color names written in a different color and must name the actual color, ignoring the word itself. For example, the word “red” written in blue requires the answer “blue.” Humans handle this task quite confidently, even with long lists, because the brain can suppress automatic responses.

Led by Suketu Patel, the researchers administered this test to the models GPT-4o, Claude 3.5 Sonnet, GPT-5, Claude Opus 4.1, and Gemini 2.5. With short lists of 5 words, all systems performed well. However, as list length increased, accuracy dropped sharply: GPT-4o gave 91% correct answers at 5 words, only 57% at 10 words, and merely 15% at 40 words. Claude 3.5 Sonnet held up at lists of up to 20 words, after which its accuracy collapsed to 24%.

According to the study’s authors, the models essentially “forget” the instruction and revert to what they were most heavily trained on: reading words. This fundamentally distinguishes them from humans, who can sustain voluntary attention for extended periods.
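
For readers who want to see the mechanics of the task, here is a minimal Python sketch of how a Stroop list could be built and scored. It is an illustration only, not the study's actual protocol: the color palette, the function names `make_stroop_list` and `accuracy`, and the representation of each item as a (word, color) pair are all assumptions made for this example.

```python
import random

# Hypothetical palette; the article does not say which colors the study used.
COLORS = ["red", "blue", "green", "yellow"]


def make_stroop_list(n, seed=0):
    """Return n incongruent (word, color) pairs: the word never names its own color."""
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        word = rng.choice(COLORS)
        # Pick the display color from every option except the word itself.
        color = rng.choice([c for c in COLORS if c != word])
        items.append((word, color))
    return items


def accuracy(items, answers):
    """Fraction of answers that name the display color rather than the word."""
    if len(items) != len(answers):
        raise ValueError("need exactly one answer per item")
    if not items:
        return 0.0
    correct = sum(
        answer.strip().lower() == color
        for (_, color), answer in zip(items, answers)
    )
    return correct / len(items)


if __name__ == "__main__":
    items = make_stroop_list(10)
    color_namer = [color for _, color in items]  # follows the instruction
    word_reader = [word for word, _ in items]    # reverts to reading the words
    print(accuracy(items, color_namer))  # 1.0
    print(accuracy(items, word_reader))  # 0.0
```

Because every item is incongruent, a responder that simply reads the words scores zero, which is the failure mode the authors describe. A model's accuracy at a given list length would fall somewhere between these two extremes.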