Are We Testing AI’s Intelligence the Wrong Way?

When it comes to understanding the true nature of artificial intelligence, many people turn to Melanie Mitchell, a computer scientist and professor at the Santa Fe Institute. Her 2019 book, *Artificial Intelligence: A Guide for Thinking Humans*, has played a significant role in shaping the modern discussion about what AI systems can and cannot do. Recently, at NeurIPS—the largest annual conference for AI professionals—Mitchell delivered a keynote titled “On the Science of ‘Alien Intelligences’: Evaluating Cognitive Capabilities in Babies, Animals, and AI.” Before her talk, she shared insights with IEEE Spectrum about why today’s AI should be studied more like nonverbal minds, what lessons AI researchers can learn from developmental and comparative psychology, and how improved experimental methods could transform how we assess machine cognition.

Mitchell uses the phrase “alien intelligences” to describe both AI and biological minds such as babies and animals. She explained that the term comes from a paper by neural network pioneer Terrence Sejnowski, who compared ChatGPT to a space alien that can communicate with us and appears intelligent. Developmental psychologist Michael Frank also uses this idea, noting that developmental psychology studies alien intelligences—namely, babies. Mitchell’s point is that the methods used to study these “alien” minds might be useful for analyzing AI intelligence as well.

Current Challenges in Evaluating AI Cognition

When people talk about evaluating intelligence in AI, they often mean different things. Some focus on reasoning, others on abstraction or world modeling. Mitchell prefers the term “cognitive capabilities” because it is more specific. She looks at how developmental and comparative psychology evaluate these capabilities and tries to apply similar principles to AI.

Today, AI evaluation typically involves running systems through benchmark tests and reporting accuracy scores. However, Mitchell points out that even though many AI systems perform exceptionally well on these benchmarks—sometimes surpassing humans—this success often does not translate to real-world performance. For example, an AI that passes the bar exam might not make a good lawyer in practice. These systems often excel at the specific questions in the test set but struggle to generalize beyond them. Moreover, many tests designed for humans rest on assumptions that do not hold for AI systems, such as the assumption that the test-taker has not simply memorized the answers, which a model trained on vast amounts of internet text may well have done.
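As a rough illustration of the benchmark-and-accuracy paradigm Mitchell is describing, here is a minimal sketch in Python; the `model` callable, the question set, and the exact-match scoring are hypothetical stand-ins, not any particular benchmark or API.

```python
# Minimal sketch of the standard benchmark-and-accuracy paradigm.
# The `model` callable and the question set are hypothetical stand-ins,
# not any real benchmark or API.

def evaluate_accuracy(model, benchmark):
    """Score a model by exact-match accuracy on a fixed set of items."""
    correct = sum(model(question).strip() == answer
                  for question, answer in benchmark)
    return correct / len(benchmark)

benchmark = [("What is 2 + 2?", "4"),
             ("What is the capital of France?", "Paris")]

# A single aggregate score says nothing about *why* the model succeeds
# (genuine reasoning vs. memorized training data), nor whether success
# generalizes beyond these exact items -- the gap Mitchell highlights.
toy_model = lambda q: "4" if "2 + 2" in q else "Paris"
print(evaluate_accuracy(toy_model, benchmark))  # 1.0
```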

Mitchell also notes a gap in training: as a computer scientist, she never received formal instruction in experimental methodology. Yet running experiments on AI systems is now a core part of evaluating them, and most computer scientists lack this training, which limits how rigorously AI cognition is assessed.

Lessons from Developmental and Comparative Psychology

Developmental and comparative psychologists have extensive experience probing the cognition of nonverbal agents like babies and animals. They use creative and rigorous experimental methods, including carefully controlled experiments and varied stimuli to test robustness. These fields pay close attention to failure modes—understanding why a system fails often reveals more about its workings than successes do.
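What might "varied stimuli" look like for an AI system? One hedged translation, sketched below with an invented `model` callable and a toy paraphrase set, is to pose each underlying problem in several surface forms and report per-item consistency rather than a single aggregate score.

```python
def evaluate_robustness(model, items):
    """Return each item's accuracy across paraphrased variants.

    `items` holds (item_id, variant_prompts, expected_answer) triples.
    A fractional score flags brittleness: the model handles some
    phrasings of a problem but not others.
    """
    report = {}
    for item_id, variants, expected in items:
        hits = sum(model(v).strip() == expected for v in variants)
        report[item_id] = hits / len(variants)
    return report

items = [
    ("addition", ["What is 13 + 29?",
                  "If you add 13 and 29, what do you get?",
                  "Compute the sum of 29 and 13."], "42"),
]

# Success on one phrasing but failure on another suggests the system
# latched onto surface features rather than the underlying capability.
print(evaluate_robustness(lambda prompt: "42", items))  # {'addition': 1.0}
```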

Mitchell shared a classic example from comparative psychology: Clever Hans, a horse believed to perform arithmetic and counting by tapping out answers with its hoof. For years, observers thought Hans was genuinely solving the problems. But the psychologist Oskar Pfungst ran control experiments, fitting the horse with blinders and placing a screen between Hans and the questioner. When Hans could no longer see the questioner, he failed the tasks: he had been responding to subtle, involuntary cues from the questioner, not actually doing math. The episode highlights the importance of considering alternative explanations and staying skeptical, even of one's own hypotheses, a practice Mitchell feels is lacking in AI research.
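The Clever Hans story translates directly into an evaluation recipe: re-run the test with the suspected cue removed and check whether performance collapses. The sketch below is purely illustrative; the model, the dataset, and the "Hint:" cue are all invented.

```python
def strip_cue(question):
    """Remove a suspected giveaway (here, a trailing hint sentence)."""
    return question.split(" Hint:")[0]

def control_experiment(model, dataset):
    """Compare accuracy with and without the suspected cue, much as
    Pfungst screened the questioner from the horse."""
    n = len(dataset)
    with_cue = sum(model(q) == a for q, a in dataset) / n
    without_cue = sum(model(strip_cue(q)) == a for q, a in dataset) / n
    return with_cue, without_cue

dataset = [("What is 6 * 7? Hint: the answer is 42.", "42"),
           ("What is 5 * 9? Hint: the answer is 45.", "45")]

# This toy "model" just parrots the hint, so its accuracy collapses
# once the cue is stripped: a Clever Hans in code.
cheater = lambda q: q.rsplit(" ", 1)[-1].rstrip(".") if "Hint:" in q else "?"
print(control_experiment(cheater, dataset))  # (1.0, 0.0)
```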

Another case study comes from developmental psychology. Researchers showed infants videos of a cartoon character trying to climb a hill while a second character either helped it up or pushed it down, and the babies appeared to prefer the helper. A follow-up study, however, noticed a confound: in the helper videos the climber bounced excitedly at the top of the hill, while in the hinderer videos it did not bounce. When the hinderer videos were altered so that the climber bounced at the bottom, the babies' preferences reversed. They had been responding to the bouncing, not to moral behavior. The example underscores the need to test alternative hypotheses carefully.
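In experimental-design terms, the follow-up study filled in the cells of a factorial design that the original experiment had left half-empty. Here is a minimal sketch, with invented factor names, of enumerating such a design when probing an AI system:

```python
from itertools import product

def build_cells(factors):
    """Cross all factor levels into a full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*factors.values())]

# Hypothesized driver vs. suspected confound. The original hill study
# effectively ran only two of these four cells (helper with bouncing,
# hinderer without), leaving the two explanations entangled.
cells = build_cells({
    "social_role": ["helper", "hinderer"],
    "bouncing": ["yes", "no"],
})
for cell in cells:
    print(cell)
```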

Mitchell emphasizes that skepticism should be a positive trait in AI research. Being skeptical means questioning assumptions and rigorously testing ideas, which is fundamental to good science.

Replication and Scientific Rigor in AI Research

Mitchell also points out that replication is a cornerstone of science but is undervalued in AI research. Replicating experiments and building on others’ work is often seen as lacking novelty and is discouraged by reviewers at major conferences like NeurIPS. This attitude hampers scientific progress because replication and incremental advances are essential for deepening understanding.

On the question of measuring AI's cognitive capabilities, much of the discussion concerns progress toward artificial general intelligence (AGI). Mitchell notes that AGI is a vague concept with many competing definitions, and because it is not well defined, measuring progress toward it is difficult. Our ideas about AGI keep evolving, partly in response to developments in AI itself. Early AI discussions focused on human-level intelligence and robots performing all human physical tasks; later, the focus shifted to the cognitive aspects of intelligence, though Mitchell believes the cognitive and the physical are not easily separable.

She describes herself as constructively skeptical of AGI, a stance that reflects how difficult it is to define, let alone measure, such a broad and shifting concept.

In summary, Melanie Mitchell’s insights suggest that when we ask “are we testing AI’s intelligence the wrong way?” the answer may well be yes. Current evaluation methods often fail to capture the true cognitive capabilities of AI systems. By learning from developmental and comparative psychology, adopting rigorous experimental methods, embracing skepticism, and valuing replication, AI research can develop better ways to understand and measure machine intelligence. This approach could lead to a more accurate and meaningful assessment of what AI systems truly know and can do.

