
People Are Using Super Mario To Benchmark AI Now

jchouinard9


For years, Pokémon has been considered a challenging benchmark for artificial intelligence (AI), but researchers at Hao AI Lab, a group at the University of California San Diego, suggest that Super Mario Bros. might be an even tougher test. On Friday, the lab put several AI models through their paces in live Super Mario Bros. games, and some of the results were unexpected.


While Pokémon’s turn-based mechanics are complex, Super Mario Bros. forces AI to react in real time, making it a significantly more demanding challenge. The lab’s tests revealed that Anthropic’s Claude 3.7 outperformed the competition, with Claude 3.5 coming in second. However, Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to perform well in this fast-paced, decision-making environment.


The Setup: Not Your Average Mario Game

It’s important to note that this wasn’t the original 1985 Super Mario Bros. running on a vintage console. Instead, the game was emulated and integrated with a framework called GamingAgent, developed in-house by Hao AI Lab. This setup allowed the AI models to control Mario in the game.


GamingAgent provided the AIs with basic instructions, such as “If an obstacle or enemy is near, move/jump left to dodge,” along with in-game screenshots. The AIs then generated Python code to control Mario’s movements, which included everything from dodging enemies to jumping over gaps.
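
To make the idea concrete, here is a minimal sketch of the kind of Python a model might emit for one step of play. It is not GamingAgent's actual code or interface: the `Observation` class, the `plan_keys` function, the key names, the hold times, and the 40-pixel "near" threshold are all hypothetical, chosen only to mirror the quoted instruction about dodging a nearby obstacle or enemy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A key press is (key name, hold time in seconds). Both are illustrative.
KeyPress = Tuple[str, float]

NEAR_THRESHOLD_PX = 40  # assumed distance at which a hazard counts as "near"


@dataclass
class Observation:
    """Hypothetical summary of one screenshot; not GamingAgent's real format."""
    # Pixels to the nearest obstacle or enemy, or None if none is visible.
    hazard_distance_px: Optional[int]


def plan_keys(obs: Observation) -> List[KeyPress]:
    """Return the key presses for the next short time slice."""
    if obs.hazard_distance_px is not None and obs.hazard_distance_px <= NEAR_THRESHOLD_PX:
        # Mirrors the quoted instruction: move/jump left to dodge.
        return [("left", 0.3), ("jump", 0.2)]
    # Nothing nearby, so keep advancing through the level.
    return [("right", 0.5)]


if __name__ == "__main__":
    print(plan_keys(Observation(hazard_distance_px=25)))
    print(plan_keys(Observation(hazard_distance_px=None)))
```

In a real setup the returned key presses would still have to be sent to the emulator, and the observation would have to be derived from the screenshot; both steps are left out here.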


Although this setup departs from the classic game, Hao AI Lab says it was enough to force the AI models to plan complex maneuvers and develop gameplay strategies. The researchers also found something unexpected: reasoning models like OpenAI’s o1, which work through problems step by step, performed worse than non-reasoning models, even though they generally score higher on many other benchmarks.


Why Do Reasoning Models Struggle?

The core reason for this disparity comes down to the nature of real-time decision-making in video games. Reasoning models are good at solving problems methodically, but they tend to take several seconds to decide on an action. In Super Mario Bros., that is a major disadvantage. Timing is everything: a seemingly small delay can mean the difference between narrowly avoiding an enemy and falling into a pit.


Super Mario Bros. isn't about carefully planned moves but rather about reacting instantly to situations as they arise. The researchers found that models focused on reasoning struggled to keep up with the rapid pace of the game, unable to make decisions quickly enough to succeed. In contrast, models that didn't rely on step-by-step reasoning fared better because they could act faster.
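
The timing argument above can be put into rough numbers. The sketch below is a back-of-the-envelope illustration, not a measurement from the lab's tests: the 60 frames-per-second rate, the latencies, and the hazard distance and speed are all assumed values, and the function names are made up for this example.

```python
ASSUMED_FPS = 60  # assumption: the emulated game advances about 60 frames per second


def frames_missed(latency_s: float, fps: int = ASSUMED_FPS) -> int:
    """Frames the game advances while the model is still deciding."""
    return round(latency_s * fps)


def acts_in_time(latency_s: float, distance_px: float, closing_speed_px_s: float) -> bool:
    """True if the decision lands before a hazard closing at constant speed arrives."""
    time_to_contact_s = distance_px / closing_speed_px_s
    return latency_s < time_to_contact_s


if __name__ == "__main__":
    # Illustrative latencies: a fast model (0.5 s) and a slow one (3 s).
    for latency in (0.5, 3.0):
        print(
            latency,
            frames_missed(latency),
            acts_in_time(latency, distance_px=60, closing_speed_px_s=60),
        )
```

Under these assumptions, a half-second decision costs 30 frames and still beats a hazard that is one second away, while a three-second decision costs 180 frames and arrives too late.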


AI in Games: More Than Just Fun and Games?

Games have long been used as benchmarks to evaluate AI, but some experts, including Andrej Karpathy, a research scientist and OpenAI founding member, have questioned whether they truly reflect AI’s capabilities in real-world applications. The concern is that games are abstract, often simplified versions of complex tasks that provide a controlled environment for training and testing. Games can also generate an almost infinite amount of training data, which is rarely available in real-world scenarios.


This brings us to what Karpathy refers to as the “evaluation crisis” in AI. He admitted that he’s uncertain about which AI metrics to trust. “TLDR my reaction is I don’t really know how good these models are right now,” he shared in a recent post on X (formerly Twitter).


The Bigger Picture

Despite the uncertainties in measuring AI's true progress, watching AI struggle, and sometimes succeed, in games like Super Mario Bros. is still fascinating. It offers a concrete look at how current models handle real-time decision-making and unpredictable situations.

So, while we may not yet have definitive metrics to gauge AI’s true abilities, one thing is clear: playing Super Mario Bros. is no easy feat, and AI still has a lot to learn about navigating the game world in real time. But at least we get to watch the adventure unfold.





