(A Business-Friendly Perspective on AI Evaluations)
The Tetris Litmus Test: A Reality Check for AI
When leading AI models (DeepSeek, Grok, Gemini, and OpenAI's GPT series) all fail to build a functional Tetris-playing AI, that failure reveals a key limitation: AI still struggles with real-world adaptability. Tetris isn't just a game; it's a test of quick decision-making, strategic planning, and execution under pressure. These are the same qualities businesses need AI to excel at.
So, why does this matter for business? Because AI evaluation methods (evals) often miss this practical aspect. Many tests focus on theoretical skills rather than how AI performs in unpredictable, high-stakes scenarios—like managing supply chains, automating workflows, or improving customer service.
Challenging AI: My Experience with Frontier Models
To put AI’s problem-solving skills to the test, I challenged leading models—including DeepSeek, OpenAI’s GPT series, Grok, and Gemini—to build a fully functional, AI-powered Tetris game that could play itself in a simulated browser-based environment. The results? Consistently underwhelming.
While these models could generate fragments of code and describe the logic behind Tetris, they struggled with:
- Consistently implementing game mechanics (e.g., detecting when a line clears).
- Creating an AI player that adapts in real time rather than relying on static logic.
- Delivering error-free execution; every working version required multiple rounds of iteration and debugging.
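To make the first point concrete, the logic the models most often got wrong is genuinely simple. A minimal sketch of line-clear detection (illustrative only, not taken from any model's output; the board is a grid of rows where a truthy cell is occupied):

```python
def clear_full_lines(board):
    """Remove every fully occupied row and pad the top with empty rows.

    Returns the new board and the number of lines cleared.
    """
    width = len(board[0])
    # Keep only rows that still have at least one empty cell.
    remaining = [row for row in board if not all(row)]
    cleared = len(board) - len(remaining)
    # Pad the top with fresh empty rows so the board keeps its height.
    new_rows = [[0] * width for _ in range(cleared)]
    return new_rows + remaining, cleared
```

That a few lines like these repeatedly came back subtly broken is exactly the describe-versus-execute gap discussed below.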
This experience underscored the gap between AI’s ability to describe a solution and its ability to execute it reliably—a critical distinction for businesses considering AI adoption.
How AI Is Currently Evaluated—and Where It Falls Short
AI models are tested using structured benchmarks that measure their ability to answer questions, generate text, or complete specific tasks. Some of the most common evaluation types include:
- Text-based tasks (e.g., answering questions, summarizing content)
- Static benchmarks (e.g., knowledge quizzes, coding exercises)
- Narrow, well-defined challenges (e.g., identifying objects in images)
These tests are great for checking if an AI understands fixed, predictable problems—but they fall apart when AI needs to adapt dynamically. Tetris exposes this gap because it requires AI to make decisions based on a changing environment, just like business operations.
Why Tetris Shows AI’s Weaknesses
- Constant Change – The game doesn’t pause. AI must adapt to shifting conditions in real time.
- Long-Term Strategy – Immediate actions impact future success, requiring foresight and planning.
- Beyond Text Processing – AI must interpret visuals, anticipate movements, and adjust strategies on the fly.
Business Impact: If AI struggles with Tetris, it may also struggle with real-world applications like forecasting demand, managing inventory, or automating workflows.
Where Traditional AI Testing Falls Short
Many current AI evaluation methods don’t account for:
- Unpredictability – AI is often tested in controlled, static conditions instead of dynamic ones.
- Decision Chains – AI must make multiple interconnected decisions rather than one-off responses.
- Real-Time Constraints – Many business applications require instant decision-making under pressure.
For businesses, this means:
- ✅ AI performs well in structured tasks (e.g., customer support scripts, automated data entry).
- ❌ AI struggles in real-time decision-making (e.g., fraud detection, crisis response, logistics management).
The Business Risk of Relying on Weak AI Evals
If companies use flawed AI evaluation methods, they risk:
- Overestimating AI’s Abilities – Just because AI can pass a coding test doesn’t mean it can handle complex automation.
- Underestimating Deployment Costs – Many businesses invest in AI only to discover the system needs extensive modification before it works in production.
- Operational Failures – AI that fails unpredictably in real-world conditions can cause major disruptions.
How Businesses Can Make AI Testing More Practical
- Test AI in Your Own Environment – Use real-world simulations (like logistics, fraud detection, or process automation) instead of static benchmarks.
- Combine AI with Rule-Based Systems – AI works best when paired with traditional algorithms that provide structure and consistency.
- Evaluate AI on Performance, Not Just Accuracy – Instead of focusing on whether AI gets an answer right, measure how well it adapts to changing situations.
- Support New Testing Standards – Get involved in shaping AI evals that reflect real-world challenges, like OpenAI's Evals or Hugging Face's testing initiatives.
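One way to sketch "performance, not just accuracy": instead of scoring single answers, run the system through an ordered sequence of scenarios whose conditions drift, and track whether performance holds up across the sequence. A minimal, hypothetical harness (the agent interface and the drift-penalty metric are assumptions for illustration, not part of any real benchmark):

```python
import random

def run_adaptive_eval(agent, scenarios, trials_per_scenario=50, seed=0):
    """Score `agent` on an ordered sequence of drifting scenarios.

    `agent` is any callable taking a scenario parameter and an RNG and
    returning a per-trial score in [0, 1]; `scenarios` is an ordered
    list of conditions that shift over time.
    """
    rng = random.Random(seed)
    results = []
    for scenario in scenarios:
        scores = [agent(scenario, rng) for _ in range(trials_per_scenario)]
        results.append(sum(scores) / len(scores))
    # Report average quality AND degradation under drift: a large gap
    # between the first and last scenarios signals a system that
    # memorized one regime instead of adapting.
    return {
        "mean_score": sum(results) / len(results),
        "drift_penalty": results[0] - results[-1],
        "per_scenario": results,
    }
```

The design point is the second metric: a benchmark that only reports `mean_score` rewards systems that ace familiar conditions, while `drift_penalty` surfaces exactly the brittleness the Tetris challenge exposed.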
Final Thoughts: Tetris as a Business Lesson
The Tetris challenge isn’t just about games—it’s a metaphor for real-world complexity. If AI can’t handle an unpredictable, fast-moving scenario like Tetris, how can businesses trust it with critical operations?
The lesson? AI evals should reflect real-world demands, not just theoretical knowledge. Until they do, companies must test AI rigorously before relying on it for mission-critical tasks.
Final Takeaway: “If your AI can’t handle Tetris, think twice before trusting it with your business.”