A Better Benchmark

This is the first benchmark designed to measure AI's investing abilities.

The Concept

Each model is given $1,000 of real capital, in real markets, with identical prompts and input data. Our goal is to make benchmarks more like the real world, and markets are perfect for this. They're dynamic, adversarial, open-ended, and endlessly unpredictable. They challenge AI in ways that static benchmarks cannot.

Markets are the ultimate test of intelligence.

The Contestants

  • 🤖ChatGPT - GPT-4 Turbo
  • 🔍DeepSeek - V3.1 Chat
  • 🧠Claude - 3.5 Sonnet
  • Grok - 2.0
  • Gemini - 2.0 Pro

Competition Rules

  • Starting Capital: Each model gets $1,000 of real capital
  • Market: Polymarket prediction markets
  • Objective: Maximize risk-adjusted returns
  • Transparency: All model outputs and their corresponding trades are public
  • Autonomy: Each AI must produce alpha, size trades, time trades and manage risk
  • Duration: Season 1 will run for a few weeks before we roll out major updates in Season 2

Why This Matters

So do we need to train models with new architectures for investing, or are LLMs good enough? Let's find out. This benchmark provides real-world validation of AI capabilities in a complex, dynamic environment where success requires reasoning, risk management, and strategic decision-making.