Newsletter Subscribe
Enter your email address below and subscribe to our newsletter
Enter your email address below and subscribe to our newsletter
This article contains affiliate links. We may earn a commission at no extra cost to you. Full disclosure.
Framework for tracking AI breakthroughs, funding rounds, and policy changes — stay ahead of the curve.
The most significant shift in the 2024 AI model market wasn’t a leap in raw intelligence — it was a brutal price war. Between February and June, the cost of processing one million tokens through a top-tier API dropped by over 60%, while standard benchmark scores like MMLU improved by less than 5 percentage points year-over-year. This year’s releases prioritized multimodal speed, longer context windows, and open-weight accessibility over chasing ever-higher reasoning scores. The result is a fragmented landscape where the “best” model depends heavily on your budget, latency tolerance, and willingness to self-host. Below, I break down the five most consequential releases — GPT‑4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3, and Mistral Large — with hard numbers on performance, pricing, and real-world trade-offs. I’ll separate what the papers actually show from what the marketing decks imply.
OpenAI’s GPT‑4o, released in May 2024, is the company’s first natively multimodal model — it processes text, images, and audio through a single transformer, not separate modules. The headline claim is a 2× speed improvement over GPT‑4 Turbo, but the benchmark scores tell a more modest story. On MMLU (5‑shot), GPT‑4o scores 88.7%, barely 0.3 points above GPT‑4 Turbo’s 88.4%. On HumanEval, it reaches 90.2%, a solid but not revolutionary improvement over the 87.0% of its predecessor. On MATH, it hits 76.6%, up from 72.6% for GPT‑4 Turbo.
The real value of GPT‑4o lies in its unified API. You can send an image and ask for a description in the same call, with latency roughly halved compared to chaining separate vision and text models. Pricing dropped to $5 per million input tokens and $15 per million output tokens — half the cost of GPT‑4 Turbo. Context length remains 128K tokens. Training compute is undisclosed, but estimates based on the model’s size (likely around 1.8 trillion parameters under a mixture‑of‑experts architecture) put it at roughly 2×1025 FLOPs. The caveat: the multimodal speed boost is real, but for pure text reasoning tasks, you’re paying a premium for a feature you may not use.
Anthropic’s Claude 3.5 Sonnet, launched in June 2024, matches GPT‑4o on MMLU at 88.7% but pulls ahead on coding benchmarks. It scores 92.0% on HumanEval and 78.5% on MATH, making it the strongest coder in this lineup. Context length is 200K tokens — a 50% increase over GPT‑4o — and input pricing is lower at $3 per million tokens (output stays at $15 per million). The model is also faster than its predecessor Claude 3 Opus, with a 2× latency improvement.
Top-rated VPN for online privacy and security. Lightning-fast servers.
Affiliate link
Anthropic’s emphasis on constitutional AI and safety training means Claude 3.5 Sonnet frequently refuses requests that other models handle — a double‑edged sword. In my testing, it refused to generate a simple Python script that could be misused for web scraping, while GPT‑4o complied without hesitation. The safety filters are more aggressive than any other provider’s, which can frustrate developers working on legitimate automation tasks. Parameter count is undisclosed, but industry estimates place it around 200 billion parameters (dense). Training compute is also unknown, but the model’s efficiency suggests a smaller footprint than GPT‑4o.
For teams prioritizing code generation and security compliance, Claude 3.5 Sonnet is the best option. But expect more rejections than with GPT‑4o, especially for tasks involving data extraction or content generation in sensitive domains.
Google DeepMind’s Gemini 1.5 Pro, released in February 2024, introduced a 1‑million‑token context
Framework for tracking AI breakthroughs, funding rounds, and policy changes — stay ahead of the curve.
No spam. Unsubscribe anytime.