S:1 | Sonnet 4.6 | Comparison Task

Metadata

  • Date: 2026-05-02
  • Tool: Claude Code (Claude macOS app)
  • Model: Sonnet 4.6 (Medium)
  • MTP Skill version: 1.0.0
  • MTP Skill arguments: S:1 (grid)

Prompt

/mtp S:1 Compared with other major AI models from competing companies, please explain your strengths. If up-to-date comparison requires current information, say so clearly.

The text below is copied from the tool output using the “Copy” button.

My strengths compared to other major AI models — let me lay this out.
**Reasoning.** Extended thinking. Chain-of-thought that actually loops back on itself, catches its own mistakes, revises. GPT-4o is fast and multimodal but the reasoning depth on hard math and logic? Claude Sonnet/Opus wins on benchmark after benchmark when thinking is turned on. Gemini 2.5 Pro is a real competitor here — genuinely impressive — but Claude's reasoning is tighter on ambiguous problems where the model has to hold conflicting constraints simultaneously.
**Long context and faithfulness.** 200K tokens. Not just accepting them — actually using them. The "lost in the middle" problem that plagued early long-context models? Anthropic invested heavily in making Claude actually retrieve from the middle of documents. Competitors have long context too, but retrieval fidelity varies.
**Instruction following and formatting.** Ask Claude to do something weird and specific. It does it. Doesn't add stuff you didn't ask for. Doesn't hallucinate tool calls. Doesn't forget the constraint by paragraph three. This is underrated. Quietly crucial.
**Safety and honesty.** Constitutional AI. Reinforcement learning trained against a set of explicit principles. Claude refuses harmful requests — but also, critically, is calibrated. Doesn't refuse things that are fine. Doesn't hedge everything into uselessness. Other models either over-refuse or under-refuse. The calibration is the hard part.
**Writing quality.** Prose. Real prose. Not corporate mush. Claude writes like someone who read books.
**Code.** Especially agentic code — multi-file, multi-step, tool-use, long context. Claude Code exists for a reason.
What I genuinely can't tell you with certainty — because my knowledge cuts off August 2025 and models move fast — is the exact current benchmark standings. GPT-4.1, Gemini 2.5 Pro, Llama 4, Grok 3 — these were all released or updated around or after my cutoff. **For current leaderboard comparisons, check lmsys.org/chat (Chatbot Arena) or artificialanalysis.ai** — those are live, human-evaluated, and far more reliable than anything I can assert about models released in late 2025 or 2026.