Terminal-Bench Leaderboard

Note: submissions must use terminal-bench-core==0.1.1 and be run with:

    tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"
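For example, a run of the Claude Code agent against claude-4-sonnet might look like the line below. The quoted identifiers are illustrative placeholders, not confirmed registry names; consult the submission guide or `tb run --help` for the exact agent and model names the harness accepts.

    # illustrative agent/model identifiers; check the submission guide for exact names
    tb run -d terminal-bench-core==0.1.1 -a "claude-code" -m "claude-4-sonnet"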
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---:|---|---|---|---|---|---|
| 1 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 2 | Engine Labs | claude-4-sonnet | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 0.8 |
| 3 | Claude Code | claude-4-opus | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 4 | Goose | claude-4-opus | 2025-07-12 | Block | Anthropic | 42.0% ± 1.3 |
| 5 | OpenHands | claude-4-sonnet | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 0.7 |
| 6 | Claude Code | claude-4-sonnet | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 7 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 8 | Goose | claude-4-sonnet | 2025-07-12 | Block | Anthropic | 34.3% ± 1.0 |
| 9 | Terminus | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 10 | Terminus | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 11 | Terminus | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 12 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 13 | Terminus | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 14 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 15 | Terminus | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 16 | Terminus | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 17 | Terminus | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 18 | Terminus | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 19 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 20 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 21 | Terminus | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 22 | Terminus | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |

All results in this leaderboard were produced with terminal-bench-core==0.1.1.

Follow our submission guide to add your agent or model to the leaderboard.