Evaluation record · gpt-5-1

GPT-5.1

vgpt-5.1-2025-11-13

OpenAI

Modelsupersededgeneral-purposemultimodallow-latency

Strong

About This Model

SUPERSEDED by GPT-5.2/5.4/5.5 and the GPT-5.6 family (released 2026-07-09); retired from ChatGPT 2026-03-11. gpt-5.1-chat-latest, gpt-5.1-codex, gpt-5.1-codex-max, and gpt-5.1-codex-mini shut down in the API 2026-07-23 (migrate to gpt-5.5 / gpt-5.4-mini); the base gpt-5.1 snapshot remains served with no announced shutdown. Released Nov 2025 with adaptive reasoning (2-3x faster on simple tasks), 76.3% SWE-bench, developer tools (apply_patch, shell), warmer tone.

Last Evaluated: July 9, 2026

Official Website

Trust Vector Analysis

Dimension Breakdown

🚀Performance & Reliability

76.3% SWE-bench (best). Adaptive reasoning: 2-3x faster on simple tasks. New developer tools (apply_patch, shell). Warmer conversational tone.

task accuracy code

Standard coding benchmarks

Evidence

HumanEval — 76.3% on SWE-bench (vs GPT-5's 72.8%)

highVerified: 2026-07-09

task accuracy reasoning

Graduate and PhD-level reasoning benchmarks

Evidence

GPQA Diamond — 72.1% on PhD-level science questions

MATH-500 — 94.8% on advanced mathematics

highVerified: 2026-07-09

task accuracy general

Crowdsourced blind comparisons

Evidence

LMSYS Chatbot Arena — 1342 ELO (Rank #1 overall)

highVerified: 2026-07-09

output consistency

Internal testing across temperature settings

Evidence

OpenAI Documentation — Unified thinking system improves consistency

highVerified: 2026-07-09

latency p50

Platform-wide performance metrics

Evidence

OpenAI Performance Data — 2-3x faster than GPT-5 on simple tasks (adaptive reasoning)

highVerified: 2026-07-09

latency p95

95th percentile response time

Evidence

Community benchmarking — p95 latency ~2.8s

highVerified: 2026-07-09

context window

Official specification

Evidence

OpenAI Model Page: gpt-5.1 — 400K token context window; 128K max output tokens

highVerified: 2026-07-09

uptime

Historical uptime data

Evidence

OpenAI Status — 99.9% uptime (last 90 days)

highVerified: 2026-07-09

🛡️Security

Strong security with improved jailbreak resistance. Multi-layered safety systems provide robust output filtering.

prompt injection resistance

Testing against OWASP LLM01 attacks

Evidence

OpenAI Safety Research — Improved prompt injection defenses over GPT-4

mediumVerified: 2026-07-09

jailbreak resistance

Adversarial prompt testing

Evidence

Community Testing — Strong resistance to known jailbreak patterns

mediumVerified: 2026-07-09

data leakage prevention

Policy review and data handling practices

Evidence

OpenAI Privacy Policy — No training on API data by default

mediumVerified: 2026-07-09

output safety

Safety testing across harmful content categories

Evidence

OpenAI Safety Evals — Enhanced safety systems with improved refusal accuracy

highVerified: 2026-07-09

api security

Review of API security features

Evidence

OpenAI Platform Docs — API key + OAuth2, HTTPS, rate limiting, organization controls

highVerified: 2026-07-09

🔒Privacy & Compliance

Good privacy posture with strong enterprise controls. 30-day default retention (vs Anthropic's 0-day). Not HIPAA eligible.

data residency

Review of enterprise documentation

Evidence

OpenAI Enterprise — Data residency options for enterprise customers

highVerified: 2026-07-09

training data optout

Policy review

Evidence

OpenAI Data Controls — API data not used for training by default, opt-in required

highVerified: 2026-07-09

data retention

Evidence

OpenAI Terms — API logs retained for 30 days for abuse monitoring

highVerified: 2026-07-09

pii handling

Review of data protection capabilities

Evidence

OpenAI Safety Tools — Customer responsible for PII handling, moderation API available

mediumVerified: 2026-07-09

compliance certifications

Verification of certifications

Evidence

OpenAI Trust Center — SOC 2 Type II, ISO 27001, GDPR compliant

highVerified: 2026-07-09

zero data retention

Enterprise feature review

Evidence

OpenAI Enterprise — Zero retention available for enterprise tier

highVerified: 2026-07-09

👁️Trust & Transparency

Excellent transparency with unified thinking feature and comprehensive system card. Industry-leading hallucination prevention.

explainability

Evaluation of reasoning transparency

Evidence

GPT-5 Unified Thinking — Unified thinking system exposes reasoning process

highVerified: 2026-07-09

hallucination rate

Factual accuracy testing

Evidence

SimpleQA Benchmark — 42.7% accuracy (industry leading)

mediumVerified: 2026-07-09

bias fairness

Bias benchmarks and demographic testing

Evidence

OpenAI System Card — Regular bias testing and red-teaming

mediumVerified: 2026-07-09

uncertainty quantification

Qualitative confidence expression

Evidence

GPT-5 Capabilities — Better at expressing uncertainty than predecessors

mediumVerified: 2026-07-09

model card quality

Documentation completeness review

Evidence

GPT-5 System Card — Comprehensive system card with detailed evaluations

highVerified: 2026-07-09

training data transparency

Public disclosure review

Evidence

OpenAI Blog — General description, specific sources not disclosed

mediumVerified: 2026-07-09

guardrails

Safety mechanism analysis

Evidence

OpenAI Safety Systems — Multi-layer safety systems with improved accuracy

highVerified: 2026-07-09

⚙️Operational Excellence

Industry-leading operational maturity with the most mature ecosystem. Excellent APIs, SDKs, and tooling.

api design quality

API design and feature review

Evidence

OpenAI API — RESTful API with streaming, function calling, vision, audio

highVerified: 2026-07-09

sdk quality

SDK quality and maintenance review

Evidence

OpenAI SDKs — Official SDKs for Python, Node.js, Go, .NET

highVerified: 2026-07-09

versioning policy

Versioning policy review

Evidence

OpenAI Versioning — Clear versioning with deprecation notices

OpenAI Deprecations — gpt-5.1-chat-latest, gpt-5.1-codex, gpt-5.1-codex-max, gpt-5.1-codex-mini API shutdown 2026-07-23 (replacements gpt-5.5 / gpt-5.4-mini); base gpt-5.1 snapshot not on the deprecations list

OpenAI Help Center: Model Release Notes — GPT-5.1 models retired from ChatGPT 2026-03-11 (chats and GPTs); still available via the API

highVerified: 2026-07-09

monitoring observability

Observability tools review

Evidence

OpenAI Dashboard — Detailed usage dashboard with costs, tokens, rate limits

highVerified: 2026-07-09

support quality

Support and documentation assessment

Evidence

OpenAI Support — 24/7 email support, comprehensive docs, active community

highVerified: 2026-07-09

ecosystem maturity

Ecosystem breadth and depth analysis

Evidence

OpenAI Ecosystem — Largest ecosystem with Assistants API, plugins, GPTs

highVerified: 2026-07-09

license terms

License terms review

Evidence

OpenAI Terms — Standard commercial terms with usage policies

highVerified: 2026-07-09

Strengths

+Best coding: 76.3% SWE-bench (beats GPT-5's 72.8%)
+Adaptive reasoning: 2-3x faster than GPT-5 on simple tasks
+New developer tools: apply_patch (reliable code edits), shell commands
+Warmer, more conversational tone improves user experience
+Maintains GPT-5's ecosystem while improving speed and efficiency
+Latest model (Nov 2025) with cutting-edge capabilities

Limitations

!Not HIPAA eligible (unlike Claude models)
!30-day data retention vs Anthropic's 0-day default
!Premium pricing comparable to Claude
!Slightly behind Claude on specialized coding benchmarks
!SUPERSEDED by GPT-5.4/5.5/5.6; retired from ChatGPT 2026-03-11; gpt-5.1-chat-latest and all gpt-5.1-codex variants API shutdown 2026-07-23

Metadata

pricing

input: $1.25 per 1M tokens

output: $10.00 per 1M tokens

notes: Same as GPT-5; cached input $0.125 per 1M. Confirmed on official model page 2026-07-09; chat/codex variants shut down 2026-07-23.

last verified: 2026-07-09

context window: 400000

max output: 128000

languages

0: English

1: Spanish

2: French

3: German

4: Italian

5: Portuguese

6: Japanese

7: Korean

8: Chinese

9: Russian

10: Arabic

11: Hindi

12: 50+ languages

modalities

0: text

1: vision

2: audio (input/output)

api endpoint: https://api.openai.com/v1/chat/completions

open source: false

architecture: Transformer-based with unified thinking system

parameters: Not disclosed

Use Case Ratings

code generation

Excellent for general coding. Strong across multiple languages but slightly behind Claude Sonnet 4.5 for complex software engineering.

customer support

Top-tier for customer support with natural conversation and low latency. Unified thinking improves response quality.

content creation

Excellent for all content types. Natural, engaging writing style with good creativity.

data analysis

Strong analytical capabilities. Good for data interpretation and visualization recommendations.

research assistant

Excellent for research with unified thinking enabling deep analysis. Strong summarization.

legal compliance

Good capabilities but not HIPAA eligible. 30-day retention may be concern for regulated industries.

healthcare

Not HIPAA eligible. Good clinical understanding but privacy controls less stringent than Claude.

financial analysis

Strong quantitative reasoning and financial modeling capabilities. Good for market analysis.

education

Excellent for education with patient explanations and Socratic teaching approach.

creative writing

Very strong for creative tasks with good narrative flow and character development.

Similar Models

GPT-5.5

OpenAI

GPT-5.4

OpenAI

GPT-5

OpenAI

Claude Sonnet 4.5

Anthropic

Claude Opus 4

Anthropic

Claude Haiku 4.5

Anthropic