Model Rating Report

Last Updated: August 11, 2025

Overview

GPT-o3 and o4

GPT-o3 and -GPT-o4 mini are designed to reason for longer before responding. The models are able to agentically interact with all tools currently available to ChatGPT. This includes searching the internet, analysing uploaded files and visual inputs, and generating images.

Developer

OpenAI

Country of Origin

USA

Systemic Risk

Open Data

Open Weight

API Access Only

Ratings

Overall Transparency

47%

Data Transparency

22%

Model Transparency

18%

Evaluation Transparency

59%

EU AI Act Readiness

46%

CAIT-D Readiness

40%

Transparency Assessment

The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation lists in Sources above. While external analysis may contain additional details about this system, our goal is to evaluate transparency of the providers themselves.

Sources

Press Release: https://openai.com/index/o3-o4-mini-system-card/
System Card: https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
Introduction: https://openai.com/index/introducing-o3-and-o4-mini/

Basic Details

Date of Release

16 April 2025

Methods of Distribution

The models can be accessed through the ChatGPT API

Modality

The models can take text, files, code, and images as input. They are able to access tools within the ChatGPT catalogue (web browsing, Python, image analysis and generation etc) and produce text.

Input and Output Format

The context window for this model is 200,000 tokens and it can output a maximum of 100,000 tokens.

License

Proprietary

Instructions for Use

Instructions for using this model can be found both on the OpenAI website and on the ChatGPT page.

Documentation Support

Low Transparency

The documentation is clear and easy to read. Most key information is available but too general. Some information is missing entirely, such as details on the system architecture or training processes.

Changelog

You can find the changelog here; however, it may not contain all the details related to minor changes.

Report Issue / Feedback

Policy

Acceptable Use Policy

Usage Policies are available on Open AI's website.

User Data

User data is used to train ChatGPT models. This includes log data (e.g. your IP address), usage data, device information, location information, and cookies.

Data Takedown

You can find how to opt out of model training and remove your data here

AI Ethics Statement

OpenAI describe their principles in their OpenAI Charter.

Incident Reporting

ChatGPT has a reporting feature that you can use to give feedback and report incidents. You can find more information here.

Report Issue / Feedback

Model and Training

Task Description

Medium Transparency

The models are capable of responding to text, image, code, and file input. They have access to the full range of ChatGPT tools, including searching the internet, image analysis and generation, and Python.

Number of Parameters

No explanation provided for this rating.

Model Design

Unknown

Not explicitly stated in the provided documents.

Training Methodology

Low Transparency

OpenAI reasoning models are trained using reinforcement learning on chains of thought to encourage reasoning.

Computational Resources

No explanation provided for this rating.

Energy Consumption

No explanation provided for this rating.

System Architecture

No explanation provided for this rating.

Training Hardware

No explanation provided for this rating.

Report Issue / Feedback

Data

Dataset Size

No explanation provided for this rating.

Dataset Description

Low Transparency

The two models were trained on diverse datasets. This included information that is available publicly online, information from third parties, and information from users, human trainers and researchers. Data is pre-processed to maintain quality and mitigate potential risks.

Data Sources

Low Transparency

The data used to train these models was sourced from third parties, publicly available information on the internet, and from ChatGPT users, researchers, and human trainers.

Data Collection - Human Labor

Low Transparency

Human labour is used in the production of this data. This includes data produced and generated by researchers and human trainers.

Data Preprocessing

Low Transparency

Data was filtered to maintain quality and mitigate a series of identified risks. Personal information was reduced from the training data. Moderation API and safety classifiers were also used to help prevent the use of harmful or sensitive content, including explicit material.

Data Bias Detection

Unknown

No explanation provided for this rating.

Data Deduplication

No explanation provided for this rating.

Data Toxic and Hateful Language Handling

No explanation provided for this rating.

IP Handling in Data

No explanation provided for this rating.

Data PII Handling

Personal information is removed from the dataset through pre-training filtering.

Data Collection Period

No explanation provided for this rating.

Report Issue / Feedback

Evaluation

Performance Evaluation

Medium Transparency

The models are tested against a variety of safety and performance evaluations. These include in-house evaluations, third-party evaluations by groups including Apollo Research and METR, PersonQA, PaperBench, SWE-Lancer, and MMLU. The results of each evaluation are listed in the system card and compared to other OpenAI models. Many of these benchmarks are reported with clear explanations for how and why the evaluation was conducted, but this is not the case for all of the evaluations.

Evaluation of Limitations

Medium Transparency

The models can hallucinate, be jailbroken (i.e. prompted to produce inappropriate content) and produce incorrect refusals. Hallucinations are a significant concern with the models hallucinating 33% and 48% percent of the time on PersonQA (a benchmark that asks questions about public figures). Detailed results related to these limitations are reported in the System Card.

Evaluation with Public Tools

No explanation provided for this rating.

Adversarial Testing Procedure

Medium Transparency

The model is tested extensively for safety risks. This includes through jailbreak testing to evaluate the robustness of the model using adversarial prompts. These jailbreaks are either human sourced or they are from the StrongReject database. Other safety risks are evaluated, including harmful image generation, production of disallowed content, hallucinations, and bias, among others. The measures taken to prevent each risk, the evaluations used to test them, and the results of each evaluation are included in the system card.

Model Mitigations

Medium Transparency

The model mitigations included post-training to teach the model about refusal behavior for harmful requests and using moderation models for the most egregious content. The final models are tested for a variety of safety risks including fairness and bias, personal identification, and deception by both OpenAI and third parties.

Report Issue / Feedback

Model Rating Report

Overview

GPT-o3 and o4

GPT-o3 and o4

Transparency Assessment

Basic Details

Date of Release

EU AI Act Requirements

CAIT-D Requirements

Methods of Distribution

EU AI Act Requirements

Modality

EU AI Act Requirements

Input and Output Format

EU AI Act Requirements

License

EU AI Act Requirements

Instructions for Use

EU AI Act Requirements

Documentation Support

Rating Guide

Changelog

Policy

Acceptable Use Policy

EU AI Act Requirements

User Data

Data Takedown

AI Ethics Statement

Incident Reporting

Model and Training

Task Description

Trustible Rating Explanation

EU AI Act Requirements

Rating Guide

Number of Parameters

EU AI Act Requirements

Model Design

EU AI Act Requirements

Rating Guide

Training Methodology

Trustible Rating Explanation

EU AI Act Requirements

Rating Guide

Computational Resources

EU AI Act Requirements

Energy Consumption

EU AI Act Requirements

System Architecture

EU AI Act Requirements

Training Hardware

Data

Dataset Size

EU AI Act Requirements

CAIT-D Requirements

Dataset Description

Trustible Rating Explanation

EU AI Act Requirements

CAIT-D Requirements

Rating Guide

Data Sources

Trustible Rating Explanation

EU AI Act Requirements

CAIT-D Requirements

Rating Guide

Data Collection - Human Labor

Trustible Rating Explanation

Rating Guide

Data Preprocessing

Trustible Rating Explanation

EU AI Act Requirements

CAIT-D Requirements

Rating Guide

Data Bias Detection

EU AI Act Requirements

Rating Guide

Data Deduplication

Data Toxic and Hateful Language Handling

IP Handling in Data

CAIT-D Requirements

Data PII Handling