Model Rating Report

Last Updated: May 12, 2025

Overview

Gemini 2.5

Gemini 2.5 is a thinking model, capable of reasoning through its thoughts before responding, which yields enhanced performance and improved accuracy. The Pro version has the most advanced reasoning capabilities, while the Flash model offers well-rounded capabilities and strikes a balance between performance and price.

Developer: Google

Country of Origin: USA

Systemic Risk · Open Data · Open Weight · API Access Only

Ratings

Overall Transparency: 48%
Data Transparency: 12%
Model Transparency: 51%
Evaluation Transparency: 37%
EU AI Act Readiness: 45%
CAIT-D Readiness: 20%

Transparency Assessment

The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in the Sources above. While external analyses may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.

Basic Details

Release Date: March 24, 2025 (Gemini 2.5 Pro); April 17, 2025 (Gemini 2.5 Flash)

Availability: Available through the Gemini API via Google AI Studio and Vertex AI, and in a dedicated dropdown in the Gemini app. (A minimal API call is sketched at the end of this section.)

Modalities: Multimodal. The model can process and understand multiple data types, including text, images, audio, PDF, and video. Outputs are text.

Context Window: The model has an input token limit of 1,048,576 and an output limit of 64K tokens. Additional detailed information on image, video, and text inputs is available here.

License: Proprietary

Detailed documentation is available in the API Docs.

Documentation is available for the following: Gemini API reference, Gemini SDK reference, Getting started guides, Testing multimodal prompts, Designing multimodal prompts, Introduction to prompt design, as well as some Code examples.

Changelog: https://ai.google.dev/gemini-api/docs/changelog
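
For reference, here is a minimal sketch of calling the model through the Gemini API, assuming the google-genai Python SDK and an API key in the GEMINI_API_KEY environment variable; the prompt and parameter values are illustrative, not taken from the documentation:

    from google import genai
    from google.genai import types

    # The client reads the API key from the GEMINI_API_KEY environment variable.
    client = genai.Client()

    response = client.models.generate_content(
        model="gemini-2.5-flash",                  # or "gemini-2.5-pro"
        contents="Summarize this model's documented capabilities in two sentences.",
        config=types.GenerateContentConfig(
            max_output_tokens=1024,                # must stay under the 64K output limit
        ),
    )
    print(response.text)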


Policy

You can find the terms of service for Gemini here.

Gemini doesn't use user prompts or its responses as data to train its models. (Source)

You can review and remove your conversation data at myactivity.google.com/product/gemini. Note that conversations which have already been reviewed and annotated by human reviewers (to train Google's machine learning models) are not deleted by this process; they are stored separately and are not connected to your Google Account. You can request the removal of content from Google's training data at https://support.google.com/gemini?p=geminipntosremoval_request; the request must be in line with Google's policies or applicable laws.

Google provides their AI Principles statement on their website. Their approach involves bold innovation, responsible development and deployment, and collaborative progress. For more information, see the page here.

You can submit feedback and report problems with the responses provided by Gemini here.


Model and Training

Gemini 2.5 models are designed as "thinking models", capable of reasoning through their thoughts before responding. This Chain-of-Thought (CoT) design helps them tackle increasingly complex problems with enhanced performance and improved accuracy.
The main limitations are over-refusal and tone (the model may adopt an overly preachy tone).
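
As a concrete illustration of the "thinking" behavior, the Gemini API exposes a per-request thinking budget on 2.5 models. A minimal sketch, assuming the google-genai Python SDK; the budget value and prompt are arbitrary:

    from google import genai
    from google.genai import types

    client = genai.Client()  # expects GEMINI_API_KEY in the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="A bat and a ball cost $1.10 in total; the bat costs $1.00 "
                 "more than the ball. What does the ball cost?",
        config=types.GenerateContentConfig(
            # Cap the tokens the model may spend on internal reasoning;
            # 0 disables thinking on Flash, larger budgets allow deeper reasoning.
            thinking_config=types.ThinkingConfig(thinking_budget=512),
        ),
    )
    print(response.text)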

Parameters (externally reported): 128B base + 12B verifier.

According to an external article, Gemini 2.5 uses a hybrid MoE-Transformer design with a 128B-parameter Mixture of Experts (MoE) and 16 experts activated per query; dynamic expert selection is based on problem complexity. A Chain-of-Thought Verifier (a 12B-parameter sub-model) critiques and refines outputs through a 7-stage internal debate. Dynamic Computation Allocation assigns 3-5x more compute to complex queries, and Hierarchical Memory prioritizes the most critical sections of long contexts.
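
To make the routing claim concrete, here is a toy sketch of top-k expert routing in Python. This is our illustration only: the sizes, the gating function, and the helper route_top_k are invented for exposition, not taken from the article or from Google.

    import numpy as np

    def route_top_k(token, gate_weights, k):
        """Score every expert for one token and activate only the top k."""
        scores = gate_weights @ token                  # one score per expert
        top = np.argsort(scores)[-k:]                  # indices of the k highest-scoring experts
        exp = np.exp(scores[top] - scores[top].max())  # softmax over the selected experts
        return top, exp / exp.sum()

    rng = np.random.default_rng(0)
    n_experts, dim = 16, 64                            # sizes chosen arbitrarily
    token = rng.normal(size=dim)
    gate_weights = rng.normal(size=(n_experts, dim))
    experts, weights = route_top_k(token, gate_weights, k=2)
    print(experts, weights)                            # chosen expert ids and mixture weights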

While some information is available about the type of training data and compute required for training, the specific training methodology is not stated in the provided documents.

Training Compute: 5.6×10²⁵ FLOPs (2.3× that of Gemini 2.0, implying roughly 2.4×10²⁵ FLOPs for Gemini 2.0)

No explanation provided for this rating.

No explanation provided for this rating.

The models were trained using Google's Tensor Processing Units (TPUs).


Data

85T tokens of text, 2.1B images, and 400M hours of audio.

Not explicitly stated in the provided documents.

Not explicitly detailed in the provided documents.

Not mentioned in the provided documents.

Not mentioned in the provided documents.

Gemini 2.5's documentation claims that techniques are used to find and fix biases, applied to both the training data and the model design. The stated goal is fair and unbiased AI, but no information is provided documenting how this is done.

The Model Card states that data deduplication techniques were applied.
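
The Model Card does not say which deduplication method was used. As one common baseline, exact-match deduplication can be done by hashing normalized text; a minimal sketch, where dedup_exact is a hypothetical helper:

    import hashlib

    def dedup_exact(docs):
        """Keep the first occurrence of each document, comparing normalized text by hash."""
        seen, unique = set(), []
        for doc in docs:
            digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    print(dedup_exact(["Hello world", "hello world ", "Goodbye"]))  # ['Hello world', 'Goodbye']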

Not mentioned in the provided documents.

Not mentioned in the provided documents.

Not mentioned in the provided documents.

No explanation provided for this rating.


Evaluation

Several benchmark performance metrics are shared, including Humanity's Last Exam, GPQA Diamond, AIME 2024 and 2025, LiveCodeBench v5, MMMU, and Global MMLU. The model performs comparably to other reasoning models such as OpenAI's o3 and Anthropic's Claude 3.7.

Not specifically detailed in the provided documents.

Benchmark evaluations include GPQA, AIME, LiveCodeBench, SimpleQA, and MMMU, among others, but none of the implementations of these evaluations are described, so the results are not reproducible.
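
For contrast, the kind of detail a reproducible evaluation would pin down looks roughly like the following; every field and value here is hypothetical, and none of them are disclosed in the documents we assessed:

    # Hypothetical example of the metadata a reproducible benchmark run would disclose.
    eval_config = {
        "benchmark": "GPQA-diamond",
        "split": "test",
        "n_shots": 0,                        # few-shot examples in the prompt
        "prompt_template": "<exact prompt text shown to the model>",
        "sampling": {"temperature": 0.0, "top_p": 1.0, "max_output_tokens": 8192},
        "n_samples_per_item": 1,             # e.g. pass@1 vs. majority voting
        "scoring": "exact_match",            # grading rule for answers
    }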

The Gemini models were tested for safety against Google DeepMind's Frontier Safety Framework, which included testing whether the model generated hate speech, dangerous content (content that promotes violence), sexually explicit content, and medical advice that runs contrary to scientific consensus. The testing was conducted using human and automated red-teaming methods; safety assurance teams only share high-level feedback with model developers to prevent overfitting and misrepresentation of safety. Results are summarized in very broad terms, but suggest improved adherence to safety policies as compared to Gemini 1.5.

In addition, the models were tested against four Critical Capability Levels as follows:

  • CBRN Risk: A qualitative assessment showed that the model has detailed technical knowledge, but no consistent ability to lower barriers for malicious actors
  • Cybersecurity: On an automated benchmark, Pro was able to solve all "easy" questions, but only 1 of 13 "hard" questions
  • Machine Learning R&D: Gemini 2.5 Pro scored 73% on RE-Bench
  • Deceptive Alignment: A Gemini 2.5 Pro-powered agent was able to solve 2/5 and 2/11 questions from two automated benchmarks

All of these findings together suggest that the models have not reached a Critical Capability Level under the framework.

The developers implemented a number of mitigations, including dataset filtering, supervised fine-tuning, reinforcement learning from human and critic feedback, and product-level mitigations such as safety filtering. External documentation refers to a Chain-of-Thought Verifier: a 12B-parameter sub-model that critiques and refines outputs through a 7-stage internal debate, meant to act as an internal fact-checker.
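
To illustrate the pattern the external documentation describes, here is a toy generate-critique-refine loop in Python. The 7-stage debate and the 12B verifier are the external article's claims; every function below is a stand-in we invented, not Google's implementation:

    def generate(prompt):                 # stand-in for the base model
        return f"draft answer to: {prompt}"

    def critique(prompt, answer):         # stand-in for the verifier sub-model
        return "ok" if "[refined" in answer else "add a missing justification"

    def refine(answer, feedback):         # stand-in for a revision step
        return f"{answer} [refined per: {feedback}]"

    def answer_with_verifier(prompt, stages=7):
        answer = generate(prompt)
        for _ in range(stages):           # the claimed 7-stage internal debate
            feedback = critique(prompt, answer)
            if feedback == "ok":          # verifier is satisfied; stop early
                break
            answer = refine(answer, feedback)
        return answer

    print(answer_with_verifier("Why is the sky blue?"))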
