
Gemini 2.5

Gemini 2.5 is a family of thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Pro version has the most advanced reasoning capabilities, while the Flash model offers well-rounded capabilities and strikes a balance between performance and price.

Developer

Google

Country of Origin

USA

Systemic Risk

Open Data

Open Weight

API Access Only

Ratings

Overall Transparency

48%

Data Transparency

12%

Model Transparency

51%

Evaluation Transparency

37%

EU AI Act Readiness

45%

CAIT-D Readiness

20%

Transparency Assessment

The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in the Sources above. While external analysis may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.

Basic Details

Date of Release

March 24, 2025 (Gemini 2.5 Pro); April 17, 2025 (Gemini 2.5 Flash)


Methods of Distribution

Available through the Gemini API via Google AI Studio and Vertex AI, and in a dedicated dropdown in the Gemini app.
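
As an illustration of these two API distribution paths, here is a minimal sketch using the google-genai Python SDK. The package name, client options, and the model identifier string are drawn from Google's public API documentation rather than from the documents assessed here, so treat them as assumptions.

```python
# Minimal sketch of the two API access paths (assumes the google-genai Python SDK).
from google import genai

# Path 1: Gemini API via Google AI Studio, authenticated with an API key.
studio_client = genai.Client(api_key="YOUR_API_KEY")

# Path 2: the same models served through Vertex AI, authenticated with a
# Google Cloud project (project ID and region below are placeholders).
vertex_client = genai.Client(vertexai=True, project="your-project", location="us-central1")

# Either client exposes the model under the same identifier.
response = studio_client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the Frontier Safety Framework in one sentence.",
)
print(response.text)
```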


Modality

Multimodal. The model can process and understand multiple data types, including text, images, audio, PDF, and video. Outputs are text.
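
A hedged sketch of a multimodal request (text plus an inline image), again assuming the google-genai SDK; the helper shown (`types.Part.from_bytes`) comes from the public API reference, not from the assessed documentation.

```python
# Sketch of a multimodal (image + text) request; the output is text only.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("chart.png", "rb") as f:   # any local image file
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Describe what this chart shows.",
    ],
)
print(response.text)
```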


Input and Output Format

The model has an input token limit of 1,048,576 tokens and an output limit of 64k tokens. Additional detailed information on image, video and text inputs is available [here](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash).
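
To stay within these documented limits, a request can check its prompt size and cap its own output length. The sketch below assumes the google-genai SDK's `count_tokens` and `GenerateContentConfig`, and assumes the "64k" output ceiling means 65,536 tokens.

```python
# Sketch: working within the documented input/output token limits.
from google import genai
from google.genai import types

INPUT_TOKEN_LIMIT = 1_048_576    # documented context window
OUTPUT_TOKEN_LIMIT = 64 * 1024   # "64k" output limit, assumed to mean 65,536 tokens

client = genai.Client(api_key="YOUR_API_KEY")
prompt = "Write a detailed survey of transformer architectures."

# Optionally verify the prompt fits inside the context window first.
token_count = client.models.count_tokens(model="gemini-2.5-pro", contents=prompt)
assert token_count.total_tokens <= INPUT_TOKEN_LIMIT

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=prompt,
    config=types.GenerateContentConfig(max_output_tokens=OUTPUT_TOKEN_LIMIT),
)
print(response.text)
```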


License

Proprietary


Instructions for Use

Detailed documentation is available in the [API Docs](https://ai.google.dev/gemini-api/docs/quickstart).


Documentation Support
Medium Transparency

Documentation is available for the following: Gemini API reference, Gemini SDK reference, getting started guides, testing multimodal prompts, designing multimodal prompts, and an introduction to prompt design, as well as some code examples.


Changelog

The API changelog is available [here](https://ai.google.dev/gemini-api/docs/changelog).


Policy

Acceptable Use Policy

You can find the terms of service for Gemini [here](https://ai.google.dev/gemini-api/terms).


User Data

Gemini doesn't use user prompts or its responses as data to train its models. ([Source](https://cloud.google.com/gemini/docs/discover/data-governance))


Data Takedown

You can review and remove your conversation data at [myactivity.google.com/product/gemini](https://myactivity.google.com/product/gemini). Some conversations may already have been reviewed and annotated by human reviewers to train Google's machine learning models; this information is not deleted, as it is kept separately and is not connected to your Google Account. You can request the removal of content from Google's training data by following [this link](https://support.google.com/gemini?p=gemini_pntos_removal_request). The request must be in line with Google's policies or applicable laws.


AI Ethics Statement

Google provides their AI Principles statement on their website. Their approach involves bold innovation, responsible development and deployment, and collaborative progress. For more information you can see the page [here](https://ai.google/responsibility/principles/).


Incident Reporting

You can submit feedback and report problems with the responses provided by Gemini [here](https://support.google.com/gemini/answer/13275746).


Model and Training

Task Description
Medium Transparency

Gemini 2.5 models are designed as "thinking models", capable of reasoning through their thoughts before responding. This Chain of Thought (CoT) design helps them tackle increasingly complex problems with enhanced performance and improved accuracy.
The main limitations are around over-refusal and tone (the model may adopt an overly preachy tone).
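
Because this "thinking" behaviour is configurable at request time, a hedged sketch follows; the `ThinkingConfig` type and `thinking_budget` parameter are taken from the public Gemini API documentation and are assumptions here, not details from the assessed model card.

```python
# Sketch: controlling the reasoning ("thinking") budget per request
# (assumes the google-genai SDK's ThinkingConfig).
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="A bat and a ball cost $1.10 together; the bat costs $1.00 more than the ball. What does the ball cost?",
    config=types.GenerateContentConfig(
        # Cap the number of tokens the model may spend on internal reasoning.
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)
```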


Number of Parameters

128B base + 12B Verifier.


Model Design
Medium Transparency

According to an [external article](https://ashishchadha11944.medium.com/gemini-2-5-googles-revolutionary-leap-in-ai-architecture-performance-and-vision-c76afc4d6a06), Gemini 2.5 uses a hybrid MoE-Transformer design with a 128B-parameter Mixture of Experts (MoE) and 16 experts activated per query. Dynamic expert selection is based on problem complexity. A Chain-of-Thought Verifier (a 12B-parameter sub-model) critiques and refines outputs through a 7-stage internal debate. Dynamic Computation Allocation allocates 3-5x more compute resources to complex queries, and Hierarchical Memory prioritizes the most critical sections of long contexts.
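
The external article's description is architectural rather than concrete; the toy sketch below only illustrates the generic idea of top-k expert routing it refers to (many experts, a handful activated per token). It should not be read as Gemini's actual implementation, which is not public.

```python
import numpy as np

def toy_top_k_moe(x, router_w, expert_w, k=16):
    """Toy top-k mixture-of-experts routing for a single token vector x.

    Illustrative only: a router scores every expert, the k highest-scoring
    experts are activated, and their outputs are combined with softmax gates.
    """
    logits = router_w @ x                          # (n_experts,) routing scores
    chosen = np.argsort(logits)[-k:]               # indices of the k best experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                           # softmax over the chosen experts
    return sum(g * (expert_w[i] @ x) for g, i in zip(gates, chosen))

rng = np.random.default_rng(0)
d, n_experts = 64, 128
x = rng.normal(size=d)
router_w = rng.normal(size=(n_experts, d)) * 0.1
expert_w = rng.normal(size=(n_experts, d, d)) * 0.01
y = toy_top_k_moe(x, router_w, expert_w, k=16)     # only 16 of 128 experts are used
```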


Training Methodology
Unknown

While some information is available about the type of training data and compute required for training, the specific training methodology is not stated in the provided documents.


Computational Resources

Training Compute: 5.6×10²⁵ FLOPs (2.3x Gemini 2.0)
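
As a back-of-envelope check of the stated figures (simple arithmetic, not data from the documents), the ratio implies roughly 2.4×10²⁵ FLOPs of training compute for Gemini 2.0:

```python
gemini_2_5_flops = 5.6e25        # stated training compute for Gemini 2.5
ratio_vs_2_0 = 2.3               # stated multiplier over Gemini 2.0
implied_gemini_2_0_flops = gemini_2_5_flops / ratio_vs_2_0
print(f"{implied_gemini_2_0_flops:.2e}")   # ≈ 2.43e25 FLOPs
```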


Energy Consumption


System Architecture


Training Hardware

The models were trained using [Google's Tensor Processing Units](https://cloud.google.com/tpu?e=48754805&hl=en).


Data

Dataset Size

85T tokens (text), 2.1B images, 400M hours audio.


Dataset Description
Unknown

Not explicitly stated in the provided documents.


Data Sources
Unknown

Not explicitly detailed in the provided documents.


Data Collection - Human Labor
Unknown

Not mentioned in the provided documents.


Data Preprocessing
Unknown

Not mentioned in the provided documents.


Data Bias Detection
Unknown

Google claims that Gemini 2.5 uses techniques to find and fix biases, applied to both the training data and the model design. The stated goal is to create fair and unbiased AI, but no information is provided documenting how this is done.


Data Deduplication

The Model Card states that data deduplication techniques were applied.
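
The model card only states that deduplication was applied, without describing the method. The sketch below shows a generic exact-deduplication pass by content hash, purely as an illustration of what such a step typically involves; it is not Google's pipeline.

```python
import hashlib

def exact_dedup(documents):
    """Toy exact deduplication: drop documents whose normalized text has
    already been seen. Production pipelines typically add fuzzy methods
    (e.g. MinHash), which are not shown here."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(exact_dedup(["Hello world", "hello world ", "Something else"]))
# -> ['Hello world', 'Something else']
```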


Data Toxic and Hateful Language Handling

Not mentioned in the provided documents.


IP Handling in Data

Not mentioned in the provided documents.


Data PII Handling

Not mentioned in the provided documents.


Data Collection Period


Evaluation

Performance Evaluation
Low Transparency

Several benchmark performance metrics are shared, including Humanity's Last Exam, GPQA Diamond, AIME 2024 and 2025, LiveCodeBench v5, MMMU, and Global MMLU. The model performs comparably to other reasoning models such as OpenAI o3 and Claude 3.7 Sonnet.


Evaluation of Limitations
Unknown

Not specifically detailed in the provided documents.


Evaluation with Public Tools

Benchmark evaluations include GPQA, AIME, LiveCodeBench, SimpleQA, and MMMU, among others, but the implementations of these evaluations are not described, so the results are not reproducible.


Adversarial Testing Procedure
High Transparency

The Gemini models were tested for safety against Google DeepMind's [Frontier Safety Framework](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/updating-the-frontier-safety-framework/Frontier%20Safety%20Framework%202.0%20(1).pdf), which included testing whether the model generated hate speech, dangerous content (content that promotes violence), sexually explicit content, and medical advice that runs contrary to scientific consensus. The testing was conducted using human and automated red-teaming methods; safety assurance teams only share high-level feedback with model developers to prevent overfitting and misrepresenting safety. Results are summarized in a very broad manner, but suggest improved adherence to safety policies as compared to Gemini 1.5.

In addition, the models were tested against four Critical Capability Levels as follows:

- CBRN Risk: Qualitative assessment that showed that the model had detailed technical knowledge, but no consistent ability to lower barriers for malicious actors
- Cybersecurity: On an automated benchmark, Pro was able to solve all of the "easy" questions, but only 1 of 13 "hard" questions
- Machine Learning R&D: Gemini-Pro received 73% on the RE-Bench
- Deceptive Alignment: A Gemini-Pro-powered agent was able to solve 2/5 and 2/11 questions from two automated benchmarks.

All of these findings together suggest that the models have not reached a Critical Capability Level from the framework.


Model Mitigations
Low Transparency

The developers implemented a number of mitigations, including dataset filtering, supervised fine-tuning, reinforcement learning from human and critic feedback, and product-level mitigations like safety filtering. External documentation refers to a Chain-of-Thought Verifier: a 12B-parameter sub-model that critiques and refines outputs through a 7-stage internal debate and is meant to act as an internal fact-checker.