Model Rating Report
Gemini 2.5
Gemini 2.5 is a thinking model, capable of reasoning before responding, which results in enhanced performance and improved accuracy. The Pro version has the most advanced reasoning capabilities, while the Flash model offers well-rounded capabilities and strikes a balance between performance and price.
Developer
Google DeepMind
Country of Origin
USA
Systemic Risk
Open Data
Open Weight
API Access Only
Ratings
Overall Transparency
48%
Data Transparency
12%
Model Transparency
51%
Evaluation Transparency
37%
EU AI Act Readiness
45%
CAIT-D Readiness
20%
Transparency Assessment
The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in Sources below. While external analysis may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.
Sources
Release Announcement: https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-pro
Developer Docs: https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash
API Reference: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.5-flash-preview-04-17
Model Card (Flash): https://storage.googleapis.com/model-cards/documents/gemini-2.5-flash-preview.pdf
Model Card (Pro): https://storage.googleapis.com/model-cards/documents/gemini-2.5-pro-preview.pdf
External Blog Post: https://ashishchadha11944.medium.com/gemini-2-5-googles-revolutionary-leap-in-ai-architecture-performance-and-vision-c76afc4d6a06
Basic Details
Date of Release
March 24, 2025 (Gemini 2.5 Pro); April 17, 2025 (Gemini 2.5 Flash)
Methods of Distribution
Available through the Gemini API via Google AI Studio and Vertex AI, and in a dedicated dropdown in the Gemini app.
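As an illustration of the API route, here is a minimal sketch using the `google-genai` Python SDK. The model identifier and API-key handling are assumptions for illustration; consult the quickstart linked under Instructions for Use below.

```python
# Minimal sketch of calling Gemini 2.5 via the Gemini API with the
# google-genai Python SDK (pip install google-genai). Model name and
# API-key setup are assumptions; see the official quickstart.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # or set via environment variable
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize this report in two sentences.",
)
print(response.text)
```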
Modality
Multimodal. The model can process and understand multiple data types, including text, images, audio, PDF, and video. Outputs are text only.
Input and Output Format
The model has an input token limit of 1,048,576 and an output limit of 64K tokens. Additional detailed information on image, video, and text inputs is available [here](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash).
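The API provides a token-counting endpoint for checking inputs against these limits. A hedged sketch, again assuming the `google-genai` SDK:

```python
# Sketch: check a prompt against the documented 1,048,576-token input
# limit before sending it. Assumes the google-genai SDK; the prompt
# variable is a placeholder.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
INPUT_TOKEN_LIMIT = 1_048_576  # documented input limit

prompt = "..."  # the text you intend to send
count = client.models.count_tokens(model="gemini-2.5-flash", contents=prompt)
if count.total_tokens > INPUT_TOKEN_LIMIT:
    raise ValueError(f"Prompt exceeds input limit: {count.total_tokens} tokens")
```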
License
Proprietary
Instructions for Use
Detailed documentation is available in the [API Docs](https://ai.google.dev/gemini-api/docs/quickstart).
Documentation Support
Medium Transparency
Documentation is available for the following: Gemini API reference, Gemini SDK reference, getting-started guides, testing multimodal prompts, designing multimodal prompts, and an introduction to prompt design, as well as some code examples.
Changelog
https://ai.google.dev/gemini-api/docs/changelog
Policy
Acceptable Use Policy
You can find the terms of service for Gemini [here](https://ai.google.dev/gemini-api/terms).
User Data
Gemini doesn't use user prompts or its responses as data to train its models. ([Source](https://cloud.google.com/gemini/docs/discover/data-governance))
Data Takedown
You can review and remove your conversation data at [myactivity.google.com/product/gemini](https://myactivity.google.com/product/gemini). Some conversations may already have been reviewed and annotated by human reviewers to train Google's machine learning models; that information is stored separately from your Google Account and is not deleted with it. You can request the removal of content from Google's training data [here](https://support.google.com/gemini?p=gemini_pntos_removal_request). The request must be in line with Google's policies or applicable laws.
AI Ethics Statement
Google provides their AI Principles statement on their website. Their approach involves bold innovation, responsible development and deployment, and collaborative progress. For more information you can see the page [here](https://ai.google/responsibility/principles/).
Incident Reporting
You can submit feedback and report problems with the responses provided by Gemini [here](https://support.google.com/gemini/answer/13275746).
Model and Training
Task Description
Medium Transparency
Gemini 2.5 models are designed as "thinking models", capable of reasoning before responding. This Chain-of-Thought (CoT) design helps them tackle increasingly complex problems with enhanced performance and improved accuracy.
The main limitations are around over-refusal and tone (the model may adopt an overly preachy tone).
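The Gemini API exposes this thinking behavior as a configurable budget. A minimal sketch, assuming the `google-genai` Python SDK; the budget value is illustrative:

```python
# Sketch: configuring the "thinking" budget on Gemini 2.5 Flash via the
# google-genai SDK. The budget value is illustrative; 0 disables thinking
# on Flash.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How many prime numbers are there below 100?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)
```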
Number of Parameters
128B base + 12B verifier, per the external article listed in Sources; these figures are not officially disclosed by Google.
Model Design
Medium Transparency
According to an [external article](https://ashishchadha11944.medium.com/gemini-2-5-googles-revolutionary-leap-in-ai-architecture-performance-and-vision-c76afc4d6a06), Gemini 2.5 uses a hybrid MoE-Transformer design: a 128B-parameter Mixture of Experts (MoE) with 16 experts activated per query, where dynamic expert selection is based on problem complexity. A Chain-of-Thought Verifier (a 12B-parameter sub-model) critiques and refines outputs through a 7-stage internal debate. Dynamic Computation Allocation assigns 3-5x more compute to complex queries, and Hierarchical Memory prioritizes the most critical sections of long contexts. None of these details are confirmed in Google's own documentation.
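For readers unfamiliar with MoE routing, the following is a toy sketch of the top-k expert selection the article describes. It is purely illustrative: the 16-experts-per-query figure comes from the article's unverified claims, and the total expert count and embedding size here are stand-ins.

```python
# Toy illustration of top-k MoE expert routing (16 experts per query, per
# the external article). Purely illustrative; no shapes or mechanisms here
# are confirmed by Google's documentation.
import numpy as np

def route(tokens: np.ndarray, gate: np.ndarray, k: int = 16) -> np.ndarray:
    """tokens: (n, d) token embeddings; gate: (d, n_experts) router weights.
    Returns indices of the k highest-scoring experts for each token."""
    logits = tokens @ gate                       # (n, n_experts) router scores
    return np.argsort(logits, axis=-1)[:, -k:]   # top-k expert ids per token

rng = np.random.default_rng(0)
chosen = route(rng.normal(size=(4, 64)), rng.normal(size=(64, 128)))
print(chosen.shape)  # (4, 16): each of 4 tokens dispatched to 16 experts
```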
Training Methodology
Unknown
While some information is available about the type of training data and compute required for training, the specific training methodology is not stated in the provided documents.
Computational Resources
Training Compute: 5.6×10²⁵ FLOPs (2.3× Gemini 2.0), per the external article; not officially disclosed.
Energy Consumption
System Architecture
Training Hardware
The models were trained using [Google's Tensor Processing Units](https://cloud.google.com/tpu?e=48754805&hl=en).
Data
Dataset Size
85T tokens (text), 2.1B images, and 400M hours of audio, per the external article; not officially disclosed.
Dataset Description
Unknown
Not explicitly stated in the provided documents.
Data Sources
Unknown
Not explicitly detailed in the provided documents.
Data Collection - Human Labor
Unknown
Not mentioned in the provided documents.
Data Preprocessing
Unknown
Not mentioned in the provided documents.
Data Bias Detection
Unknown
Google claims to use techniques to find and mitigate biases, applied to both the training data and the model design. The stated goal is fair and unbiased AI, but no documentation is provided on how this is done.
Data Deduplication
The Model Card states that data deduplication techniques were applied.
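The Model Card does not say which techniques were used. For reference, here is a minimal sketch of one common approach, exact-match deduplication via content hashing; the normalization step is an assumption.

```python
# Minimal sketch of exact-match deduplication via content hashing.
# One common technique; the Model Card does not specify which
# deduplication methods Google actually used.
import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:          # keep the first occurrence only
            seen.add(digest)
            unique.append(doc)
    return unique

print(dedup_exact(["Hello world", "hello world ", "Goodbye"]))  # 2 items kept
```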
Data Toxic and Hateful Language Handling
Not mentioned in the provided documents.
IP Handling in Data
Not mentioned in the provided documents.
Data PII Handling
Not mentioned in the provided documents.
Data Collection Period
Evaluation
Performance Evaluation
Low Transparency
Several benchmark performance metrics are shared, including Humanity's Last Exam, GPQA Diamond, AIME 2024 and 2025, LiveCodeBench v5, MMMU, and Global MMLU. The model performs comparably to other reasoning models such as OpenAI's o3 and Anthropic's Claude 3.7 Sonnet.
Evaluation of Limitations
Unknown
Not specifically detailed in the provided documents.
Evaluation with Public Tools
Benchmark evaluations include GPQA, AIME, LiveCodeBench, SimpleQA, and MMMU, among others, but the implementations of these evaluations are not described, so the results are not reproducible.
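To make concrete what reproducibility would require, here is a hedged sketch of a minimal benchmark loop: fixed items, fixed order, and a stated scoring rule (exact match, chosen here as an assumption). A reproducible report would pin prompts, sampling settings, and scoring code in this way.

```python
# Sketch of a minimal reproducible evaluation loop. Exact-match scoring
# and the (question, answer) format are assumptions for illustration.
from typing import Callable

def evaluate(model: Callable[[str], str], items: list[tuple[str, str]]) -> float:
    """Score a model on fixed items in fixed order; same inputs, same score."""
    correct = sum(model(q).strip() == a for q, a in items)
    return correct / len(items)

# Hypothetical usage with a stub model:
items = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]
print(evaluate(lambda q: "4" if "+" in q else "Paris", items))  # 1.0
```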
Adversarial Testing Procedure
High Transparency
The Gemini models were tested for safety against Google DeepMind's [Frontier Safety Framework](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/updating-the-frontier-safety-framework/Frontier%20Safety%20Framework%202.0%20(1).pdf), which included testing whether the model generated hate speech, dangerous content (content that promotes violence), sexually explicit content, and medical advice that runs contrary to scientific consensus. The testing was conducted using human and automated red-teaming methods; safety assurance teams share only high-level feedback with model developers to prevent overfitting and misrepresenting safety. Results are summarized only very broadly, but suggest improved adherence to safety policies compared to Gemini 1.5.
In addition, the models were tested against four Critical Capability Levels as follows:
- CBRN Risk: Qualitative assessment showed that the model had detailed technical knowledge but no consistent ability to lower barriers for malicious actors
- Cybersecurity: On an automated benchmark, Pro was able to solve all "easy" questions, but only 1/13 of the "hard" ones
- Machine Learning R&D: Gemini 2.5 Pro received 73% on RE-Bench
- Deceptive Alignment: A Gemini 2.5 Pro-powered agent was able to solve 2/5 and 2/11 questions from two automated benchmarks.
All of these findings together suggest that the models have not reached a Critical Capability Level from the framework.
Model Mitigations
Low Transparency
The developers implemented a number of mitigations, including dataset filtering, supervised fine-tuning, reinforcement learning from human and critic feedback, and product-level mitigations such as safety filtering. External documentation refers to a Chain-of-Thought Verifier, a 12B-parameter sub-model that critiques and refines outputs through a 7-stage internal debate and is meant to act as an internal fact-checker; this is not confirmed in Google's own documentation.