Model Rating Report

Last Updated: April 7, 2025

Overview

Llama-3 Family

Llama-3 is a family of open-access large language models released by Meta. This page provides an analysis of both the "base" models and the "instruct" models.

Developer: Meta

Country of Origin: USA

Systemic Risk

Open Data

Open Weight

API Access Only

Ratings

Overall Transparency: 75%
Data Transparency: 60%
Model Transparency: 76%
Evaluation Transparency: 92%
EU AI Act Readiness: 81%
CAIT-D Readiness: 76%

Transparency Assessment

The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in the Sources above. While external analyses may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.

Basic Details

Llama-3 models were released in April and July 2024.

The models can be accessed through Hugging Face or downloaded directly from the Meta website.

Llama-3 is a text-to-text generative model.

Prompt and output formatting details are available in the Hugging Face and GitHub documentation. All models support sequence lengths of up to 8,192 tokens.
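For illustration, the sketch below shows how the documented chat format can be applied through the Hugging Face transformers chat template; the model ID and message contents are illustrative examples rather than details taken from the documentation above.

```python
# Sketch: building a Llama-3 Instruct prompt via the Hugging Face chat template.
# The model ID and message contents are illustrative examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Llama-3 license terms."},
]

# add_generation_prompt=True appends the assistant header so the model
# knows to start generating a reply.
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
print(prompt)  # prints the prompt with the <|start_header_id|>/<|eot_id|> markers
```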

Llama-3 is released under a custom license with some limitations on commercial use. The license requires explicit attribution: derived models, for instance, need to include "Llama 3" at the beginning of their name, and derivative works or services must state "Built with Meta Llama 3". License: https://llama.meta.com/llama3/license/

Instructions for Use are available on GitHub and on Hugging Face.

No explanation provided for this rating.

Llama-3 was released as a one-time release rather than a continuously updated model.


Policy

The Acceptable Use Policy is available here: https://llama.meta.com/llama3/use-policy/

Meta user data is not used to train Llama models.

No explanation provided for this rating.

Meta details its approach to responsible AI development here.

Reporting issues with the model: https://github.com/meta-llama/llama-models/issues. Reporting risky content generated by the model: developers.facebook.com/llamaoutputfeedback


Model and Training

The base models are intended to be adapted for a variety of natural language and code generation tasks, while the -Instruct models are intended to be used for assistant-like chat. Evaluations on several common reasoning and knowledge benchmarks are published.

The documentation discusses in general terms that some safety protections were built into the -Instruct models, but notes that inaccurate, biased, or otherwise objectionable responses to user prompts are possible. In addition, the documentation states that most evaluation was done on English-language content and that consistent performance in other languages is not guaranteed.

The model family consists of 8B and 70B parameter variants.

Llama-3 is based on the Llama-2 architecture, which is discussed in detail in that model's paper (see our Llama-2 rating page for more details). Two key architecture changes are included: the vocabulary is expanded to 128k tokens to encode language more effectively, and Grouped-Query Attention (GQA) is used for increased inference efficiency.
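As a rough illustration of grouped-query attention (our own sketch, not Meta's implementation), the snippet below shares each key/value head across a group of query heads; the head counts are arbitrary example values.

```python
# Minimal grouped-query attention sketch (illustrative; not Meta's code).
# Several query heads share one key/value head, which shrinks the KV cache.
import torch

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    # q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> (batch, heads, seq, head_dim)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return (weights @ v).transpose(1, 2)               # -> (batch, seq, heads, head_dim)
```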

The training process is described in general terms: pre-training for the base models, followed by a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO) for the -Instruct models.
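For context on the last of those techniques, the sketch below shows a generic direct preference optimization loss computed from log-probabilities of a preferred and a dispreferred response under the policy and a frozen reference model; this is a textbook formulation, not Meta's training code, and the beta value is an arbitrary example.

```python
# Generic DPO loss sketch (illustrative; not Meta's training code).
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of policy vs. reference for the preferred and dispreferred responses.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Encourage a positive margin between the two ratios, scaled by beta.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```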

The release announcement provides a short discussion of the pre-training process, such as the parallelization strategies used.

Information about the hyperparameters used during training was not identified.

According to the Model Card, the models were trained on Meta's Research SuperCluster. A total of 7.7M GPU hours of computation on H100-80GB hardware were used to train the family of models.

According to the Model Card, the estimated total emissions were 2290 tCO2eq.
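As a rough sanity check (our own back-of-the-envelope arithmetic, not part of Meta's documentation), the reported GPU hours are broadly consistent with the reported emissions if one assumes a ~700 W per-GPU power draw and a grid carbon intensity of roughly 0.425 kgCO2eq/kWh; both values are our assumptions.

```python
# Back-of-the-envelope emissions check (power draw and grid intensity are assumed values).
gpu_hours = 7.7e6          # reported total H100-80GB GPU hours
power_kw = 0.7             # assumed ~700 W average power per GPU
grid_kg_per_kwh = 0.425    # assumed grid carbon intensity, kgCO2eq/kWh

energy_kwh = gpu_hours * power_kw
emissions_t = energy_kwh * grid_kg_per_kwh / 1000
print(f"{emissions_t:.0f} tCO2eq")  # ~2291 tCO2eq, close to the Model Card figure
```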

No explanation provided for this rating.

No explanation provided for this rating.


Data

According to the Model Card, Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples.

The pre-training dataset consists of a mix of roughly 50% general-knowledge tokens (primarily filtered web data), 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.

The post-training preference dataset was collected internally at Meta and involved annotators ranking the strength of their preference on one of four levels. The preference conversation topics included general English interactions (50%), reasoning and tool use (21%), and coding (15%).

The supervised fine-tuning data consisted of human-annotated examples, synthetic data, and additional curated datasets discussed in the report.

The pre-training data consists of a curated set of web data, collected via an unspecified process. The post-training datasets consisted of human-annotated prompt-output pairs and of new synthetic data that targeted coding, multi-lingual, mathematical, long-context and tool-use capabilities. Details about how the different post-training datasets were collected are available in the report.

The documentation references human-annotated examples used during fine-tuning, but does not provide any details about how these examples were produced.

The pre-training web data was processed using a multi-step pipeline:

  1. PII and safety filtering (excluding websites known to contain harmful or adult content)
  2. Custom HTML parsing
  3. Deduplication at the URL, document, and line level
  4. Heuristic filtering to remove low-quality data (e.g. lines consisting of duplicated content such as error messages)
  5. Model-based quality filtering (using Llama-2 to assess quality)

Separate filtering and extraction methods were used for English, multilingual, and code/math documents. Post-training data, which was largely synthetically generated, was cleaned to remove undesired patterns (e.g. an overly apologetic tone).
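The sketch below illustrates the flavour of two of those steps, line-level deduplication and heuristic filtering, in simplified form; it is our own illustration with arbitrary thresholds, not Meta's actual pipeline.

```python
# Simplified line-level deduplication and heuristic filtering
# (illustrative only; thresholds are arbitrary example values).
from collections import Counter

def clean_documents(documents, max_line_repeats=3, min_line_chars=10):
    # Count how often each exact line appears across the corpus.
    line_counts = Counter(line for doc in documents for line in doc.splitlines())
    cleaned = []
    for doc in documents:
        kept = [
            line for line in doc.splitlines()
            if len(line.strip()) >= min_line_chars      # drop very short / boilerplate lines
            and line_counts[line] <= max_line_repeats   # drop lines duplicated across documents
        ]
        if kept:
            cleaned.append("\n".join(kept))
    return cleaned
```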

No preprocessing or analyses relating to bias could be identified.

Deduplication is implemented at the URL, document, and line level.

Filters are implemented to remove data from domains ranked as harmful by Meta safety standards and domains known to contain adult content.

No explanation provided for this rating.

Filters are implemented to remove data from websites known to contain high volumes of PII.

The knowledge cut-off for the pre-training dataset is at the end of 2023.


Evaluation

The performance of the Llama models was evaluated on general knowledge, mathematical reasoning, coding, multilinguality, tool-use, and long-context benchmarks. Llama-3-8B performs similarly to models of the same size, while Llama-3-405B generally outperforms other available open-source models and lags slightly behind GPT-4 and Gemini. The analysis included additional tests for model robustness (e.g. accounting for variability in multiple-choice benchmarks), adversarial benchmarks, and a contamination analysis. Additional quality evaluation was conducted by asking human evaluators to rank outputs from Llama-3-405B and GPT-4o on multiple task types; Llama outperformed GPT-4o on 2 out of 3 task types.

The safety and helpfulness of the Llama models were measured on an internal benchmark that evaluated violation rates (i.e. the model producing a response to an unsafe prompt) and false refusal rates (i.e. the model refusing to produce a response to a safe prompt). This evaluation was separated out by language and included categories for advanced capabilities such as tool usage and long-context question answering. Llama-3 generally performed similarly to competitors and maintained low violation and false refusal rates, showing a balance between helpfulness and safety.
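To make the two metrics concrete, the sketch below computes them from prompts labeled safe or unsafe and responses labeled as refused or answered; this is our own illustration of the definitions above, not Meta's evaluation code.

```python
# Illustration of the two safety metrics described above (not Meta's evaluation code).
# Each record is a pair: (prompt_is_unsafe: bool, model_refused: bool).
def safety_rates(records):
    unsafe = [refused for is_unsafe, refused in records if is_unsafe]
    safe = [refused for is_unsafe, refused in records if not is_unsafe]
    # Violation rate: fraction of unsafe prompts that the model nevertheless answered.
    violation_rate = sum(1 for refused in unsafe if not refused) / max(len(unsafe), 1)
    # False refusal rate: fraction of safe prompts that the model refused.
    false_refusal_rate = sum(1 for refused in safe if refused) / max(len(safe), 1)
    return violation_rate, false_refusal_rate
```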

https://github.com/meta-llama/llama-models/blob/main/models/llama31/evaldetails.md

A multi-disciplinary red team measured risks across a 13-topic safety taxonomy. A combination of manual and automated prompting techniques was used across single- and multi-turn conversations to obtain a comprehensive understanding of vulnerabilities.

Additional adversarial testing measured chemical/biological weapon and cybersecurity risks. For chemical/biological weapon risks, an uplift study was conducted in which individuals' ability to generate fictitious plans for a chemical/biological attack was evaluated in two set-ups: access to the internet alone vs. access to Llama and the internet. This exercise showed that Llama did not cause a significant uplift in the ability to complete malicious tasks. For cybersecurity, the CyberSecEval benchmark was complemented by a new benchmark for spear phishing and autonomous cyberattacks. The developers found that Llama 3 does not have significant susceptibilities to generating malicious code or exploiting vulnerabilities. In addition, "uplift testing" showed that Llama did not significantly aid either expert or novice individuals in a mock cyberattack task, compared to a cohort with access to the internet alone.

Several types of mitigations were incorporated into the Llama models:


  1. Pre-training data underwent significant filtering to remove unsafe content.
  2. The developers trained the model in a manner that reduced memorization (i.e. the model outputting an exact piece of text from the training data) to avoid generating PII. An experiment showed that verbatim memorization was under 1% (a simplified sketch of such a check follows this list).
  3. During post-training, a dataset of human-created and synthetic examples was used to teach the model which types of inputs should be refused. This dataset was incorporated both into supervised fine-tuning and direct preference optimization.
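The sketch below shows one simple way a verbatim-memorization check like the one mentioned in item 2 can be framed: prompt the model with a prefix taken from a training document and test whether the greedy continuation reproduces the true continuation exactly. It is our own simplified illustration, with an assumed model_generate helper and arbitrary prefix/continuation lengths, not the procedure described in the report.

```python
# Simplified verbatim-memorization probe (illustrative; not Meta's procedure).
# model_generate(prefix_tokens, n) is an assumed helper returning n greedily
# decoded continuation tokens; samples are token lists drawn from training data.
def memorization_rate(model_generate, samples, prefix_len=50, cont_len=50):
    hits = 0
    for tokens in samples:
        prefix = tokens[:prefix_len]
        true_continuation = tokens[prefix_len:prefix_len + cont_len]
        generated = model_generate(prefix, len(true_continuation))
        if generated == true_continuation:  # exact (verbatim) match counts as memorized
            hits += 1
    return hits / len(samples)
```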

The report details the effect of these safeguards in terms of violation and false refusal rates (when a model incorrectly refuses a safe request) across languages and scenarios, and shows that the Llama models perform competitively and achieve a solid balance between helpfulness and safety.
In addition, the developers released Llama Guard, an additional classifier model designed to detect unsafe prompts, and show that it significantly decreases unsafe text generations.
