Model Rating Report
Llama-3 Family
Llama-3 is a family of open-access large language models released by Meta. This page provides an analysis of both the "base" models and the "instruct" models.
Developer
Meta
Country of Origin
USA
Systemic Risk
Open Data
Open Weight
API Access Only
Ratings
Overall Transparency
75%
Data Transparency
60%
Model Transparency
76%
Evaluation Transparency
92%
EU AI Act Readiness
81%
CAIT-D Readiness
76%
Transparency Assessment
The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in the Sources section. While external analysis may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.
Sources
Hugging Face Hub: https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
Model Card: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
Release Announcement: https://ai.meta.com/blog/meta-llama-3/
Research Paper: https://arxiv.org/pdf/2407.21783
Basic Details
Date of Release
Llama-3 models were released in April and July 2024.
Methods of Distribution
The models can be accessed through Hugging Face or downloaded directly from the Meta website.
Modality
Llama-3 is a text-to-text generative model.
Input and Output Format
Prompt and output formatting is described in the Hugging Face and GitHub documentation. All models support a sequence length of up to 8192 tokens.
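For illustration, a minimal sketch of prompting an instruct variant through the Hugging Face transformers library is shown below. The model ID and the chat-template behavior follow the linked Hugging Face documentation and should be verified there; this sketch is not part of the official documentation.

```python
# Minimal sketch: prompting a Llama-3 instruct variant via transformers.
# The model ID is assumed from the Hugging Face collection linked above;
# access is gated and requires accepting Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize what grouped-query attention is."},
]
# apply_chat_template inserts the special header/end-of-turn tokens the
# instruct models expect, per the Hugging Face documentation.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```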
License
Llama-3 is released under a custom license with some limitations on commercial use. The license requires explicit attribution: derived models, for instance, must include "Llama 3" at the beginning of their name, and derivative works or services must state "Built with Meta Llama 3". License: https://llama.meta.com/llama3/license/
Instructions for Use
Instructions for Use are available on GitHub and on Hugging Face.
Documentation Support
Medium Transparency
Changelog
Llama-3 is a one-time release.
Policy
Acceptable Use Policy
The Acceptable Use Policy is available here: https://llama.meta.com/llama3/use-policy/
User Data
Meta user data is not used to train Llama models.
Data Takedown
AI Ethics Statement
Meta details its approach to responsible AI development here: https://ai.meta.com/blog/responsible-ai-connect-2024/
Incident Reporting
Reporting issues with the model: https://github.com/meta-llama/llama-models/issues. Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
Model and Training
Task Description
Medium Transparency
The base models are intended to be adapted for a variety of natural language and code generation tasks, while the instruct models are intended to be used for assistant-like chat. Evaluation on several common reasoning and knowledge benchmarks is published.
The documentation notes in general terms that some safety protections were built into the instruct models, but inaccurate, biased, or otherwise objectionable responses to user prompts remain possible. In addition, the documentation states that most evaluation was done on English-language content and consistent performance in other languages is not guaranteed.
Number of Parameters
The model family consists of 8B and 70B parameter variants.
Model Design
High Transparency
Llama-3 is based on the Llama-2 architecture, which is discussed in detail in that model's paper (see our Llama-2 rating page for more details). Two key architectural changes are included: the vocabulary is expanded to 128k tokens to encode language more efficiently, and Grouped-Query Attention (GQA) is used for improved inference efficiency.
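To illustrate the GQA change, a minimal sketch in PyTorch is shown below: several query heads share each key/value head, shrinking the KV projections and cache. The head counts and dimensions are illustrative assumptions, not Llama-3's actual configuration.

```python
# Minimal sketch of grouped-query attention (GQA); sizes are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 16, 512
n_q_heads, n_kv_heads = 8, 2               # 4 query heads share each KV head
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq_len, d_model)

# Full-width projection for queries, narrower projections for keys/values.
w_q = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = w_q(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = w_k(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = w_v(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Broadcast each KV head to its group of query heads.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 16, 512])
```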
Training Methodology
Medium Transparency
The training process is described in general terms: pre-training for the base models, followed by a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO) for the instruct models.
The release announcement provides a short discussion of the pre-training process, such as the parallelization strategies used.
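For reference, a minimal sketch of the standard DPO objective is shown below. The per-sequence log-probabilities are placeholders; this is a generic formulation of DPO, not Meta's training code.

```python
# Minimal sketch of the direct preference optimization (DPO) loss.
# In practice, the log-probabilities come from the policy and a frozen
# reference model scored on the same chosen/rejected completions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Prefer the chosen completion by a margin measured relative to the
    reference model; beta controls the strength of the KL-like constraint."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with random per-sequence log-probabilities.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```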
Hyperparameters
Unknown
Information about the hyperparameters used during training was not identified.
Computational Resources
According to the Model Card, the models were trained on Meta's Research SuperCluster. A total of 7.7M GPU hours of computation on H100-80GB hardware was used to train the model family.
Energy Consumption
According to the Model Card, the estimated total emissions were 2290 tCO2eq.
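A back-of-the-envelope check of the two reported figures is shown below; the per-GPU-hour intensity is derived here, not stated in the documentation.

```python
# Derived carbon intensity from the Model Card figures quoted above.
gpu_hours = 7.7e6          # total H100-80GB GPU hours
emissions_tco2eq = 2290    # estimated total emissions

kg_per_gpu_hour = emissions_tco2eq * 1000 / gpu_hours
print(f"{kg_per_gpu_hour:.2f} kgCO2eq per GPU hour")  # ~0.30
```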
System Architecture
Not Applicable
Training Hardware
Data
Dataset Size
According to the Model Card, Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples.
Dataset Description
Medium Transparency
The pre-training dataset consists of a mix of about 50% general-knowledge tokens (primarily filtered web data), 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
The post-training preference dataset was collected internally at Meta and involved annotators ranking the strength of their preference on one of four levels. The preference conversation topics included general English interactions (50%), reasoning and tool use (21%), and coding (15%).
The supervised fine-tuning data consisted of human-annotated examples, synthetic data, and additional curated datasets discussed in the report.
Data Sources
Medium Transparency
The pre-training data consists of a curated set of web data, collected via an unspecified process. The post-training datasets consisted of human-annotated prompt-output pairs and of new synthetic data that targeted coding, multi-lingual, mathematical, long-context and tool-use capabilities. Details about how the different post-training datasets were collected are available in the report.
Data Collection - Human Labor
Unknown
The documentation references human-annotated examples used during fine-tuning, but does not provide any details about how these examples were produced.
Data Preprocessing
High Transparency
The pre-training web data was processed using a multi-step pipeline that included PII and safety filtering (by excluding websites known to contain harmful and adult content), custom HTML parsing, deduplication (at the URL, document, and line level), heuristic filtering to remove low-quality data (e.g. removing lines consisting of duplicated content such as error messages), and model-based quality filtering (using Llama-2 for quality assessment). Separate filtering and extraction methods were used for English, multilingual, and code/math documents. Post-training data, which was largely synthetically generated, was cleaned to remove undesired observed patterns (e.g. an overly apologetic tone).
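To make the line-level steps concrete, a simplified sketch of heuristic filtering and exact line deduplication is shown below. The actual pipeline (HTML parsing, URL/document deduplication, model-based quality scoring) is far more involved, and the thresholds here are illustrative assumptions.

```python
# Simplified sketch of heuristic line filtering + exact line deduplication.
from collections import Counter

def clean_document(text: str, min_line_chars: int = 20, max_repeats: int = 3) -> str:
    lines = [ln.strip() for ln in text.splitlines()]
    counts = Counter(lines)
    seen, kept = set(), []
    for ln in lines:
        if len(ln) < min_line_chars:      # heuristic: drop very short lines
            continue
        if counts[ln] > max_repeats:      # heuristic: drop heavily repeated boilerplate
            continue
        if ln in seen:                    # exact line-level deduplication
            continue
        seen.add(ln)
        kept.append(ln)
    return "\n".join(kept)

doc = (
    "Error: timeout\nError: timeout\nError: timeout\nError: timeout\n"
    "A long informative sentence about the topic.\n"
    "A long informative sentence about the topic."
)
print(clean_document(doc))  # keeps one copy of the informative line
```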
Data Bias Detection
Unknown
No preprocessing or analyses relating to bias could be identified.
Data Deduplication
Deduplication is implemented at the URL, document, and line level.
Data Toxic and Hateful Language Handling
Filters are implemented to remove data from domains ranked as harmful by Meta safety standards and domains known to contain adult content.
IP Handling in Data
Data PII Handling
Filters are implemented to remove data from websites known to contain high volumes of PII.
Data Collection Period
The knowledge cut-off for the pre-training dataset is at the end of 2023.
Evaluation
Performance Evaluation
High Transparency
The performance of the Llama models was evaluated on general knowledge, mathematical reasoning, coding, multilinguality, tool-use, and long-context benchmarks. Llama-3-8B performs similarly to models of the same size, while Llama-3-405B generally outperforms other available open-source models and lags slightly behind GPT-4 and Gemini. The analysis included additional tests for model robustness (e.g. accounting for variability in multiple-choice benchmarks), adversarial benchmarks, and a contamination analysis. Additional quality evaluation was conducted by asking human evaluators to rank outputs from Llama-3-405B and GPT-4o on multiple task types; Llama outperformed GPT-4o on 2 out of 3 task types.
Evaluation of Limitations
Medium Transparency
The safety and helpfulness of the Llama models was measured on an internal benchmark that evaluated violation rates (i.e. the model producing a response to an unsafe prompt) and false refusal rates (i.e. the model refusing to produce a response to a safe prompt). This evaluation was separated out by language and included categories for advanced capabilities like tool usage and long-context question answering. Llama-3 generally performed similarly to competitors and maintained low violation and false refusal rates, showing a balance between helpfulness and safety.
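For clarity, a minimal sketch of how these two metrics can be computed from labeled evaluation records is shown below. The record format is an assumption for illustration and is not Meta's internal benchmark schema.

```python
# Minimal sketch of violation rate and false refusal rate over labeled records.
def safety_rates(records):
    """records: list of dicts with boolean keys 'prompt_is_safe' and 'model_refused'."""
    unsafe = [r for r in records if not r["prompt_is_safe"]]
    safe = [r for r in records if r["prompt_is_safe"]]
    # Violation: the model answered an unsafe prompt instead of refusing.
    violation_rate = sum(not r["model_refused"] for r in unsafe) / max(len(unsafe), 1)
    # False refusal: the model refused a safe prompt.
    false_refusal_rate = sum(r["model_refused"] for r in safe) / max(len(safe), 1)
    return violation_rate, false_refusal_rate

records = [
    {"prompt_is_safe": False, "model_refused": True},
    {"prompt_is_safe": False, "model_refused": False},   # violation
    {"prompt_is_safe": True,  "model_refused": False},
    {"prompt_is_safe": True,  "model_refused": True},    # false refusal
]
print(safety_rates(records))  # (0.5, 0.5)
```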
Evaluation with Public Tools
https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md
Adversarial Testing Procedure
High Transparency
A multi-disciplinary red team measured risks across a 13-topic safety taxonomy. A combination of manual and automated prompting techniques was used across single- and multi-turn conversations to build a comprehensive understanding of vulnerabilities.
Additional adversarial testing measured chemical/biological weapon and cybersecurity risks. For chemical/biological weapon risks, an uplift study was conducted in which individuals' ability to generate fictitious plans for a chemical/biological attack was evaluated in two set-ups: access to the internet vs. access to Llama and the internet. This exercise showed that Llama did not cause a significant uplift in the ability to complete malicious tasks. For cybersecurity, the CyberSecEval benchmark was complemented by a new benchmark for spear phishing and autonomous cyberattacks. The developers found that Llama 3 is not significantly susceptible to generating malicious code or exploiting vulnerabilities. In addition, uplift testing showed that Llama did not significantly aid either expert or novice individuals in a mock cyberattack task, compared to a cohort with access to the internet alone.
Model Mitigations
High Transparency
Several types of mitigations were incorporated into the Llama models:
1. Pre-training data underwent significant filtering to remove unsafe content.
2. The developers trained the model in a manner that reduced memorization (i.e. the model outputting an exact piece of text from the training data) to avoid generating PII. An experiment showed that verbatim memorization was under 1%.
3. During post-training, a dataset of human-created and synthetic examples was used to teach the model which types of inputs should be refused. This dataset was incorporated into both supervised fine-tuning and direct preference optimization.
The report details the effect of these safeguards in terms of violation rates and false refusal rates (when a model incorrectly refuses a safe request) across languages and scenarios, and shows that the Llama models perform competitively and achieve a solid balance between helpfulness and safety.
In addition, the developers released Llama-Guard, a separate classifier model designed to detect unsafe prompts, and show that it significantly decreases unsafe text generations.
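For illustration, a minimal sketch of screening a conversation with a Llama Guard model through Hugging Face transformers is shown below. The model ID and the exact output format ("safe" / "unsafe" plus a category code) are assumptions that should be verified against Meta's documentation.

```python
# Minimal sketch: moderating a conversation with a Llama Guard classifier.
# The model ID is an assumption; access is gated behind Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(guard_id, device_map="auto")

chat = [
    {"role": "user", "content": "How do I make a fake ID?"},
]
# The Llama Guard tokenizer's chat template wraps the conversation in the
# moderation prompt the classifier was trained on.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
output = guard.generate(input_ids, max_new_tokens=32)
# Expected to print a safety verdict, e.g. "safe" or "unsafe" with a category.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```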