Model Rating Report
Llama-3 Family
Llama-3 is a family of open-access large language models released by Meta. This page provides an analysis of both the "base" models and the "instruct" models.
Developer
Meta
Country of Origin
USA
Systemic Risk
Open Data
Open Weight
API Access Only
Ratings
Overall Transparency
75%
Data Transparency
60%
Model Transparency
76%
Evaluation Transparency
92%
EU AI Act Readiness
81%
CAIT-D Readiness
76%
Transparency Assessment
The transparency assessment evaluates how clear and detailed the model creators are about their practices. Our assessment is based on the official documentation listed in the Sources section. While external analysis may contain additional details about this system, our goal is to evaluate the transparency of the providers themselves.
Sources
Hugging Face Hub: https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
Model Card: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
Release Announcement: https://ai.meta.com/blog/meta-llama-3/
Research Paper: https://arxiv.org/pdf/2407.21783
Basic Details
Date of Release
Llama-3 models were released in April and July 2024.
Methods of Distribution
The models can be accessed through Hugging Face or downloaded directly from the Meta website.
Modality
Llama-3 is a text-to-text generative model.
Input and Output Format
Prompt and output formatting is described in the Hugging Face and GitHub documentation. All models support a sequence length of up to 8192 tokens.
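For illustration, a minimal sketch of prompting an instruct variant through the Hugging Face transformers library is shown below. The model ID and the chat-template behavior follow the linked Hugging Face documentation and should be verified there; this sketch is not part of the official documentation.

```python
# Minimal sketch: prompting a Llama-3 instruct variant via transformers.
# The model ID is assumed from the Hugging Face collection linked above;
# access is gated and requires accepting Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize what grouped-query attention is."},
]
# apply_chat_template inserts the special header/end-of-turn tokens the
# instruct models expect, per the Hugging Face documentation.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```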
License
Llama-3 is released under a custom license with some limitations on commercial use. The license requires explicit attribution: derived models, for instance, must include "Llama 3" at the beginning of their name, and derivative works or services must state "Built with Meta Llama 3". License: https://llama.meta.com/llama3/license/
Instructions for Use
Instructions for Use are available on GitHub and on Hugging Face.
Documentation Support
Medium Transparency
Changelog
Llama-3 is a one-time release.
Policy
Acceptable Use Policy
The Acceptable Use Policy is available here: https://llama.meta.com/llama3/use-policy/
User Data
Meta user data is not used to train Llama models.
Data Takedown
AI Ethics Statement
Meta details its approach to responsible AI development here: https://ai.meta.com/blog/responsible-ai-connect-2024/
Incident Reporting
Reporting issues with the model: https://github.com/meta-llama/llama-models/issues. Reporting risky content generated by the model: developers.facebook.com/llama_output_feedback
Model and Training
Task Description
Medium Transparency
The base models are intended to be adapted for a variety of natural language and code generation tasks, while the instruct models are intended to be used for assistant-like chat. Evaluation on several common reasoning and knowledge benchmarks is published.
The documentation notes in general terms that some safety protections were built into the instruct models, but inaccurate, biased, or otherwise objectionable responses to user prompts remain possible. In addition, the documentation states that most evaluation was done on English-language content and consistent performance in other languages is not guaranteed.
Number of Parameters
The model family consists of 8B and 70B parameter variants.
Model Design
High Transparency
Llama-3 is based on the Llama-2 architecture, which is discussed in detail in that model's paper (see our Llama-2 rating page for more details). Two key architectural changes are included: the vocabulary is expanded to 128k tokens to encode language more efficiently, and Grouped-Query Attention (GQA) is used for improved inference efficiency.
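To illustrate the GQA change, a minimal sketch in PyTorch is shown below: several query heads share each key/value head, shrinking the KV projections and cache. The head counts and dimensions are illustrative assumptions, not Llama-3's actual configuration.

```python
# Minimal sketch of grouped-query attention (GQA); sizes are illustrative.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 2, 16, 512
n_q_heads, n_kv_heads = 8, 2               # 4 query heads share each KV head
head_dim = d_model // n_q_heads

x = torch.randn(batch, seq_len, d_model)

# Full-width projection for queries, narrower projections for keys/values.
w_q = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
w_k = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
w_v = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

q = w_q(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = w_k(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = w_v(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Broadcast each KV head to its group of query heads.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
print(out.shape)  # torch.Size([2, 16, 512])
```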
Training Methodology
Medium Transparency
The training process is described in general terms: pre-training for the base models, followed by a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO) for the instruct models.
The release announcement provides a short discussion of the pre-training process, such as the parallelization strategies used.
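For reference, a minimal sketch of the standard DPO objective is shown below. The per-sequence log-probabilities are placeholders; this is a generic formulation of DPO, not Meta's training code.

```python
# Minimal sketch of the direct preference optimization (DPO) loss.
# In practice, the log-probabilities come from the policy and a frozen
# reference model scored on the same chosen/rejected completions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Prefer the chosen completion by a margin measured relative to the
    reference model; beta controls the strength of the KL-like constraint."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with random per-sequence log-probabilities.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```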
Hyperparameters
Unknown
Information about the hyperparameters used during training was not identified.
Computational Resources
According to the Model Card, the models were trained on Meta's Research SuperCluster. A total of 7.7M GPU hours of computation on H100-80GB hardware was used to train the model family.
Energy Consumption
According to the Model Card, the estimated total emissions were 2290 tCO2eq.
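A back-of-the-envelope check of the two reported figures is shown below; the per-GPU-hour intensity is derived here, not stated in the documentation.

```python
# Derived carbon intensity from the Model Card figures quoted above.
gpu_hours = 7.7e6          # total H100-80GB GPU hours
emissions_tco2eq = 2290    # estimated total emissions

kg_per_gpu_hour = emissions_tco2eq * 1000 / gpu_hours
print(f"{kg_per_gpu_hour:.2f} kgCO2eq per GPU hour")  # ~0.30
```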
System Architecture
Not Applicable
Training Hardware
Data
Dataset Size
According to the Model Card, Llama 3 was pretrained on over 15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 10M human-annotated examples.
Dataset Description
Medium Transparency
The pre-training dataset consists of a mix of about 50% general-knowledge tokens (primarily filtered web data), 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens.
The post-training preference dataset was collected internally at Meta and involved annotators ranking the strength of their preference on one of four levels. The preference conversation topics included general English interactions (50%), reasoning and tool use (21%), and coding (15%).
The supervised fine-tuning data consisted of human-annotated examples, synthetic data, and additional curated datasets discussed in the report.
Data Sources
Medium Transparency
The pre-training data consists of a curated set of web data, collected via an unspecified process. The post-training datasets consisted of human-annotated prompt-output pairs and of new synthetic data that targeted coding, multi-lingual, mathematical, long-context and tool-use capabilities. Details about how the different post-training datasets were collected are available in the report.
Data Collection - Human Labor
Unknown
The documentation references human-annotated examples used during fine-tuning, but does not provide any details about how these examples were produced.
Data Preprocessing
High Transparency
The pre-training web data was processed using a multi-step pipeline that included PII and safety filtering (by excluding websites known to contain harmful and adult content), custom HTML parsing, deduplication (at the URL, document, and line level), heuristic filtering to remove low-quality data (e.g. removing lines consisting of duplicated content such as error messages), and model-based quality filtering (using Llama-2 for quality assessment). Separate filtering and extraction methods were used for English, multilingual, and code/math documents. Post-training data, which was largely synthetically generated, was cleaned to remove undesired observed patterns (e.g. an overly apologetic tone).
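To make the line-level steps concrete, a simplified sketch of heuristic filtering and exact line deduplication is shown below. The actual pipeline (HTML parsing, URL/document deduplication, model-based quality scoring) is far more involved, and the thresholds here are illustrative assumptions.

```python
# Simplified sketch of heuristic line filtering + exact line deduplication.
from collections import Counter

def clean_document(text: str, min_line_chars: int = 20, max_repeats: int = 3) -> str:
    lines = [ln.strip() for ln in text.splitlines()]
    counts = Counter(lines)
    seen, kept = set(), []
    for ln in lines:
        if len(ln) < min_line_chars:      # heuristic: drop very short lines
            continue
        if counts[ln] > max_repeats:      # heuristic: drop heavily repeated boilerplate
            continue
        if ln in seen:                    # exact line-level deduplication
            continue
        seen.add(ln)
        kept.append(ln)
    return "\n".join(kept)

doc = (
    "Error: timeout\nError: timeout\nError: timeout\nError: timeout\n"
    "A long informative sentence about the topic.\n"
    "A long informative sentence about the topic."
)
print(clean_document(doc))  # keeps one copy of the informative line
```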
Data Bias Detection
Unknown
No preprocessing or analyses relating to bias could be identified.
Data Deduplication
Deduplication is implemented at the URL, document, and line level.
Data Toxic and Hateful Language Handling
Filters are implemented to remove data from domains ranked as harmful by Meta safety standards and domains known to contain adult content.
IP Handling in Data
Data PII Handling
Filters are implemented to remove data from websites known to contain high volumes of PII.
Data Collection Period
The knowledge cut-off for the pre-training dataset is at the end of 2023.
Evaluation
Performance Evaluation
High Transparency
The performance of the Llama models was evaluated on general knowledge, mathematical reasoning, coding, multilinguality, tool-use, and long-context benchmarks. Llama-3-8B performs similarly to models of the same size, while Llama-3-405B generally outperforms other available open-source models and lags slightly behind GPT-4 and Gemini. The analysis included additional tests for model robustness (e.g. accounting for variability in multiple-choice benchmarks), adversarial benchmarks, and a contamination analysis. Additional quality evaluation was conducted by asking human evaluators to rank outputs from Llama-3-405B and GPT-4o on multiple task types; Llama outperformed GPT-4o on 2 out of 3 task types.
Evaluation of Limitations
Medium Transparency
The safety and helpfulness of the Llama models was measured on an internal benchmark that evaluated violation rates (i.e. the model producing a response to an unsafe prompt) and false refusal rates (i.e. the model refusing to produce a response to a safe prompt). This evaluation was separated out by language and included categories for advanced capabilities like tool usage and long-context question answering. Llama-3 generally performed similarly to competitors and maintained low violation and false refusal rates, showing a balance between helpfulness and safety.
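For clarity, a minimal sketch of how these two metrics can be computed from labeled evaluation records is shown below. The record format is an assumption for illustration and is not Meta's internal benchmark schema.

```python
# Minimal sketch of violation rate and false refusal rate over labeled records.
def safety_rates(records):
    """records: list of dicts with boolean keys 'prompt_is_safe' and 'model_refused'."""
    unsafe = [r for r in records if not r["prompt_is_safe"]]
    safe = [r for r in records if r["prompt_is_safe"]]
    # Violation: the model answered an unsafe prompt instead of refusing.
    violation_rate = sum(not r["model_refused"] for r in unsafe) / max(len(unsafe), 1)
    # False refusal: the model refused a safe prompt.
    false_refusal_rate = sum(r["model_refused"] for r in safe) / max(len(safe), 1)
    return violation_rate, false_refusal_rate

records = [
    {"prompt_is_safe": False, "model_refused": True},
    {"prompt_is_safe": False, "model_refused": False},   # violation
    {"prompt_is_safe": True,  "model_refused": False},
    {"prompt_is_safe": True,  "model_refused": True},    # false refusal
]
print(safety_rates(records))  # (0.5, 0.5)
```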
Evaluation with Public Tools
https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/eval_details.md
Adversarial Testing Procedure
High Transparency
A multi-disciplinary red team measured risks across a 13-topic safety taxonomy. A combination of manual and automated prompting techniques was used across single- and multi-turn conversations to build a comprehensive understanding of vulnerabilities.
Additional adversarial testing measured chemical/biological weapon and cybersecurity risks. For chemical/biological weapon risks, an uplift study was conducted in which individuals' ability to generate fictitious plans for a chemical/biological attack was evaluated in two set-ups: access to the internet vs. access to Llama and the internet. This exercise showed that Llama did not cause a significant uplift in the ability to complete malicious tasks. For cybersecurity, the CyberSecEval benchmark was complemented by a new benchmark for spear phishing and autonomous cyberattacks. The developers found that Llama 3 is not significantly susceptible to generating malicious code or exploiting vulnerabilities. In addition, uplift testing showed that Llama did not significantly aid either expert or novice individuals in a mock cyberattack task, compared to a cohort with access to the internet alone.
Model Mitigations
High Transparency
Several types of mitigations were incorporated into the Llama models:
1. Pre-training data underwent significant filtering to remove unsafe content.
2. The developers trained the model in a manner that reduced memorization (i.e. the model outputting an exact piece of text from the training data) to avoid generating PII. An experiment showed that verbatim memorization was under 1%.
3. During post-training, a dataset of human-created and synthetic examples was used to teach the model which types of inputs should be refused. This dataset was incorporated into both supervised fine-tuning and direct preference optimization.
The report details the effect of these safeguards in terms of violation rates and false refusal rates (when a model incorrectly refuses a safe request) across languages and scenarios, and shows that the Llama models perform competitively and achieve a solid balance between helpfulness and safety.
In addition, the developers released Llama-Guard, a separate classifier model designed to detect unsafe prompts, and show that it significantly decreases unsafe text generations.
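For illustration, a minimal sketch of screening a conversation with a Llama Guard model through Hugging Face transformers is shown below. The model ID and the exact output format ("safe" / "unsafe" plus a category code) are assumptions that should be verified against Meta's documentation.

```python
# Minimal sketch: moderating a conversation with a Llama Guard classifier.
# The model ID is an assumption; access is gated behind Meta's license.
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Meta-Llama-Guard-2-8B"
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(guard_id, device_map="auto")

chat = [
    {"role": "user", "content": "How do I make a fake ID?"},
]
# The Llama Guard tokenizer's chat template wraps the conversation in the
# moderation prompt the classifier was trained on.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
output = guard.generate(input_ids, max_new_tokens=32)
# Expected to print a safety verdict, e.g. "safe" or "unsafe" with a category.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```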