Aggregate metrics
Learn how to use aggregate metrics to evaluate your model
Definition
Aggregate metric tests allow you to define the expected level of model performance for the entire validation set or specific subpopulations. You can use any of the available metrics for the task type you are working on.
To compute most of the aggregate metrics supported, your data must contain ground truths.
For monitoring use cases, if your data is not labeled during publish/stream time, you can update ground truths later on. Check out the Updating data guide for details.
Taxonomy
- Category: Performance.
- Task types: LLM, tabular classification, tabular regression, text classification.
- Availability: development and monitoring.
Why it matters
- Aggregate metrics are a straightforward way to measure model performance.
- Overall aggregate metrics (i.e., computed on the entire validation set or production data) are useful for getting a high-level view of model performance. However, we encourage you to go beyond them and also define tests for specific subpopulations.
- Your model's performance is likely not uniform across different cohorts of the data, as in the image below. A better and more realistic way to ultimately achieve high model performance is to improve the model one slice of data at a time.
Available metrics
The aggregate metrics available for LLM projects are:
| Metric | Description | `measurement` for the `tests.json` |
| --- | --- | --- |
| Answer relevancy* | Measures how relevant the answer (output) is given the question. Based on the Ragas response relevancy. | `answerRelevancy` |
| Answer correctness* | Compares and evaluates the factual accuracy of the generated response with respect to the reference. Based on the Ragas factual correctness. | `answerCorrectness` |
| Context precision* | Measures how relevant the context retrieved is given the question. Based on the Ragas context precision. | `contextRelevancy` |
| Context recall* | Measures the ability of the retriever to retrieve all necessary context for the question. Based on the Ragas context recall. | `contextRecall` |
| Correctness* | Correctness of the answer. Based on the Ragas aspect critique for correctness. | `correctness` |
| Harmfulness* | Harmfulness of the answer. Based on the Ragas aspect critique for harmfulness. | `harmfulness` |
| Coherence* | Coherence of the answer. Based on the Ragas aspect critique for coherence. | `coherence` |
| Conciseness* | Conciseness of the answer. Based on the Ragas aspect critique for conciseness. | `conciseness` |
| Maliciousness* | Maliciousness of the answer. Based on the Ragas aspect critique for maliciousness. | `maliciousness` |
| Faithfulness* | Measures the factual consistency of the generated answer against the given context. Based on the Ragas faithfulness. | `faithfulness` |
| Mean BLEU | Bilingual Evaluation Understudy score. Available with precision from unigram to 4-gram (BLEU-1, 2, 3, and 4). | `meanBleu1`, `meanBleu2`, `meanBleu3`, `meanBleu4` |
| Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | `meanEditDistance` |
| Mean exact match | Assesses if two strings are identical in every aspect. | `meanExactMatch` |
| Mean JSON score | Measures how close the output is to a valid JSON. | `meanJsonScore` |
| Mean quasi-exact match | Assesses if two strings are similar, allowing partial matches and variations. | `meanQuasiExactMatch` |
| Mean semantic similarity | Assesses the similarity in meaning between sentences, by measuring their closeness in semantic space. | `meanSemanticSimilarity` |
| Mean, max, and total number of tokens | Statistics on the number of tokens. | `meanTokens`, `maxTokens`, `totalTokens` |
| Mean, max, and latency percentiles | Statistics on the response latency. | `meanLatency`, `maxLatency`, `p90Latency`, `p95Latency`, `p99Latency` |
\*Metrics marked with an asterisk are based on the Ragas framework and rely on an LLM evaluator judging your submissions. You can configure the underlying LLM used to compute these metrics; check out the OpenAI or Anthropic integration guides for details.
The aggregate metrics available for tabular classification and text classification projects are:
| Metric | Description | `measurement` for the `tests.json` |
| --- | --- | --- |
| Accuracy | The classification accuracy. Defined as the ratio of the number of correctly classified samples to the total number of samples. | `accuracy` |
| Precision per class | The precision score for each class. Given by TP / (TP + FP). | `precisionPerClass` |
| Recall per class | The recall score for each class. Given by TP / (TP + FN). | `recallPerClass` |
| F1 per class | The F1 score for each class. Given by 2 × (Precision × Recall) / (Precision + Recall). | `f1PerClass` |
| Precision | For binary classification, the precision considering class 1 as “positive.” For multiclass classification, the macro-average of the precision score for each class, i.e., treating all classes equally. | `precision` |
| Recall | For binary classification, the recall considering class 1 as “positive.” For multiclass classification, the macro-average of the recall score for each class, i.e., treating all classes equally. | `recall` |
| F1 | For binary classification, the F1 considering class 1 as “positive.” For multiclass classification, the macro-average of the F1 score for each class, i.e., treating all classes equally. | `f1` |
| ROC AUC | The macro-average of the area under the receiver operating characteristic curve score for each class, i.e., treating all classes equally. For multiclass classification tasks, uses the one-versus-one configuration. | `rocAuc` |
| False positive rate | Given by FP / (FP + TN). The false positive rate is only available for binary classification tasks. | `falsePositiveRate` |
| Geometric mean | The geometric mean of the precision and the recall. | `geometricMean` |
| Log loss | Measure of the dissimilarity between predicted probabilities and the true distribution. Also known as cross-entropy loss or binary cross-entropy (in the binary classification case). | `logLoss` |
Where:
- TP: true positive.
- TN: true negative.
- FP: false positive.
- FN: false negative.
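For reference, the per-class scores and their macro-averages described in the table follow the standard definitions, where C is the number of classes and TP_c, FP_c, and FN_c are the counts for class c:

```latex
\begin{align*}
\mathrm{Precision}_c &= \frac{TP_c}{TP_c + FP_c}, &
\mathrm{Recall}_c &= \frac{TP_c}{TP_c + FN_c}, &
\mathrm{F1}_c &= \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}, \\
\mathrm{Precision}_{\mathrm{macro}} &= \frac{1}{C}\sum_{c=1}^{C}\mathrm{Precision}_c, &
\mathrm{Recall}_{\mathrm{macro}} &= \frac{1}{C}\sum_{c=1}^{C}\mathrm{Recall}_c, &
\mathrm{F1}_{\mathrm{macro}} &= \frac{1}{C}\sum_{c=1}^{C}\mathrm{F1}_c.
\end{align*}
```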
The aggregate metrics available for tabular regression projects are:
| Metric | Description | `measurement` for the `tests.json` |
| --- | --- | --- |
| Mean squared error (MSE) | Average of the squared differences between the predicted values and the true values. | `mse` |
| Root mean squared error (RMSE) | The square root of the MSE. | `rmse` |
| Mean absolute error (MAE) | Average of the absolute differences between the predicted values and the true values. | `mae` |
| R-squared | Also known as coefficient of determination. Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. | `r2` |
| Mean absolute percentage error (MAPE) | Average of the absolute percentage differences between the predicted values and the true values. | `mape` |
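For reference, with y_i the true values, ŷ_i the predicted values, ȳ the mean of the true values, and n the number of samples, these metrics follow the standard definitions (MAPE is often reported multiplied by 100 to express it as a percentage):

```latex
\begin{align*}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, &
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}, &
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, &
\mathrm{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|. &&
\end{align*}
```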
Test configuration examples
If you are writing a `tests.json`, here are a few example configurations for aggregate metric tests:
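The exact `tests.json` schema can vary across platform versions, so treat the snippet below as a minimal sketch rather than a definitive reference: only the `measurement` values (taken from the tables above) and the idea of comparing them against a threshold are grounded in this page, while the surrounding keys (`name`, `type`, `subtype`, `thresholds`, `operator`, `value`) are illustrative placeholders.

```json
[
  {
    "name": "Accuracy is at least 80%",
    "type": "performance",
    "subtype": "metricThreshold",
    "thresholds": [
      {
        "measurement": "accuracy",
        "operator": ">=",
        "value": 0.8
      }
    ]
  },
  {
    "name": "Mean semantic similarity above 0.7",
    "type": "performance",
    "subtype": "metricThreshold",
    "thresholds": [
      {
        "measurement": "meanSemanticSimilarity",
        "operator": ">=",
        "value": 0.7
      }
    ]
  }
]
```

The first entry thresholds a classification metric and the second an LLM metric; the same pattern applies to any `measurement` listed above. Restricting a test to a specific subpopulation instead of the whole dataset requires the platform's subpopulation filter syntax, which is not shown here.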