Integrity
Duplicate rows
Learn how to use the duplicate rows test
Definition
The duplicate rows test checks if there are rows that are identical to each other in the dataset.
Taxonomy
- Category: Integrity.
- Task types: LLM, tabular classification, tabular regression, text classification.
- Availability: and .
Why it matters
- Duplicate rows on the training set can lead the model to overfit on the duplicated examples.
- Duplicate rows on the validation set can distort the aggregate metrics, making them overly optimistic or pessimistic.
Test configuration examples
If you are writing a tests.json
, here are a few valid configurations for the character length test:
Was this page helpful?