Definition

The duplicate rows test checks if there are rows that are identical to each other in the dataset.

Taxonomy

  • Category: Integrity.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: and .

Why it matters

  • Duplicate rows on the training set can lead the model to overfit on the duplicated examples.
  • Duplicate rows on the validation set can distort the aggregate metrics, making them overly optimistic or pessimistic.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the character length test:

[
  {
    "name": "No duplicate rows",
    "description": "Asserts that there are no duplicate rows",
    "type": "integrity",
    "subtype": "duplicateRowCount",
    "thresholds": [
      {
        "insightName": "duplicateRowCount",
        "insightParameters": null,
        "measurement": "duplicateRowCount", // Using the absolute row count
        "operator": "<=",
        "value": 0 // integer
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "No duplicate rows",
    "description": "Asserts that there are no duplicate rows",
    "type": "integrity",
    "subtype": "duplicateRowCount",
    "thresholds": [
      {
        "insightName": "duplicateRowCount",
        "insightParameters": null,
        "measurement": "duplicateRowPercentage", // Using the row percetage
        "operator": "<=",
        "value": 0.0 // float, between 0-1
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]