Definition

The duplicate rows test checks whether the dataset contains rows that are identical to each other.
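
To make the definition concrete, here is a minimal Python sketch of how duplicate rows could be counted. It is not Openlayer's implementation. The helper name `duplicate_row_stats` is hypothetical, and it assumes that the first occurrence of a row is not counted as a duplicate, only its later repeats.

```python
from collections import Counter


def duplicate_row_stats(rows):
    """Return (duplicate_count, duplicate_fraction) for a list of rows.

    Assumption: a row is a duplicate when an identical row appeared earlier,
    so the first occurrence of each distinct row is not counted.
    """
    # Convert each row to a tuple so it is hashable and can be counted.
    counts = Counter(tuple(row) for row in rows)
    # Every occurrence beyond the first one is a duplicate.
    duplicate_count = sum(n - 1 for n in counts.values())
    # Guard against an empty dataset to avoid dividing by zero.
    fraction = duplicate_count / len(rows) if rows else 0.0
    return duplicate_count, fraction


rows = [
    ("alice", 34, "yes"),
    ("bob", 27, "no"),
    ("alice", 34, "yes"),  # identical to the first row
    ("carol", 41, "no"),
]
count, fraction = duplicate_row_stats(rows)
print(count, fraction)  # prints: 1 0.25
```

The two returned values are meant to mirror an absolute count and a fraction between 0 and 1. Whether the platform counts the first occurrence as a duplicate is not stated here, so treat that choice as an assumption.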

Taxonomy

  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: and .

Why it matters

  • Duplicate rows in the training set can lead the model to overfit to the duplicated examples.
  • Duplicate rows in the validation set can distort aggregate metrics, making them overly optimistic or pessimistic.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the duplicate rows test:
[
  {
    "name": "No duplicate rows",
    "description": "Asserts that there are no duplicate rows",
    "type": "integrity",
    "subtype": "duplicateRowCount",
    "thresholds": [
      {
        "insightName": "duplicateRowCount",
        "insightParameters": null,
        "measurement": "duplicateRowCount", // Using the absolute row count
        "operator": "<=",
        "value": 0 // integer
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "No duplicate rows",
    "description": "Asserts that there are no duplicate rows",
    "type": "integrity",
    "subtype": "duplicateRowCount",
    "thresholds": [
      {
        "insightName": "duplicateRowCount",
        "insightParameters": null,
        "measurement": "duplicateRowPercentage", // Using the row percentage
        "operator": "<=",
        "value": 0.0 // float, between 0-1
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]
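
One way to read a threshold entry is as "measurement, operator, value": the test passes when the observed measurement satisfies the comparison. The Python sketch below illustrates that reading only. The helper `threshold_passes`, the `OPERATORS` table, and the sample measurements are hypothetical and do not describe how Openlayer evaluates tests internally.

```python
import operator

# Hypothetical mapping from operator strings to comparison functions.
OPERATORS = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
}


def threshold_passes(threshold, measurements):
    """Check one threshold dict against a dict of observed measurements."""
    compare = OPERATORS[threshold["operator"]]
    observed = measurements[threshold["measurement"]]
    return compare(observed, threshold["value"])


# Made-up measurements for a dataset of 100 rows with 2 duplicates.
measurements = {"duplicateRowCount": 2, "duplicateRowPercentage": 0.02}

strict = {"measurement": "duplicateRowPercentage", "operator": "<=", "value": 0.0}
lenient = {"measurement": "duplicateRowPercentage", "operator": "<=", "value": 0.05}

strict_result = threshold_passes(strict, measurements)    # False: 0.02 > 0.0
lenient_result = threshold_passes(lenient, measurements)  # True: 0.02 <= 0.05
```

With a value of 0, as in both configurations above, any duplicate row makes the test fail. A small positive value on the percentage measurement tolerates a limited share of duplicates.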