Definition

The training-validation lekage test allows you to detect training rows that are also present in the validation dataset.

Taxonomy

  • Category: Consistency.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: .

Why it matters

  • The training and validation datasets must be completely disjoint. Otherwise, all evaluation insights extracted from the validation set are unreliable and overly optimistic.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the character length test:

[
  {
    "name": "No training-validation leakage",
    "description": "Asserts that no rows from the validation set are present in the training set",
    "type": "consistency",
    "subtype": "trainValLeakageRowCount",
    "thresholds": [
      {
        "insightName": "trainValLeakageRowCount",
        "insightParameters": null,
        "measurement": "trainValLeakageRowCount",
        "operator": "<=",
        "value": 0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": true,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  }
]