Definition

The character length test lets you set minimum and/or maximum bounds on the number of characters in the values of a column.
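As a rough illustration of what such a bound means for individual entries, the sketch below flags the entries whose character count falls outside the given limits. The function name `rows_outside_bounds` and its parameters are hypothetical and not part of any library; this is only a minimal sketch of the idea, not how the test is implemented.

```python
def rows_outside_bounds(values, min_chars=None, max_chars=None):
    """Return the indices of entries whose character count violates the bounds.

    Hypothetical helper for illustration; either bound may be None to skip it.
    """
    bad = []
    for i, text in enumerate(values):
        n = len(text)  # len() counts Unicode code points, not bytes
        if min_chars is not None and n < min_chars:
            bad.append(i)
        elif max_chars is not None and n > max_chars:
            bad.append(i)
    return bad


outputs = ["ok", "a reasonably sized answer", "x" * 6000]
print(rows_outside_bounds(outputs, min_chars=10, max_chars=5000))  # prints [0, 2]
```

Here the first entry is too short and the third is too long, so both indices are reported.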

Taxonomy

  • Category: Integrity.
  • Task types: LLM, text classification.
  • Availability: and .

Why it matters

  • Extremely long or short text entries might be outliers or noise, such as corrupted data, spam, or irrelevant entries.
  • Models often have limits on the input length they can process effectively. Inputs that exceed the limit may be truncated, potentially losing important information, while very short inputs might not provide enough context for accurate processing. Keeping your data within these limits helps preserve model performance.
  • If a model is trained on data with a certain length distribution, it might not perform well on texts of significantly different lengths.

Test configuration examples

If you are writing a tests.json, here are two valid configurations for the character length test. The // comments are annotations for the reader; strict JSON does not allow comments.

[
  {
    "name": "Maximum character length of 5000",
    "description": "Asserts that the output has at most 5000 characters",
    "type": "integrity",
    "subtype": "characterLength",
    "thresholds": [
      {
        "insightName": "characterLength",
        "insightParameters": [{ "name": "column_name", "value": "output" }], // Count characters in the column `output`
        "measurement": "maxCharacterLength",
        "operator": "<=",
        "value": 5000
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "Minimum character length of 10",
    "description": "Asserts that the output has at least 10 characters",
    "type": "integrity",
    "subtype": "characterLength",
    "thresholds": [
      {
        "insightName": "characterLength",
        "insightParameters": [{ "name": "column_name", "value": "output" }], // Count characters in the column `output`
        "measurement": "minCharacterLength",
        "operator": ">=",
        "value": 10
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": false,
    "usesTrainingDataset": true, // Apply test to the training set
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805e" // Some unique id
  }
]
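To make the threshold fields concrete, the following sketch shows one way a single threshold entry like those above could be evaluated against rows of data. It is an assumption-laden illustration, not the platform's actual implementation: the function `threshold_passes`, the lookup tables, and the sample rows are all hypothetical, and only the measurements and operators that appear in the configurations above are handled.

```python
import operator

# Only the operators and measurements used in the configurations above.
OPERATORS = {"<=": operator.le, ">=": operator.ge}
MEASUREMENTS = {"maxCharacterLength": max, "minCharacterLength": min}


def threshold_passes(threshold, rows):
    """Evaluate one threshold dict against a list of row dicts (hypothetical helper).

    Raises ValueError if `rows` is empty, since min()/max() need at least one value.
    """
    # Find which column the threshold applies to.
    column = next(
        p["value"] for p in threshold["insightParameters"] if p["name"] == "column_name"
    )
    lengths = [len(row[column]) for row in rows]
    measured = MEASUREMENTS[threshold["measurement"]](lengths)
    return OPERATORS[threshold["operator"]](measured, threshold["value"])


# Threshold copied from the first configuration, without the annotations.
max_threshold = {
    "insightName": "characterLength",
    "insightParameters": [{"name": "column_name", "value": "output"}],
    "measurement": "maxCharacterLength",
    "operator": "<=",
    "value": 5000,
}

rows = [{"output": "short answer"}, {"output": "y" * 5001}]
print(threshold_passes(max_threshold, rows))  # prints False: one output has 5001 characters
```

The threshold is written here as a Python dict rather than parsed from the annotated listing, because strict JSON parsers such as Python's `json` module reject `//` comments.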