Definition

Let A be a column in a dataset containing strings. Let B be a column in a dataset containing lists of strings.

The column contains string test asserts that the list of strings in B contains the string in A on a per-row basis.

For example:

A      B                  Result
“a”    [“a”, “b”, “c”]    ✓ Passed
“b”    [“a”, “b”, “c”]    ✓ Passed
“c”    [“a”, “b”, “c”]    ✓ Passed
“d”    [“a”, “b”, “c”]    ✗ Failed

Since “d” is not in the list [“a”, “b”, “c”], the test fails.
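
The per-row check can be sketched in plain Python. This is only an illustration of the semantics described above, not the platform's implementation: the function name `column_contains_string` and the representation of rows as a list of dicts are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def column_contains_string(
    rows: Sequence[Dict[str, object]], column_a: str, column_b: str
) -> List[bool]:
    """Return one boolean per row.

    A row passes when the string in `column_a` is an element of the
    list of strings in `column_b` (exact, case-sensitive match).
    """
    return [row[column_a] in row[column_b] for row in rows]


# Same data as the example above.
rows = [
    {"A": "a", "B": ["a", "b", "c"]},
    {"A": "b", "B": ["a", "b", "c"]},
    {"A": "c", "B": ["a", "b", "c"]},
    {"A": "d", "B": ["a", "b", "c"]},
]

results = column_contains_string(rows, "A", "B")
for row, passed in zip(rows, results):
    print(row["A"], row["B"], "Passed" if passed else "Failed")
```

The sketch uses exact list membership. Whether the actual test normalizes case or whitespace before comparing is not stated here, so treat that detail as an assumption.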

Taxonomy

  • Category: Integrity.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: and .

Why it matters

  • This test is especially useful for RAG LLM projects, where the context retriever returns a list of the top K contexts for each query. The column contains string test can be used to ensure that the retrieved list includes the correct context for every row.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the column contains string test:

[
  {
    "name": "Values in 'correct_context' should be in 'top_k_contexts' for every row",
    "description": "Asserts that the list of strings in 'top_k_contexts' contains the string in 'correct_context' on a per-row basis.",
    "type": "integrity",
    "subtype": "expectColumnAToBeInColumnB",
    "thresholds": [
      {
        "insightName": "expectColumnAToBeInColumnB",
        "insightParameters": [
          {
            "name": "column_a_name",
            "value": "correct_context" // Selects column A (`correct_context`)
          },
          {
            "name": "column_b_name",
            "value": "top_k_contexts" // Selects column B (`top_k_contexts`)
          }
        ],
        "measurement": "failingRowCount",  // Use the absolute row count
        "operator": "<=",
        "value": 0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "Values in 'correct_context' should be in 'top_k_contexts' for more than 80% of the rows",
    "description": "Asserts that the list of strings in 'top_k_contexts' contains the string in 'correct_context' on a per-row basis.",
    "type": "integrity",
    "subtype": "expectColumnAToBeInColumnB",
    "thresholds": [
      {
        "insightName": "expectColumnAToBeInColumnB",
        "insightParameters": [
          {
            "name": "column_a_name",
            "value": "correct_context" // Selects column A (`correct_context`)
          },
          {
            "name": "column_b_name",
            "value": "top_k_contexts" // Selects column B (`top_k_contexts`)
          }
        ],
        "measurement": "failingRowPercentage", // Use the row percentage
        "operator": "<",
        "value": 0.2
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]
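
To make the two thresholds concrete, the following Python sketch computes a failing row count and a failing row fraction from per-row results and compares them against the configured operator and value. It assumes that `failingRowPercentage` is expressed as a fraction between 0 and 1, as the value 0.2 in the second configuration suggests. The helper names are hypothetical and this is not the platform's own code.

```python
import operator
from typing import Callable, Dict, List

# Only the two operators that appear in the configurations above.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    "<=": operator.le,
}


def failing_row_count(row_results: List[bool]) -> int:
    """Number of rows where the per-row check failed."""
    return sum(1 for passed in row_results if not passed)


def failing_row_percentage(row_results: List[bool]) -> float:
    """Fraction of failing rows, in [0, 1] (assumed scale)."""
    if not row_results:
        return 0.0
    return failing_row_count(row_results) / len(row_results)


def threshold_met(measured: float, op: str, value: float) -> bool:
    """Evaluate `measured <op> value`, e.g. failingRowCount <= 0."""
    return OPERATORS[op](measured, value)


# Hypothetical per-row results: 9 passing rows and 1 failing row.
row_results = [True] * 9 + [False]

# First configuration: failingRowCount <= 0
strict = threshold_met(failing_row_count(row_results), "<=", 0)

# Second configuration: failingRowPercentage < 0.2
lenient = threshold_met(failing_row_percentage(row_results), "<", 0.2)

print("every-row test passes:", strict)
print("80% test passes:", lenient)
```

With one failing row out of ten, the first configuration fails because it tolerates no failing rows, while the second passes because fewer than 20% of the rows fail.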