Ill-formed rows
Definition
A text row is considered ill-formed if it contains more non-alphabetical characters than alphabetical ones. The ill-formed rows test lets you set a threshold on the number of ill-formed rows in a dataset.
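The definition above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the function names (`is_ill_formed`, `count_ill_formed`, `passes_threshold`) and the `max_ill_formed` parameter are hypothetical. It also assumes that every character for which `str.isalpha()` is false (digits, punctuation, whitespace) counts as non-alphabetical, and that the threshold is an upper bound on the count.

```python
def is_ill_formed(text: str) -> bool:
    """Return True if text has more non-alphabetical than alphabetical characters.

    Assumption: anything for which str.isalpha() is False (digits,
    punctuation, whitespace) counts as non-alphabetical.
    """
    alpha = sum(ch.isalpha() for ch in text)
    non_alpha = len(text) - alpha
    return non_alpha > alpha


def count_ill_formed(rows: list[str]) -> int:
    """Count the ill-formed rows in a list of text rows."""
    return sum(is_ill_formed(row) for row in rows)


def passes_threshold(rows: list[str], max_ill_formed: int = 0) -> bool:
    """Hypothetical check: pass if at most max_ill_formed rows are ill-formed."""
    return count_ill_formed(rows) <= max_ill_formed


rows = [
    "The quick brown fox",  # 16 letters, 3 spaces: well-formed
    "#### 1234 ---- !!!!",  # no letters at all: ill-formed
    "id=42;;;",             # 2 letters, 6 other characters: ill-formed
]

print(count_ill_formed(rows))     # number of ill-formed rows
print(passes_threshold(rows, 1))  # whether the count is within the threshold
```

Under these assumptions an empty row is not ill-formed, because it has no more non-alphabetical characters than alphabetical ones. A real check might treat empty rows or whitespace differently.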
Taxonomy
- Category: Integrity.
- Task types: LLM, text classification.
- Availability: and .
Why it matters
- Ill-formed rows can be a sign of data quality issues.
- Understanding the extent of ill-formed data helps in designing models that are robust to such anomalies. If your model is expected to encounter similar data in production, you might want to train it with some level of noise tolerance.
Test configuration examples
If you are writing a tests.json, here are a few valid configurations for the ill-formed rows test: