Definition
The session guideline adherence test evaluates whether the assistant followed one or more customer-supplied behavioural guidelines across a conversation. You supply each guideline as an object with acriteria rule (e.g., “always refer to the user
by their first name”) and a scoring mode ("Yes or No" or "0-1"). An
LLM-as-a-judge evaluates the full session against each guideline independently and
reports per-guideline scores plus aggregate adherence metrics.
This is the session-level metric most suited to codifying product-specific behaviour
that generic catalog tests don’t cover.
Taxonomy
- Task types: LLM.
- Availability: and .
- Evaluation level: session.
- Polarity: higher score = better (guideline followed).
Why it matters
- Almost every product has a handful of custom behavioural rules that don’t map to the generic catalog — this metric is the knob for those.
- Because the guideline is customer-supplied, you can track behaviour unique to your product, brand voice, or regulatory context.
- Multi-guideline support lets you bundle a policy (brand voice + tone + forbidden topics) into a single insight and see aggregate adherence across a session.
Required columns
- Input: The user’s message in each turn.
- Output: The assistant’s response in each turn.
- Session ID: Groups turns belonging to the same conversation.
- Timestamp: Used to reconstruct turn order within a session.
Insight parameters
criteria_list(list of objects, required): One or more guidelines to evaluate. Each object contains:criteria(string): The guideline text in plain English.scoring(string): Either"Yes or No"(binary judgment — adherent or not) or"0-1"(a continuous 0–1 float where1= perfect adherence).
All guidelines in a single
criteria_list should use the same scoring mode
— the default adherence threshold is derived from the first entry’s mode.Default adherence threshold
A session is counted as “adherent” when its aggregate score is at or above:0.8whenscoringis"0-1"0.5whenscoringis"Yes or No"(Yes is encoded as1.0, No as0.0)
adherentSessionsPercent and violationRate measurements
below.
Available measurements
| Measurement | What it means |
|---|---|
meanGuidelineAdherence | Average adherence score across sessions in the window |
medianGuidelineAdherence | Median adherence score across sessions |
stdGuidelineAdherence | Standard deviation of per-session adherence scores |
minGuidelineAdherence | Lowest per-session adherence score |
maxGuidelineAdherence | Highest per-session adherence score |
adherentSessionsPercent | % of sessions meeting the default adherence threshold |
violationRate | % of sessions below the default adherence threshold |
totalEvaluatedSessions | Count of sessions the judge evaluated |
erroredSessionsCount | Count of sessions that errored during evaluation |
skippedSessionsCount | Count of sessions skipped (e.g., insufficient turns) |
Test configuration examples
Related
- Session role adherence — related signal focused on staying in a defined role.

