EVALUATION RESULTS

DeepEval Test Suite

67 tests across 3 LLM skills and naive vs HyQ RAG comparison. Claude via TritonAI available as the judge LLM.

Tests passing

Skills evaluated

8/8

HyQ retrievals correct

Loading report...

How it works

DeepEval

An open-source Python framework for testing LLM outputs. Works like pytest but supports LLM-as-judge grading for subjective outputs.

docs.deepeval.com

Claude as Judge

We use Claude to grade Claude's outputs. The judge receives the input, the actual output, and a written rubric. Returns pass/fail with reasoning. LLM-as-judge correlates well with human grading on free-form outputs.

Test Categories

20
Compliance Pre Check
Verifies MAP violations, missing disclaimers, and off-brand copy are caught
14
Approval Brief Generator
Verifies 5-field structure, grounded ROI, and risk alignment
9
Revision Router
Verifies classification accuracy, owner assignment, and urgency
24
RAG Comparison
Naive vs HyQ on 8 test queries, retrieval quality metrics

Sample Test

Metrics Used

GEvalCustom criteria with LLM judge
AnswerRelevancyMetricDid the output address the question
FaithfulnessMetricIs the output grounded in source documents
ContextualRelevancyMetricDid RAG retrieve the right chunks

Run Locally

uv run pytest evals/ -v

Takes 4 to 5 minutes. Results displayed on this page.