Compare RAG quality across models
Each test question runs through the full RAG pipeline (retrieve, then generate) with three different models. Claude Sonnet 4.6 then acts as judge and scores each answer on five dimensions: correct (no factual errors), complete (covers the required points), cites source, no hallucination, and formatting (valid markdown).
Expect roughly 30-60 seconds for all 48 runs.
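
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function names (`evaluate`, `pass_rates`), the `RunResult` type, the stub retrieve/generate/judge callables, and the placeholder model names are all hypothetical, and it assumes the judge returns a pass/fail verdict per dimension. In a real run the stubs would be replaced by the retriever, the model calls, and a judge call to Claude Sonnet 4.6.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five judge dimensions named in the text (identifiers are assumptions).
DIMENSIONS = ["correct", "complete", "cites_source", "no_hallucination", "formatting"]


@dataclass
class RunResult:
    model: str
    question: str
    scores: Dict[str, bool]  # assumed pass/fail verdict per dimension


def evaluate(
    questions: List[str],
    models: List[str],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, str, List[str]], str],
    judge: Callable[[str, List[str], str], Dict[str, bool]],
) -> List[RunResult]:
    """Run every question through retrieve -> generate for each model, then judge."""
    results = []
    for model in models:
        for question in questions:
            context = retrieve(question)                  # retrieval step
            answer = generate(model, question, context)   # generation step
            scores = judge(question, context, answer)     # judge scores the answer
            results.append(RunResult(model, question, scores))
    return results


def pass_rates(results: List[RunResult]) -> Dict[str, Dict[str, float]]:
    """Fraction of runs passing each dimension, grouped by model."""
    rates = {}
    for model in sorted({r.model for r in results}):
        runs = [r for r in results if r.model == model]
        rates[model] = {
            d: sum(r.scores[d] for r in runs) / len(runs) for d in DIMENSIONS
        }
    return rates


# Hypothetical stand-ins so the sketch runs without any API access.
def fake_retrieve(question: str) -> List[str]:
    return [f"doc about {question}"]


def fake_generate(model: str, question: str, context: List[str]) -> str:
    return f"[{model}] answer to {question} (source: {context[0]})"


def fake_judge(question: str, context: List[str], answer: str) -> Dict[str, bool]:
    scores = {d: True for d in DIMENSIONS}
    scores["cites_source"] = "source:" in answer
    return scores


if __name__ == "__main__":
    demo = evaluate(
        ["What is RAG?", "How is retrieval scored?"],
        ["model-a", "model-b", "model-c"],  # placeholder model names
        fake_retrieve,
        fake_generate,
        fake_judge,
    )
    for model, rates in pass_rates(demo).items():
        print(model, rates)
```

Passing the retrieve, generate, and judge steps in as callables keeps the loop independent of any particular client library, so the same harness can be exercised with stubs before spending API calls on the full set of runs.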