Clever Dev Docs: Library, Secure Sync, SSO, and APIs

RAG Evaluation

16 test questions × 3 models = 48 runs

Compare RAG quality across models

Each test question is run through the full RAG pipeline (retrieve, then generate) once per model, for 3 different models. Claude Sonnet 4.6, acting as judge, then scores each answer on five dimensions: correct (no factual errors), complete (covers the required points), cites source, no hallucination, and formatting (valid markdown).
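The loop described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the `retrieve`, `generate`, and `judge` callables, the `RunResult` type, and the dimension keys are all hypothetical names, and the sketch assumes each dimension is scored pass/fail, which the page does not state.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import product
from typing import Callable

# Hypothetical keys for the five judge dimensions named in the text.
DIMENSIONS = ("correct", "complete", "cites_source", "no_hallucination", "formatting")


@dataclass
class RunResult:
    question: str
    model: str
    answer: str
    scores: dict[str, bool]  # assumption: pass/fail per dimension


def evaluate(
    questions: list[str],
    models: list[str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, str, list[str]], str],
    judge: Callable[[str, list[str], str], dict[str, bool]],
) -> list[RunResult]:
    """Run every question through retrieve -> generate for every model, then judge it."""
    results = []
    for question, model in product(questions, models):
        context = retrieve(question)                 # retrieval step
        answer = generate(model, question, context)  # generation step
        scores = judge(question, context, answer)    # judge model scores the answer
        missing = set(DIMENSIONS) - set(scores)
        if missing:
            raise ValueError(f"judge omitted dimensions: {sorted(missing)}")
        results.append(RunResult(question, model, answer, scores))
    return results


def pass_rates(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Fraction of runs passing each dimension, grouped by model."""
    passed = defaultdict(lambda: defaultdict(int))
    runs = defaultdict(int)
    for r in results:
        runs[r.model] += 1
        for dim in DIMENSIONS:
            passed[r.model][dim] += int(r.scores[dim])
    return {m: {d: passed[m][d] / runs[m] for d in DIMENSIONS} for m in runs}
```

With 16 questions and 3 models, `evaluate` produces 48 `RunResult` entries, and `pass_rates` gives one row per model for a side-by-side comparison. Retrieval is repeated per model here for simplicity; if retrieval does not depend on the model, caching the context per question would avoid the repeated work.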

Running all 48 takes about 30-60 seconds.