Evaluation framework for testing LLM knowledge inputs, including prompts, RAG corpora, and agent workflows. Provides bootstrap confidence intervals and Krippendorff's alpha for statistically rigorous evaluation, aimed at researchers and engineers.

What it does

oh-my-knowledge is an evaluation framework designed to systematically assess and improve LLM knowledge inputs. It lets you hold the model fixed while varying the artifacts under evaluation: prompts, RAG corpora, skills, and agent workflows. Built-in statistical methods help keep evaluation results reliable and reproducible.
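
The idea of holding the model fixed while varying the artifact can be sketched in a few lines of Python. This is an illustration only, not the oh-my-knowledge API: the names evaluate_artifacts, toy_model, and exact_match are hypothetical, and the "model" is a stand-in function rather than a real LLM call.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types for illustration; they are not part of oh-my-knowledge.
Model = Callable[[str], str]          # prompt text -> model output
Scorer = Callable[[str, str], float]  # (output, expected) -> score


def evaluate_artifacts(
    model: Model,
    artifacts: Dict[str, str],
    cases: List[Dict[str, str]],
    scorer: Scorer,
) -> Dict[str, List[float]]:
    """Score every artifact on the same cases using the same model."""
    results: Dict[str, List[float]] = {}
    for name, template in artifacts.items():
        scores = []
        for case in cases:
            prompt = template.format(question=case["question"])
            scores.append(scorer(model(prompt), case["expected"]))
        results[name] = scores
    return results


def toy_model(prompt: str) -> str:
    # Stand-in for a real LLM call: it answers tersely only when asked to.
    return "4" if "one word" in prompt else "The answer is probably four."


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected else 0.0


# Two prompt artifacts compared against one fixed model and one fixed test set.
artifacts = {
    "baseline": "Q: {question}",
    "concise": "Answer in one word. Q: {question}",
}
cases = [{"question": "What is 2 + 2?", "expected": "4"}]

scores = evaluate_artifacts(toy_model, artifacts, cases, exact_match)
summary = {name: mean(vals) for name, vals in scores.items()}
print(summary)
```

Because the model and the test cases are the same for both runs, any difference in the summary can be attributed to the artifact. The same loop shape would apply to RAG corpora or workflows, with the template replaced by whatever builds the model input.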

Key Features

Bootstrap confidence intervals for statistical significance testing
Krippendorff's alpha for inter-rater reliability measurement
Length-debiasing to control for artifact length confounds
Saturation curves to identify evaluation completeness
Multi-judge ensemble support for robust assessments
Evaluation-as-code approach for reproducibility
Support for various LLM knowledge components (prompts, RAG, skills, workflows)
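
To make the first feature concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean score difference between two artifacts evaluated on the same cases. It uses only the Python standard library and is an assumption about how such an interval could be computed, not the framework's own implementation; the function name bootstrap_ci_paired, its parameters, and the scores are made up for illustration.

```python
import random
from statistics import mean


def bootstrap_ci_paired(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a) over paired cases.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    # Work on per-case differences so the pairing between artifacts is kept.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)

    # Resample cases with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    # Take the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(diffs), resampled_means[lo_idx], resampled_means[hi_idx]


# Illustrative scores only (not real results): one score per test case.
prompt_v1 = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62, 0.58, 0.66]
prompt_v2 = [0.68, 0.61, 0.74, 0.72, 0.59, 0.70, 0.63, 0.71]

point, lower, upper = bootstrap_ci_paired(prompt_v1, prompt_v2)
print(f"mean difference = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference between the two artifacts is unlikely to be resampling noise at the chosen level. Resampling the per-case differences rather than each score list independently is a deliberate choice here: both artifacts are scored on the same cases, so the pairing carries information that should not be discarded.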

How to set up

Clone the repository from GitHub and install dependencies. The tool is designed as a Python-based framework that integrates with Claude and other LLMs. Detailed setup instructions are available in the project documentation. Users can then define their evaluation scenarios and run statistical analyses on their LLM artifacts.
