CLI tool for benchmarking and evaluating AI coding agents like Claude, Codex, and Gemini using your own API keys. Run evals on tasks you understand with support for multiple LLM providers.

What it does

nasde-toolkit is a command-line tool for benchmarking and evaluating AI coding agents across multiple providers. Developers can compare how Claude, Codex, Gemini, and other LLM-based agents perform on custom coding tasks, using their own subscriptions or API keys.

How to set up

Install nasde-toolkit via pip or clone it from GitHub. Set the API keys for whichever of Claude, Codex, and Gemini you want to evaluate as environment variables. Then define your benchmark tasks and run the evaluation commands through the CLI to generate performance reports and comparisons.
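
Before a run, it can help to confirm which keys are actually set. The following is a minimal sketch, not part of nasde-toolkit: the variable names are assumptions based on the names the providers' own SDKs conventionally read, and the helper configured_agents is hypothetical. Check the project's documentation for the names it actually expects.

```python
import os

# ASSUMPTION: these are the conventional variable names used by the
# providers' SDKs; nasde-toolkit may expect different names.
ASSUMED_KEY_VARS = {
    "Claude": "ANTHROPIC_API_KEY",
    "Codex": "OPENAI_API_KEY",
    "Gemini": "GEMINI_API_KEY",
}


def configured_agents(env=None):
    """Return the agents whose assumed key variable is set and non-blank."""
    env = os.environ if env is None else env
    return [
        agent
        for agent, var in ASSUMED_KEY_VARS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    ready = configured_agents()
    missing = [a for a in ASSUMED_KEY_VARS if a not in ready]
    print("Keys found for:", ", ".join(ready) or "none")
    print("Keys missing for:", ", ".join(missing) or "none")
```

Passing a plain dictionary instead of the real environment makes the check easy to try without touching your shell configuration.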

Key features

Support for multiple AI coding agents (Claude, Codex, Gemini)
Customizable benchmark tasks aligned with your workflow
LLM-as-a-Judge evaluation methodology
Sandbox environment for safe code execution
Integration with Model Context Protocol (MCP)
Detailed evaluation metrics and performance analytics
Python-based extensible framework
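
To illustrate the LLM-as-a-Judge idea in general terms: a second model is given the task, the agent's solution, and a rubric, and is asked to return a score. The sketch below shows that pattern only. It is not nasde-toolkit's implementation; the prompt wording, the 1 to 10 scale, the "SCORE:" reply format, and every function name here are assumptions made for illustration, and the judge model is replaced by a stub so the example runs offline.

```python
import re
from typing import Callable, List, Optional


def build_judge_prompt(task: str, solution: str, criteria: List[str]) -> str:
    """Assemble a rubric prompt for the judge model (hypothetical wording)."""
    rubric = "\n".join(f"- {c}" for c in criteria)
    return (
        f"Task:\n{task}\n\n"
        f"Candidate solution:\n{solution}\n\n"
        f"Score the solution from 1 to 10 against these criteria:\n{rubric}\n\n"
        "Reply with a line of the form 'SCORE: <integer>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the score; return None if it is absent or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def judge(task: str, solution: str, criteria: List[str],
          ask_model: Callable[[str], str]) -> Optional[int]:
    """Send the rubric prompt to a judge model and parse its score."""
    return parse_score(ask_model(build_judge_prompt(task, solution, criteria)))


def fake_model(prompt: str) -> str:
    # Stub standing in for a real judge model call.
    return "The solution handles the main case.\nSCORE: 8"


if __name__ == "__main__":
    result = judge(
        "Write a function that reverses a string.",
        "def rev(s): return s[::-1]",
        ["correctness", "readability"],
        fake_model,
    )
    print("Judge score:", result)
```

Keeping the model call behind a plain callable means the same scoring logic can be exercised with a stub in tests and with a real provider client in an actual evaluation.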
