Synthetic data at industrial scale.

KothaSet fuels the next generation of small language models (SLMs). Use state-of-the-art teacher models to generate high-quality, validated datasets for fine-tuning.

npm install -g kothaset


Pipeline: Source (GPT-4o) -> Engine (KothaSet) -> Output (.jsonl)

Engineered for Reliability

Generating millions of tokens requires more than a simple script. KothaSet is built to handle the chaos of large-scale API consumption.

High-Concurrency Engine

Built on Go's goroutines, the parallel worker pool maximizes throughput while strictly respecting API rate limits. Generate datasets 10x faster than standard Python scripts.
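To make the idea concrete, here is a minimal sketch of a rate-limited worker pool in Go. It is not KothaSet's actual engine: the names `runPool` and `fakeTeacher` are hypothetical, and a shared `time.Ticker` stands in for whatever limiter the tool really uses. With the 500 requests per minute shown in the sample config, the interval would be 120 ms.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runPool processes jobs with `workers` goroutines. Every job first waits
// for a tick from a shared ticker, so the pool as a whole starts at most
// one call per `interval`, no matter how many workers are running.
// (Hypothetical helper, not part of KothaSet's API.)
func runPool(jobs []string, workers int, interval time.Duration, call func(string) string) []string {
	results := make([]string, len(jobs))
	indexes := make(chan int)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range indexes {
				<-ticker.C // wait for a rate-limit slot
				results[i] = call(jobs[i])
			}
		}()
	}
	for i := range jobs {
		indexes <- i
	}
	close(indexes)
	wg.Wait()
	return results
}

func main() {
	prompts := []string{"topic A", "topic B", "topic C"}
	// fakeTeacher stands in for a real teacher-model API call.
	fakeTeacher := func(p string) string { return "sample for " + p }
	for _, out := range runPool(prompts, 2, 10*time.Millisecond, fakeTeacher) {
		fmt.Println(out)
	}
}
```

Results are written by index, so the output order matches the input order even though workers finish at different times.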

Atomic Checkpointing

Never lose a token. KothaSet writes to disk atomically. Resume interrupted runs exactly where they left off.

Run interrupted at 85%...

$ kothaset generate --resume

Resuming from ID #8501
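One common way to get this behavior is to write the checkpoint to a temporary file and then rename it over the real one. The sketch below illustrates that pattern under stated assumptions: the `Checkpoint` struct, its `next_id` field, and the file name are hypothetical, not KothaSet's real on-disk format. The rename is atomic on POSIX filesystems; other platforms may offer weaker guarantees.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Checkpoint is a hypothetical progress record.
type Checkpoint struct {
	NextID int `json:"next_id"`
}

// saveCheckpoint writes to a temp file in the target's directory, flushes
// it to disk, and renames it into place, so readers see either the old
// checkpoint or the new one, never a half-written file.
func saveCheckpoint(path string, cp Checkpoint) error {
	data, err := json.Marshal(cp)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".checkpoint-*")
	if err != nil {
		return err
	}
	// Cleans up on failure; fails harmlessly once the file has been renamed.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadCheckpoint reads the last saved checkpoint back from disk.
func loadCheckpoint(path string) (Checkpoint, error) {
	var cp Checkpoint
	data, err := os.ReadFile(path)
	if err != nil {
		return cp, err
	}
	err = json.Unmarshal(data, &cp)
	return cp, err
}

func main() {
	path := filepath.Join(os.TempDir(), "kothaset-demo-checkpoint.json")
	if err := saveCheckpoint(path, Checkpoint{NextID: 8501}); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	cp, err := loadCheckpoint(path)
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Printf("Resuming from ID #%d\n", cp.NextID)
}
```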

Provider Agnostic

Native support for OpenAI, DeepSeek, vLLM, and Ollama.

Strict Schemas

Built-in validation for Instruction, Chat, and Preference datasets.
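As an illustration of what schema validation for an Instruction dataset could look like, the sketch below checks a single JSONL line. The field names (`instruction`, `input`, `output`) follow a common instruction-tuning layout and are an assumption here, as is the `validateInstructionLine` helper; KothaSet's built-in schemas may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// InstructionRecord uses assumed field names, not KothaSet's confirmed schema.
type InstructionRecord struct {
	Instruction string `json:"instruction"`
	Input       string `json:"input"`
	Output      string `json:"output"`
}

// validateInstructionLine decodes one JSONL line, rejects unknown fields,
// and requires non-empty instruction and output values.
func validateInstructionLine(line string) (InstructionRecord, error) {
	var rec InstructionRecord
	dec := json.NewDecoder(strings.NewReader(line))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&rec); err != nil {
		return rec, fmt.Errorf("decode record: %w", err)
	}
	if strings.TrimSpace(rec.Instruction) == "" {
		return rec, errors.New("instruction must not be empty")
	}
	if strings.TrimSpace(rec.Output) == "" {
		return rec, errors.New("output must not be empty")
	}
	return rec, nil
}

func main() {
	lines := []string{
		`{"instruction":"Translate to French","input":"hello","output":"bonjour"}`,
		`{"instruction":"","output":"missing instruction"}`,
	}
	for i, line := range lines {
		if _, err := validateInstructionLine(line); err != nil {
			fmt.Printf("line %d rejected: %v\n", i+1, err)
			continue
		}
		fmt.Printf("line %d ok\n", i+1)
	}
}
```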

Define your data.
Let KothaSet do the rest.

Configure

Set up your teacher model (e.g., GPT-4o) and output schema in a simple YAML file.

Seed

Provide a list of topics or seed prompts to ensure diversity and coverage across your domain.

Generate

Run the CLI. KothaSet handles retries, rate limits, and validation automatically.
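For readers curious what automatic retries involve, here is a minimal sketch of retry with exponential backoff in Go. It is an illustration only: `withRetry`, `errRateLimited`, and the delays are hypothetical, and KothaSet's own retry policy is not documented here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited is a stand-in for a transient API error such as HTTP 429.
var errRateLimited = errors.New("rate limited")

// withRetry calls fn up to maxAttempts times, sleeping between failed
// attempts and doubling the delay each time.
func withRetry(maxAttempts int, baseDelay time.Duration, fn func() (string, error)) (string, error) {
	var lastErr error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		out, err := fn()
		if err == nil {
			return out, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return "", fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// flaky simulates an API that fails twice before succeeding.
	flaky := func() (string, error) {
		calls++
		if calls < 3 {
			return "", errRateLimited
		}
		return "generated sample", nil
	}
	out, err := withRetry(5, 10*time.Millisecond, flaky)
	fmt.Println(out, err, "calls:", calls)
}
```

A production version would usually add random jitter to the delay and retry only on errors known to be transient.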

.kothaset.yaml

version: "1.0"
providers:
  - name: main-teacher
    type: openai
    model: gpt-4o
    rate_limit:
      requests_per_minute: 500
global:
  concurrency: 20
  default_schema: instruction

Ready to build your dataset?

Open source, free to use, and ready for your next fine-tuning project.

