Synthetic data at industrial scale.

Use GPT-5.2, DeepSeek 3.2, Gemini 3, or any OpenAI-compatible model to generate training data. Build instruction pairs, multi-turn chats, or preference datasets—all validated and ready for fine-tuning.

$ pip install kothaset

Source (GPT-5.2) → Engine (KothaSet) → Output (.jsonl)

Built to Run All Night

Generating 100k samples means tens of thousands of API calls. KothaSet handles rate limits, retries, and failures so you don't have to babysit a script.

High-Concurrency Engine

Runs parallel requests using Go's goroutines. Set your concurrency level and rate limits—KothaSet keeps you under the API quota.
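Both knobs live in .kothaset.yaml (the full file appears further down). The relevant keys, lifted from that example:

providers:
  - name: main-teacher
    rate_limit:
      requests_per_minute: 500
global:
  concurrency: 20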

Atomic Checkpointing

Progress saves to disk after every batch. If your run crashes at 85%, rerun it with --resume and it picks up exactly where it stopped.

Run interrupted at 85%...
$ kothaset generate --resume -i topics.txt
Resuming from ID #8501

Provider Agnostic

Works with OpenAI, DeepSeek, vLLM, Ollama—any OpenAI-compatible API.
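Pointing at a local server is just one more provider entry. A sketch, assuming the provider block accepts a base_url key (the exact field name may differ):

providers:
  - name: local-teacher
    type: openai                        # any OpenAI-compatible server
    model: llama3                       # hypothetical local model name
    base_url: http://localhost:11434/v1 # assumed key; Ollama's default OpenAI-compatible endpoint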

Strict Schemas

Built-in validation for four dataset types. Every sample is checked before it's written.

Four Training Formats, One Command

Generate datasets in the formats you need. Each schema includes built-in validation and output formatting. Sample records for all four follow the list below.

Instruction

Alpaca-style

Instruction, input, and output pairs for supervised fine-tuning (SFT).

Chat

ShareGPT format

Multi-turn conversations between human and assistant for dialogue models.

Preference

DPO / RLHF

Chosen and rejected response pairs for preference learning.

Classification

Text + Label

Labeled text samples for training classifiers.
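On disk, every format is one JSON object per line. The records below follow the public conventions each card names (Alpaca, ShareGPT, DPO pairs, labeled text); KothaSet's exact field names may differ.

Instruction (Alpaca-style):
{"instruction": "Summarize the passage.", "input": "Goroutines are lightweight threads managed by the Go runtime.", "output": "Go's runtime multiplexes many cheap goroutines onto a few OS threads."}

Chat (ShareGPT):
{"conversations": [{"from": "human", "value": "What is a goroutine?"}, {"from": "gpt", "value": "A lightweight thread scheduled by the Go runtime."}]}

Preference (DPO):
{"prompt": "Explain rate limiting.", "chosen": "Rate limiting caps how many requests a client can send in a time window, protecting the server from overload.", "rejected": "It makes requests slower."}

Classification (Text + Label):
{"text": "The checkpointing saved my 12-hour run.", "label": "positive"}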

Define your data.
Let KothaSet do the rest.

1

Configure

Set up your teacher model (e.g., GPT-5.2) and output schema in a simple YAML file.

2

Seed

Provide a list of topics to spread your dataset across different subjects.

3

Generate

Run the CLI. KothaSet handles retries, rate limits, and validation automatically.

.kothaset.yaml
version: "1.0" providers: - name: main-teacher type: openai model: gpt-5.2 rate_limit: requests_per_minute: 500 global: concurrency: 20 default_schema: instruction

Ready to build your dataset?

Free and open source. Install it, run it, train your model.