Built to Run All Night
Generating 100k samples means thousands of API calls. KothaSet handles rate limits, retries, and failures so you don't have to babysit a script.
High-Concurrency Engine
Runs parallel requests using Go's goroutines. Set your concurrency level and rate limits, and KothaSet keeps you under the API quota.
Atomic Checkpointing
Progress saves to disk after every batch. If your run crashes at 85%, just run it again and it picks up where it stopped.
Provider Agnostic
Works with OpenAI, DeepSeek, vLLM, Ollama, or any other OpenAI-compatible API.
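"OpenAI-compatible" in practice means the same chat completions request works against a different base URL. The sketch below only builds the request and sends nothing; `newChatRequest` is a hypothetical helper, the model name is a placeholder, and the base URLs are commonly used defaults that your deployment may not share.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// newChatRequest (hypothetical helper) builds a chat completions request.
// Only baseURL, apiKey, and model change from one provider to the next.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" { // local servers often need no key
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	// Assumed default base URLs; adjust for your own setup.
	baseURLs := []string{
		"https://api.openai.com/v1",
		"http://localhost:8000/v1",  // a local vLLM server
		"http://localhost:11434/v1", // a local Ollama server
	}
	for _, base := range baseURLs {
		req, err := newChatRequest(base, "", "example-model", "Write one sample.")
		if err != nil {
			panic(err)
		}
		fmt.Println(req.Method, req.URL)
	}
}
```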
Strict Schemas
Built-in validation for four dataset types. Every sample is checked before it's written.
Four Training Formats, One Command
Generate datasets in the formats you need. Each schema includes built-in validation and output formatting.
Instruction
Alpaca-style
Instruction, input, and output pairs for supervised fine-tuning (SFT).
Chat
ShareGPT format
Multi-turn conversations between human and assistant for dialogue models.
Preference
DPO / RLHF
Chosen and rejected response pairs for preference learning.
Classification
Text + Label
Labeled text samples for training classifiers.
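To make the four formats concrete, here is a rough Go sketch of each record shape with a minimal validity check. The field names follow the common Alpaca, ShareGPT, and DPO conventions and are assumptions, as are the validation rules; KothaSet's own schemas may use different names and stricter checks.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Assumed record shapes, one per format.
type Instruction struct {
	Instruction string `json:"instruction"`
	Input       string `json:"input"` // may be empty in Alpaca-style data
	Output      string `json:"output"`
}

type Turn struct {
	From  string `json:"from"` // e.g. "human" or "gpt" in ShareGPT-style data
	Value string `json:"value"`
}

type Chat struct {
	Conversations []Turn `json:"conversations"`
}

type Preference struct {
	Prompt   string `json:"prompt"`
	Chosen   string `json:"chosen"`
	Rejected string `json:"rejected"`
}

type Classification struct {
	Text  string `json:"text"`
	Label string `json:"label"`
}

// Sample is anything that can check itself before being written.
type Sample interface{ Validate() error }

func (s Instruction) Validate() error {
	if strings.TrimSpace(s.Instruction) == "" || strings.TrimSpace(s.Output) == "" {
		return errors.New("instruction and output are required")
	}
	return nil
}

func (c Chat) Validate() error {
	if len(c.Conversations) < 2 {
		return errors.New("chat needs at least two turns")
	}
	for i, t := range c.Conversations {
		if strings.TrimSpace(t.Value) == "" {
			return fmt.Errorf("turn %d is empty", i)
		}
	}
	return nil
}

func (p Preference) Validate() error {
	if p.Prompt == "" || p.Chosen == "" || p.Rejected == "" {
		return errors.New("prompt, chosen, and rejected are required")
	}
	if p.Chosen == p.Rejected {
		return errors.New("chosen and rejected must differ")
	}
	return nil
}

func (c Classification) Validate() error {
	if c.Text == "" || c.Label == "" {
		return errors.New("text and label are required")
	}
	return nil
}

func main() {
	samples := []Sample{
		Instruction{Instruction: "Summarize the text.", Input: "Go is compiled.", Output: "Go compiles to native code."},
		Chat{Conversations: []Turn{{From: "human", Value: "Hi"}, {From: "gpt", Value: "Hello!"}}},
		Preference{Prompt: "Explain DPO.", Chosen: "Same answer.", Rejected: "Same answer."},
		Classification{Text: "Great product", Label: "positive"},
	}
	for _, s := range samples {
		if err := s.Validate(); err != nil {
			fmt.Printf("rejected %T: %v\n", s, err)
			continue
		}
		fmt.Printf("accepted %T\n", s)
	}
}
```

In this example the preference sample is rejected because its chosen and rejected responses are identical, which would carry no training signal.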
Define your data.
Let KothaSet do the rest.
Configure
Set up your teacher model (e.g., GPT-5.2) and output schema in a simple YAML file.
Seed
Provide a list of topics to spread your dataset across different subjects.
Generate
Run the CLI. KothaSet handles retries, rate limits, and validation automatically.
Ready to build your dataset?
Free and open source. Install it, run it, train your model.