Documentation
Everything you need to generate LLM training data, whether you're building your first fine-tuned model or scaling a research pipeline.
Introduction
StackAI generates synthetic training datasets for large language models. You describe what you want (a domain, a schema, a count) and the API returns a ready-to-use JSONL file in minutes. No labeling, no data collection, no privacy concerns.
Supervised Fine-Tuning
Teach a model to follow instructions, adopt a persona, or master a knowledge domain.
instruction_v1
Preference Training
Generate chosen/rejected pairs for RLHF and DPO alignment training.
preference_v1
Evaluation Benchmarks
Build held-out test sets to measure model capability and track regressions.
eval_v1
Conflict (Alignment)
Multi-drive tension scenarios for alignment decision layers with resolution metadata.
conflict_v1
What Is Synthetic Data?
Synthetic training data is AI-generated text that mimics the examples a human would write for training, but produced at scale, in minutes, with full control over distribution and quality.
Why not just use real data? Real data is slow to collect, expensive to label, hard to balance across topics, and often contains PII. Synthetic data lets you generate exactly the distribution you need, including rare or adversarial cases that barely appear in real corpora.
Is it effective? Yes. Many production models use synthetic data for fine-tuning, alignment, and evaluation. State-of-the-art models like Claude and GPT-4 are trained with AI-assisted data generation as part of their pipeline. The key is quality control, which is why StackAI runs automatic checks and optional LLM-as-judge scoring on every job.
Beginner · How language models actually learn
A base LLM (like GPT-4 or Claude before fine-tuning) is trained to predict the next token in a document. It learns facts and language patterns, but it doesn't know how to behave. It won't reliably follow instructions, stay in character, or refuse harmful requests.
Fine-tuning on labeled examples teaches the model specific behaviors. Each training example shows the model: "when you see input X, produce output Y." After seeing thousands of these, the model learns the pattern.
instruction_v1 generates X→Y pairs for this. preference_v1 generates pairs where one Y is better than another, teaching the model to prefer better answers. eval_v1 generates test cases so you can measure whether your fine-tuned model actually improved.
Beginner · Why data quality matters more than quantity
"Garbage in, garbage out" is even more true for LLMs than traditional ML. A model trained on 1,000 high-quality, diverse examples often outperforms one trained on 10,000 repetitive or low-quality examples.
StackAI runs two quality checks on every job: diversity analysis (removing near-duplicate records that would cause the model to overfit) and format validation (ensuring every record is complete and well-formed). Add verified: true for an LLM-as-judge pass that scores relevance, accuracy, and completeness.
Quick Start
Generate your first dataset in under 2 minutes.
Step 1: Get your API key
Go to your dashboard → API Keys → Create Key. Copy the key; it won't be shown again.
Step 2: Create a generation job
curl -X POST https://api.stackai.app/v1/synthetic/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema": "instruction_v1",
"domain": "customer support for a SaaS product",
"count": 50,
"model": "balanced"
}'
The API returns a job ID immediately. Generation runs asynchronously.
Step 3: Poll until complete, then download
# Poll job status
JOB_ID="syn_job_xxxxxxxxxxxxxxxxx"
curl https://api.stackai.app/v1/synthetic/jobs/$JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"

# Once status is "succeeded", get a download URL
curl "https://api.stackai.app/v1/synthetic/jobs/$JOB_ID/results?format=url" \
  -H "Authorization: Bearer YOUR_API_KEY"
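The same loop can be sketched in Python. This is an illustrative helper, not an official SDK: `fetch_status` is whatever wrapper you write around `GET /v1/synthetic/jobs/:jobId` (e.g. with `requests`), and the terminal status names other than `"succeeded"` are assumptions.

```python
import time

def poll_job(fetch_status, interval_s=2.0, timeout_s=600.0):
    """Poll until the job reaches a terminal state and return that state.

    fetch_status: zero-arg callable returning the job's current status
    string, e.g. a thin wrapper around GET /v1/synthetic/jobs/:jobId.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        # "succeeded" comes from the docs; "failed"/"cancelled" are assumed names
        if status in ("succeeded", "failed", "cancelled"):
            return status
        time.sleep(interval_s)
    raise TimeoutError("job did not reach a terminal state in time")
```

Passing the fetcher as a callable keeps the loop testable and lets you add retry or backoff logic in one place.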
Beginner · What is an API key and how should I store it?
An API key is a secret token that identifies your account when you call the API programmatically. Think of it like a password: anyone who has it can make requests charged to your account.
Never commit API keys to git, paste them into Slack, or store them in localStorage. The safest approach is to store them as environment variables: export STACKAI_KEY="sk_...", then reference them in your code as process.env.STACKAI_KEY.
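In Python, the same pattern looks like the sketch below. `STACKAI_KEY` is the variable name suggested above, not something the API requires; name it however your deployment conventions dictate.

```python
import os

def get_stackai_key(env=None):
    """Read the API key from the environment and fail fast if it's missing."""
    env = os.environ if env is None else env
    key = env.get("STACKAI_KEY")
    if not key:
        raise RuntimeError(
            "STACKAI_KEY is not set; export it in your shell "
            "instead of hard-coding it in source control"
        )
    return key
```

Failing fast at startup beats a confusing 401 halfway through a batch job.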
Data Schemas
Choose the schema that matches your training goal. Each schema produces a different record structure optimized for a different type of model training.
instruction_v1
Supervised Fine-Tuning
Instruction-input-output triples for teaching a model to follow instructions, adopt a persona, or answer questions in a specific domain. The most common starting point for model customization.
{
"instruction": "Explain what a webhook is to a non-technical user",
"input": "I keep seeing it mentioned in our integration docs",
"output": "A webhook is like a doorbell for software...",
"metadata": {},
"provenance": { "job_id": "syn_job_abc123", "schema": "instruction_v1", ... }
}
Beginner · What is supervised fine-tuning (SFT)?
A base language model knows a lot (it has read the internet), but it doesn't know how to behave. SFT teaches it by showing it thousands of (instruction → correct response) examples. After training on your data, the model learns to respond in your desired style, persona, and domain.
Popular SFT frameworks: Hugging Face TRL, LLaMA-Factory, Axolotl. These all accept the instruction/input/output format that StackAI produces.
Rule of thumb: 500–5,000 high-quality examples is usually enough for a focused domain. More is only better if it's diverse, which is why StackAI automatically removes near-duplicates.
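Many SFT frameworks want each record flattened into a single training string. As a sketch, here is one common way to render an instruction_v1 record; the Alpaca-style section headers are a popular convention, not a StackAI requirement, so check what your framework expects.

```python
def to_sft_text(record):
    """Render one instruction_v1 record as a single training string.

    Uses Alpaca-style headers (a common convention, not mandated by the
    schema); an empty `input` field is simply omitted.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("input"):
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)
```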
preference_v1
RLHF / DPO · Balanced, Diverse & Deep
Prompt + chosen/rejected response pairs with scores and reasoning. Used to train a reward model or directly optimize with DPO. Available on balanced, diverse, and deep tiers. Both responses are scored by a critic pass; rejected answers are intentionally lower-quality but coherent.
{
"prompt": "How should I handle authentication in a REST API?",
"chosen": "Use JWT tokens with short expiry (15 min) stored in httpOnly cookies...",
"rejected": "You can use JWT tokens and store them in localStorage for easy access...",
"chosen_score": 9.0,
"rejected_score": 4.5,
"reasoning": "The chosen response correctly identifies the security implications...",
"metadata": {},
"provenance": { "schema": "preference_v1", ... }
}
Beginner · What is RLHF and DPO, and which should I use?
After SFT, a model follows instructions, but it may still produce mediocre answers. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) both use preference pairs to teach the model to prefer better responses over worse ones.
RLHF: Train a reward model on the preference pairs, then use PPO to optimize the LLM against the reward model. More complex, higher ceiling. Used by OpenAI and Anthropic for their flagship models.
DPO: Skip the reward model entirely. Directly train on the (prompt, chosen, rejected) triples. Simpler, faster, and often competitive with RLHF on focused domains. Recommended for most fine-tuning use cases.
For DPO, use the Hugging Face TRL DPOTrainer, which accepts the prompt/chosen/rejected format that StackAI produces.
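A minimal sketch of that handoff: convert preference_v1 JSONL lines into the prompt/chosen/rejected dicts that DPOTrainer consumes. The field names come from the schema shown above; wrapping the rows in a `datasets.Dataset` is left to your training script.

```python
import json

def load_dpo_rows(lines):
    """Convert preference_v1 JSONL lines into prompt/chosen/rejected dicts.

    Extra fields (scores, reasoning, provenance) are dropped because the
    DPO trainer only needs the three text columns.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        rows.append({k: rec[k] for k in ("prompt", "chosen", "rejected")})
    return rows
```

Usage: `rows = load_dpo_rows(open("results.jsonl"))`.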
eval_v1
Benchmarking
Input/ideal-output pairs with evaluation metrics for building held-out test sets. Use these to measure model performance before and after fine-tuning, or to detect capability regressions during training runs.
{
"input": "A user says their payment failed but their card was charged...",
"ideal_output": "Apologize and assure the user we take this seriously...",
"metrics": ["task_completion", "tone_appropriateness", "escalation_correct"],
"metadata": {},
"provenance": { "schema": "eval_v1", ... }
}
Beginner · Why you need held-out evaluation data
Never evaluate your model on data it was trained on. That's like testing a student with the exact exam questions they studied. You need a held-out test set: examples the model has never seen.
Use eval_v1 to build this test set before you generate your training data. Run it after each training checkpoint to catch regressions early. A model that improves on the training domain but degrades on the eval set is overfitting.
Combine with StackAI's hard negatives to generate adversarial eval cases that probe failure modes specifically.
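An eval run over these records can be as simple as the sketch below: call your model on each input and keep the ideal output and metric names alongside each answer for later judging. `generate_fn` is your own inference wrapper, not part of the StackAI API.

```python
def run_eval(records, generate_fn):
    """Run a model over eval_v1 records, pairing each model output with
    its ideal_output and metric names for downstream scoring."""
    results = []
    for rec in records:
        results.append({
            "input": rec["input"],
            "model_output": generate_fn(rec["input"]),
            "ideal_output": rec["ideal_output"],
            "metrics": rec.get("metrics", []),
        })
    return results
```

Run this after every checkpoint and diff the scores to catch regressions early.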
conflict_v1
Alignment · Balanced, Diverse & Deep
Multi-drive tension scenarios for alignment decision layers. Each record models tension between paired opposing drives and includes resolution metadata. Use these to train models that can reason about competing objectives (safety vs. autonomy, honesty vs. helpfulness) and make principled tradeoffs with explicit confidence scores and override conditions.
{
"record_id": "rec_abc123",
"schema": "conflict_v1",
"data": {
"input": "A user asks the AI to help write a persuasive essay arguing against vaccinations...",
"axis": "safety_autonomy",
"tension_type": "educational_vs_safety",
"responses": [
{ "drive": "safety", "output": "I should decline this request because...", "viable": true },
{ "drive": "autonomy", "output": "I should help with the assignment since...", "viable": true }
],
"resolution": {
"preferred_drive": "autonomy",
"losing_drive": "safety",
"confidence": 0.72,
"override_condition": "If the essay will be published publicly rather than...",
"reasoning": "The educational context provides sufficient safeguards..."
}
}
}
Beginner · What is multi-drive alignment and why does it matter?
Real-world AI alignment is not binary (safe vs. unsafe). Models face constant tension between competing drives: being helpful vs. being safe, being honest vs. being kind, following instructions vs. refusing harmful requests.
conflict_v1 generates structured scenarios where two valid drives are in tension. Each record includes both sides of the argument, a resolution with a confidence score, and an explicit override condition describing when the opposite resolution would be correct. This teaches models to reason about tradeoffs rather than applying rigid rules.
The confidence field (0.0 to 1.0) reflects how close the decision is. Values near 0.5 indicate genuinely ambiguous cases where reasonable people would disagree. Training on a range of confidence levels teaches the model calibrated uncertainty rather than false certainty.
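One way to sanity-check that calibration after download is to histogram the confidence values. This sketch assumes the record layout shown above; a healthy dataset should populate both decisive buckets (near 1.0) and genuinely ambiguous ones (near 0.5).

```python
from collections import Counter

def confidence_buckets(records, width=0.1):
    """Count conflict_v1 resolutions per confidence bucket."""
    counts = Counter()
    n_buckets = int(round(1.0 / width))
    for rec in records:
        c = rec["data"]["resolution"]["confidence"]
        bucket = min(int(c / width), n_buckets - 1)  # clamp c == 1.0 into the top bucket
        counts[round(bucket * width, 2)] += 1
    return dict(counts)
```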
Request Parameters
All parameters for POST /v1/synthetic/generate. Fields marked * are required.
Top-level fields
| Parameter | Type | Description |
|---|---|---|
| schema* | string | Schema type: "instruction_v1", "preference_v1", "eval_v1", or "conflict_v1". Note: preference_v1 and conflict_v1 require Balanced or higher and have separate pricing ($4/$10/$30 per 1K). |
| domain* | string | Plain-English description of the subject area (e.g., "customer support for fintech"). Richer descriptions produce more focused data. |
| count* | integer | Number of records to generate. Must be a positive integer. Billed per accepted record. |
| model | string | Model shorthand: "fast", "balanced", "diverse", or "deep". Legacy aliases "economy", "standard", "premium" are still accepted. Required unless using the verbose models object. |
| models | object | Explicit { generator, critic } objects, each with provider and model. Alternative to the model shorthand. |
| system_prompt | string | Custom instructions appended to the built-in prompt (max 4,000 chars). Describe your AI's persona, tone, or constraints. |
| verified | boolean | Enable LLM-as-judge quality scoring. Each record is scored on relevance, accuracy, and completeness (1–10). Records below 6.0 are rejected. +$5/1K PAYG or 1.5× quota on subscriptions. |
| advanced | object | Advanced configuration: categories, coverage, invariants, response_policy, hard_negatives, output_split, and more. See Advanced Configuration. |
| conflict_config | object | Required for the conflict_v1 schema. Configures drive pairs, axis distribution, resolution mode, and vocabulary overrides. See the conflict_config object below. |
| constraints | object | Style and language constraints: { language?, tone?, difficulty? }. See the constraints object below. (Response length is controlled via advanced.response_constraints.) |
| license | string | License to embed in the manifest (e.g., "cc-by-4.0", "mit", "proprietary"). |
constraints object
Optional nested object for controlling the style and language of generated content.
| Field | Type | Description |
|---|---|---|
| language | string | ISO 639-1 language code (e.g., "en", "es", "fr", "zh"). Generates content in the specified language. |
| tone | string | Writing tone (e.g., "formal", "casual", "technical", "empathetic"). |
| difficulty | string | Content complexity (e.g., "beginner", "intermediate", "expert"). |
conflict_config object
Required when schema is "conflict_v1". Configures the drive pairs, axis distribution, resolution behavior, and optional vocabulary overrides for multi-drive tension scenarios.
| Field | Type | Description |
|---|---|---|
| drive_pairs* | array | Array of drive pair objects. Each pair has axis_name (string), optional description, and drives (tuple of exactly 2 objects with name and description). Example: [{ "axis_name": "safety_autonomy", "drives": [{ "name": "safety", "description": "Prioritize user protection" }, { "name": "autonomy", "description": "Respect user agency" }] }]. 1–20 pairs allowed. |
| axis_distribution | object | Optional. Percentage of records per axis (e.g., { "safety_autonomy": 60, "honesty_helpfulness": 40 }). Must sum to approximately 100%. If omitted, records are distributed evenly across drive pairs. |
| resolution_mode | string | Resolution annotation mode. "annotated" (default): full resolution with preferred_drive, confidence, override_condition, and reasoning. "graded": adds numeric scoring. "none": no resolution block in output. |
| override_vocabulary | array | Optional. Array of structured label strings for override conditions (e.g., ["safety_critical", "user_explicit_consent", "legal_requirement"]). Max 50 items. |
{
"schema": "conflict_v1",
"domain": "AI assistant alignment decisions",
"count": 200,
"model": "deep",
"conflict_config": {
"drive_pairs": [
{
"axis_name": "safety_autonomy",
"drives": [
{ "name": "safety", "description": "Prioritize user protection..." },
{ "name": "autonomy", "description": "Respect user agency..." }
]
},
{
"axis_name": "honesty_helpfulness",
"drives": [
{ "name": "honesty", "description": "Provide accurate info..." },
{ "name": "helpfulness", "description": "Maximize usefulness..." }
]
}
],
"axis_distribution": { "safety_autonomy": 60, "honesty_helpfulness": 40 },
"resolution_mode": "annotated"
}
}
models object (verbose form)
Use this instead of the model shorthand when you need exact provider/model control.
{
"schema": "preference_v1",
"domain": "code review best practices",
"count": 100,
"models": {
"generator": { "provider": "anthropic", "model": "claude-sonnet-4-6" },
"critic": { "provider": "openai", "model": "gpt-4o-mini" }
}
}
The critic (used for preference_v1 scoring) defaults to GPT-4o Mini on all tiers for cost efficiency. You can override it via the models.critic object.
Advanced Configuration
Most advanced options are free. Pass them inside the advanced object. Omit the object entirely for simple jobs; nothing changes. The only option with a cost is invariants (~$0.04/1K records for the LLM checker).
Model shorthand
| Shorthand | Provider | Model | Price/1K |
|---|---|---|---|
| "fast" | OpenAI | gpt-4o-mini | $0.50 |
| "balanced" | OpenAI | gpt-4.1-mini | $3.00 |
| "diverse" | OpenAI | gpt-5.4-mini | $8.00 |
| "deep" | Anthropic | claude-sonnet-4-6 | $25.00 |
Legacy aliases "economy", "standard", and "premium" are still accepted and map to fast, balanced, and deep respectively. Note: deep is pay-as-you-go only and not included in subscription plans.
preference_v1 is available on balanced, diverse, and deep tiers. At least one of model or models is required.
Benchmark comparison
| Tier | Diversity % | Avg Output Depth | Near-Dup Rate |
|---|---|---|---|
| fast | 80% | ~612 chars avg | ~20% |
| balanced | 94% | ~600 chars avg | ~6% |
| diverse | 98% | ~1,085 chars avg | ~2% |
| deep | 82% | ~1,955 chars avg | ~18% |
Benchmarks measured on 1,000-record instruction_v1 jobs across 10 domains. Diversity % is the ratio of unique trigram sets after deduplication. Near-dup rate is the percentage of generated records removed by the diversity check (Jaccard > 0.7).
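For intuition, the diversity check can be approximated in a few lines. This sketch uses word-level trigrams and a kept-first policy; both are assumptions about the implementation, not the production code.

```python
def trigram_set(text):
    """Word-level trigrams of a record's text (an assumed granularity)."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def jaccard(a, b):
    """Jaccard similarity of two sets; identical empty sets count as 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(texts, threshold=0.7):
    """Indices of records whose trigram Jaccard similarity with an
    earlier (kept) record exceeds the threshold."""
    kept, dupes = [], []
    for i, text in enumerate(texts):
        grams = trigram_set(text)
        if any(jaccard(grams, kg) > threshold for _, kg in kept):
            dupes.append(i)
        else:
            kept.append((i, grams))
    return dupes
```

Running this on your own corpus before upload is a cheap way to predict how many records the diversity check will remove.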
system_prompt
Append custom instructions to the built-in prompt. Maximum 4,000 characters. Your prompt is never allowed to replace or override the core format/safety instructions; it's added as "Additional Instructions from the User."
{
"schema": "instruction_v1",
"domain": "mental health peer support",
"count": 200,
"model": "deep",
"system_prompt": "You are a compassionate peer support assistant... Always validate feelings before offering advice. Never diagnose..."
}
advanced.categories
Distribute records across named categories by percentage. The worker allocates records using the largest-remainder method (no rounding errors). Each category's description is injected into the generation prompt, enriching diversity within the category.
"advanced": {
"categories": {
"billing_disputes": { "percentage": 20, "description": "Charges, refunds, and billing errors" },
"account_access": { "percentage": 20, "description": "Login failures, 2FA, password resets" },
"feature_requests": { "percentage": 15, "description": "Users asking for product improvements" },
"bug_reports": { "percentage": 25, "description": "Unexpected errors or broken features" },
"general_questions": { "percentage": 20, "description": "How-to and product information" }
}
}
Percentages must sum to approximately 100%. Category names become available as a from_category metadata field type.
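The largest-remainder allocation is easy to reproduce locally if you want to predict exactly how many records each category will receive. A sketch (the tie-breaking order among equal remainders is an assumption):

```python
def allocate_counts(total, percentages):
    """Largest-remainder allocation: integer per-category counts that
    always sum exactly to `total`."""
    quotas = {k: total * p / 100.0 for k, p in percentages.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = total - sum(counts.values())
    # hand one extra record to each category with the largest fractional part
    by_fraction = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts
```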
Beginner · Why category balance matters for training
If your training data has 90% easy examples and 10% hard ones, the model will optimize for the easy cases and perform poorly on the hard ones. Category distribution lets you intentionally oversample underrepresented scenarios.
Example: A customer support model trained on balanced billing/bug/feature data will handle all three equally well. Without categories, generation is stochastic; you might get 70% general questions and 5% billing disputes just by chance.
advanced.response_policy & style_rules
Define behavioral rules and writing style injected into the system prompt verbatim. Keys in response_policy become labeled policy sections.
"advanced": {
"response_policy": {
"safe_allowed": "Provide accurate, helpful information...",
"unsafe_disallowed": "Never provide medical diagnoses...",
"unclear_intent": "Ask a clarifying question rather than assuming the worst"
},
"style_rules": [
"Keep responses under 3 sentences when possible",
"Use active voice"
]
}
advanced.hard_negatives
Generate adversarial examples designed to probe failure modes. Records are tagged with metadata.hard_negative: true and metadata.technique: "technique_name".
instruction_v1: Generates adversarial inputs, instructions designed to elicit harmful or incorrect outputs. The generated response shows the correct, safe handling. Use for safety and alignment training.
preference_v1: Generates hard negative responses: the rejected answer is plausible and well-written but contains subtle flaws (wrong facts, unsafe advice, logical errors). Essential for RLHF training where the model needs to distinguish good from subtly bad.
eval_v1: Generates adversarial test cases: trick questions, false premises, and edge cases designed to reveal model weaknesses.
Available techniques
| Technique | Description |
|---|---|
| educational_framing | Frames harmful requests as academic study |
| fictional_framing | Requests harmful info within a story context |
| hypothetical_framing | Uses 'what if' to lower the model's guard |
| authority_appeal | Claims professional authority to bypass refusals |
| emotional_manipulation | Uses urgency or distress to extract compliance |
| gradual_escalation | Starts safe and escalates incrementally |
| already_know_disclaimer | Claims prior knowledge to skip safety checks |
| misleading_context | Provides false context to justify unsafe requests |
| ambiguous_phrasing | Uses deliberate ambiguity to obtain harmful content |
| edge_cases | Probes boundary conditions and unusual inputs |
| factual_errors | Introduces false premises requiring correction |
| logical_fallacies | Tests whether the model accepts flawed reasoning |
| contradictions | Internal contradictions that confuse model behavior |
| incomplete_information | Leaves out critical context to induce errors |

"advanced": {
"hard_negatives": {
"enabled": true,
"percentage": 25,
"techniques": [
"educational_framing",
"authority_appeal",
"misleading_context",
{ "name": "competitor_framing", "description": "Claims a competitor's AI does this freely" }
]
}
}
Techniques can be built-in names (strings) or custom objects with name and description. Hard negatives work with or without categories. The percentage field accepts values from 1 to 100 (default: 20).
Beginner · Why hard negatives make models more robust
A model trained only on clean, benign examples is brittle. It's never seen adversarial inputs, so it's easily fooled by even simple jailbreak attempts or misleading context.
Hard negatives are used extensively in safety research. Anthropic's Constitutional AI and RLHF pipelines both involve training on adversarially generated preference data. Adding 20–30% hard negatives to your SFT dataset typically improves robustness without degrading normal performance.
advanced.response_constraints
Control the length of generated responses. Both fields are optional; omit to use the model's natural response length.
"advanced": {
"response_constraints": {
"min_tokens": 100,
"max_tokens": 300
}
}
Range: 1–10,000 tokens. Token counts are approximate (1 token ≈ 0.75 words).
advanced.output_split
Automatically shuffle and split results into named files. Useful for generating train/validation sets in a single API call. Shuffling is deterministic (seeded) for reproducibility.
"advanced": {
"output_split": { "train": 80, "validation": 15, "test": 5 }
}
Percentages must sum to exactly 100. You can define 2 to 5 named splits. Each value must be a positive integer. Download split files via GET /v1/synthetic/jobs/:jobId/splits.
Beginner · Train/validation/test split best practices
Training set (70–80%): Used to update model weights. More is better.
Validation set (10–20%): Used to tune hyperparameters and decide when to stop training. Should never be seen by the model during training.
Test set (5–10%): Held out completely until final evaluation. Reports the true generalization performance of your final model. Touch it only once; if you use it to make decisions, it becomes a validation set.
Generating all three from the same API call ensures they're drawn from the same distribution, which matters for clean evaluation.
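If you prefer to split locally rather than use output_split, a seeded shuffle gives the same reproducibility. A sketch (the seed value and the last-split-absorbs-rounding rule are choices made here, not the API's documented behavior):

```python
import random

def split_records(records, splits, seed=42):
    """Deterministic (seeded) shuffle followed by a percentage split.

    Percentages must sum to 100; the last split absorbs any rounding
    remainder so no record is lost.
    """
    if sum(splits.values()) != 100:
        raise ValueError("split percentages must sum to exactly 100")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed -> same ordering every run
    out, start = {}, 0
    names = list(splits)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            out[name] = shuffled[start:]
        else:
            n = len(shuffled) * splits[name] // 100
            out[name] = shuffled[start:start + n]
            start += n
    return out
```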
advanced.metadata_fields
Attach custom metadata to every record. All types except llm_assessed are free post-generation operations (no LLM calls).
| type | Description | Options |
|---|---|---|
| auto_increment | Sequential integers (1, 2, 3…) | prefix: optional string prefix (e.g., "cs_") |
| from_category | Category name the record was generated for | None |
| uuid | Random UUID v4 per record | None |
| constant | Same fixed value on every record | value: required string/number/bool |
| llm_assessed | LLM-scored field (1–10 scale) | +$5/1K surcharge (waived if verified: true is already enabled, since it uses the same judge pass) |
"advanced": {
"metadata_fields": [
{ "name": "id", "type": "auto_increment", "prefix": "cs_" },
{ "name": "category", "type": "from_category" },
{ "name": "run_id", "type": "constant", "value": "march-2026-v1" },
{ "name": "uid", "type": "uuid" }
]
}
advanced.invariants
Define rules that every generated record must satisfy. The worker runs a lightweight LLM check (GPT-4o Mini) on each record against your rules after generation. Two enforcement modes control what happens when a record violates a rule.
strict: The record is rejected outright if it violates the rule. It will not appear in your results. The rejection is counted in the quality report under invariant_violated.
soft: The record is kept but tagged with metadata.invariant_soft_flags listing which soft rules it violated. You can filter these out yourself during training if needed.
"advanced": {
"invariants": [
{ "rule": "Responses must not contain personal medical advice", "enforcement": "strict" },
{ "rule": "All outputs should include a disclaimer when discussing financial topics", "enforcement": "soft" }
]
}
You can define 1 to 10 rules. Each rule text must be 10 to 500 characters. Invariant checking uses GPT-4o Mini and adds minimal cost (~$0.04 per 1K records). If the checker fails on a given record, the record passes through (fail-open design).
Beginner · When to use invariants vs. response_policy
response_policy tells the generator what to do and what to avoid. It is a best-effort instruction injected into the LLM prompt. Models usually follow it, but there is no enforcement after generation.
invariants are verified after generation by a separate LLM pass. A strict invariant guarantees that no record violating the rule will appear in your dataset. Use both together for defense-in-depth: the policy steers generation, and the invariant catches anything that slips through.
advanced.coverage
Ensure systematic coverage across multiple dimensions of variation. Instead of random sampling, coverage mode builds a structured grid and distributes records across it. Mutually exclusive with categories.
all_combinations: Computes the cross-product of all dimension values. In the example below, 3 difficulties × 3 topics = 9 cells. Records are distributed evenly across cells, with at least min_per_cell records in each. Maximum 100 cells.
each_value: Each value in each dimension appears at least once, but the full cross-product is not required. Use this when you have many dimensions and the combinatorial explosion would be too large.
"advanced": {
"coverage": {
"dimensions": [
{ "name": "difficulty", "values": ["easy", "medium", "hard"] },
{ "name": "topic", "values": ["safety", "privacy", "fairness"] }
],
"mode": "all_combinations",
"min_per_cell": 2
}
}
1 to 5 dimensions allowed, each with 1 to 20 values. No additional cost; coverage is implemented through prompt enrichment. Each record's metadata includes the assigned coverage cell values (e.g., metadata.coverage_difficulty: "hard").
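The all_combinations grid is just a cross-product, so you can enumerate the cells locally to predict how many records a given count implies. A sketch:

```python
from itertools import product

def coverage_cells(dimensions):
    """Enumerate the all_combinations grid: one dict per cell."""
    names = [d["name"] for d in dimensions]
    value_lists = [d["values"] for d in dimensions]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]
```

With min_per_cell: 2, a grid of N cells needs a count of at least 2 × N records.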
Beginner · Coverage vs. categories: which to use
Categories give you a single flat dimension (e.g., 25% billing, 25% account access, 50% bugs). Good when you want manual control over one axis of variation.
Coverage gives you multi-dimensional grids. If you need every combination of (difficulty × topic × persona) to appear in your training set, coverage is the right tool. The worker automatically computes the grid, distributes records, and enriches prompts with the cell context.
You cannot use both at the same time. If you pass both categories and coverage, the API returns a 400 error.
Full advanced example (with categories + invariants)
curl -X POST https://api.stackai.app/v1/synthetic/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema": "instruction_v1",
"domain": "AI safety, non-harm alignment training",
"count": 500,
"model": "deep",
"system_prompt": "You are a safety-focused AI assistant...",
"verified": true,
"advanced": {
"categories": {
"safety_critical": { "percentage": 30 },
"social_conflict": { "percentage": 25 },
"misinformation": { "percentage": 25 },
"benign": { "percentage": 20 }
},
"invariants": [
{ "rule": "Responses must never encourage self-harm or violence", "enforcement": "strict" },
{ "rule": "Responses should acknowledge uncertainty when not clear-cut", "enforcement": "soft" }
],
"hard_negatives": { "enabled": true, "percentage": 20 },
"output_split": { "train": 80, "validation": 20 },
"metadata_fields": [
{ "name": "id", "type": "auto_increment", "prefix": "nh_" },
{ "name": "category", "type": "from_category" }
]
}
}'
Full advanced example (with coverage)
curl -X POST https://api.stackai.app/v1/synthetic/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema": "eval_v1",
"domain": "AI fairness evaluation",
"count": 200, "model": "diverse",
"advanced": {
"coverage": {
"dimensions": [
{ "name": "difficulty", "values": ["easy", "medium", "hard"] },
{ "name": "topic", "values": ["safety", "privacy", "fairness", "transparency"] },
{ "name": "persona", "values": ["expert", "beginner"] }
],
"mode": "all_combinations",
"min_per_cell": 2
},
"invariants": [
{ "rule": "Test cases must have exactly one correct answer", "enforcement": "strict" }
],
"output_split": { "train": 80, "test": 20 }
}
}'
The coverage example above creates a 3 × 4 × 2 = 24-cell grid with at least 2 records per cell. Note that coverage replaces categories; you cannot use both in the same request.
Quality System
Every job runs automatic quality checks. Quality never blocks delivery; if checks fail, the data ships with the report attached.
Phase 1: Automatic checks (always on, free)
Format compliance
Validates field lengths, detects prompt leakage (model confusing instructions with output), truncation, and copy-paste errors.
format_invalid
Diversity (deduplication)
Computes trigram Jaccard similarity between all record pairs. Records with similarity > 0.7 are flagged as near-duplicates and removed.
near_duplicate
Preference checks
For preference_v1 only: validates that chosen/rejected scores differ by ≥ 2 points and that responses aren't too similar in wording.
low_margin / too_similar
Grading rubric
| Grade | Pass Rate | Diversity Score | Verified: Mean Score |
|---|---|---|---|
| A | ≥ 95% | ≥ 0.80 | ≥ 8.0 / 10 |
| B | ≥ 85% | ≥ 0.60 | ≥ 7.0 / 10 |
| C | ≥ 70% | ≥ 0.40 | ≥ 5.0 / 10 |
| D | Below C | Below C | Below C |
Phase 2: Verified Quality (+$5/1K records)
Add verified: true to enable an LLM-as-judge second pass (GPT-4o Mini). Each record is scored 1–10 on relevance, accuracy, and completeness. Records below 6.0 are rejected. Per-record scores are included in the results JSONL under provenance.quality.
curl -X POST https://api.stackai.app/v1/synthetic/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"schema": "instruction_v1",
"domain": "legal contract review",
"count": 200,
"model": "deep",
"verified": true
}'
Quality report
{
"quality_report": {
"version": "1.0",
"phase": 2,
"total_generated": 215,
"total_accepted": 198,
"total_rejected": 17,
"pass_rate": 0.921,
"diversity_score": 0.87,
"mean_judge_score": 8.3,
"overall_grade": "A",
"rejection_breakdown": {
"format_invalid": 4,
"near_duplicate": 7,
"judge_rejected": 6
}
}
}
Beginner · LLM-as-judge evaluation: what it is and why it works
Instead of relying only on rule-based checks, StackAI uses a separate LLM (GPT-4o Mini) to read each generated record and score it on three dimensions:
- Relevance: Does the response answer the actual question?
- Accuracy: Is the information factually correct?
- Completeness: Does it cover the key points?
Research by Zheng et al. (MT-Bench, 2023) showed that GPT-4 judgments correlate strongly with human preferences. Using a judge model that's different from the generator reduces bias from the generator's "style matching" tendencies.
Use Cases
Real-world patterns for common training scenarios. Click any example to copy the full curl command.
AI safety & alignment training
Generate instruction data with categories covering unsafe inputs and hard negatives for robustness. Use verified: true for highest data quality.
{
"schema": "instruction_v1",
"domain": "AI assistant safety and alignment",
"count": 1000, "model": "deep", "verified": true,
"advanced": {
"categories": {
"harmful_request_refusal": { "percentage": 35 },
"misinformation_correction": { "percentage": 25 },
"privacy_protection": { "percentage": 20 },
"benign_helpfulness": { "percentage": 20 }
},
"hard_negatives": { "enabled": true, "percentage": 30 },
"output_split": { "train": 80, "validation": 20 }
}
}
Multi-drive alignment conflict training
Generate structured tension scenarios where competing drives (safety vs. autonomy, honesty vs. helpfulness) must be resolved with confidence scores and override conditions. Available on balanced, diverse, and deep tiers.
{
"schema": "conflict_v1",
"domain": "AI assistant alignment decisions in sensitive contexts",
"count": 500, "model": "deep", "verified": true,
"conflict_config": {
"drive_pairs": [
{ "a": "safety", "b": "autonomy" },
{ "a": "honesty", "b": "helpfulness" },
{ "a": "privacy", "b": "transparency" }
],
"axis_distribution": { "safety_autonomy": 40, "honesty_helpfulness": 35, "privacy_transparency": 25 },
"resolution_mode": "annotated"
},
"advanced": {
"output_split": { "train": 80, "validation": 20 }
}
}
Domain-specific chatbot fine-tuning
Build a customer support or specialist assistant. Use system_prompt to define the persona and response_policy to encode business rules.
{
"schema": "instruction_v1",
"domain": "technical support for cloud infrastructure",
"count": 500, "model": "balanced",
"system_prompt": "You are Aria, a friendly cloud support engineer at CloudCo...",
"constraints": { "tone": "friendly", "difficulty": "mixed" },
"advanced": {
"categories": {
"billing_and_costs": { "percentage": 20 },
"networking": { "percentage": 25 },
"compute_and_scaling": { "percentage": 30 },
"storage": { "percentage": 25 }
}
}
}
RLHF / DPO preference training
Generate chosen/rejected pairs for DPO training. Available on balanced, diverse, and deep tiers. Use verified: true; quality matters most for preference data since bad preference pairs teach the wrong direction.
{
"schema": "preference_v1",
"domain": "writing assistance and editing",
"count": 300, "model": "deep", "verified": true,
"advanced": {
"hard_negatives": { "enabled": true, "percentage": 20 },
"output_split": { "train": 90, "validation": 10 }
}
}
Multi-language training data
Generate data in any language using the constraints.language field. Submit multiple jobs (one per language) for balanced multilingual training sets.
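Since each language is its own job, the fan-out is easy to script. A minimal sketch, assuming the request fields shown in the examples below (the language list and domain here are illustrative); each payload would be POSTed to /v1/synthetic/generate:

```python
# Build one generation request per target language. Each payload mirrors the
# JSON examples in this section; only constraints.language and the lang
# metadata constant vary between jobs.
LANGUAGES = ["es", "fr", "de"]

def build_job(lang: str) -> dict:
    return {
        "schema": "instruction_v1",
        "domain": "e-commerce customer support",
        "count": 200,
        "model": "balanced",
        "constraints": {"language": lang},
        "advanced": {
            "metadata_fields": [
                {"name": "lang", "type": "constant", "value": lang}
            ]
        },
    }

payloads = [build_job(lang) for lang in LANGUAGES]
# Submit each payload with your HTTP client of choice, e.g.
# requests.post("https://api.stackai.app/v1/synthetic/generate", json=payload, headers=headers)
```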
# Spanish dataset
{ "schema": "instruction_v1", "domain": "e-commerce customer support",
"count": 200, "model": "balanced",
"constraints": { "language": "es" },
"advanced": { "metadata_fields": [{ "name": "lang", "type": "constant", "value": "es" }] }
}
# French dataset
{ "constraints": { "language": "fr" }, "advanced": { "metadata_fields": [{ "name": "lang", "type": "constant", "value": "fr" }] } }
Output & Provenance
Results are returned as JSONL (one JSON object per line). Each record contains the schema fields, any custom metadata you configured, and a provenance object.
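Because JSONL is one object per line, a downloaded results file can be streamed line by line. A minimal sketch of loading records and keeping only those whose judge score clears a custom bar (note: provenance.quality is only present on verified jobs, so unverified records fall through the default here):

```python
import json
from typing import Iterable

def filter_records(lines: Iterable[str], min_overall: float = 8.0) -> list[dict]:
    """Parse JSONL lines, keeping records with provenance.quality.overall >= min_overall."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        overall = rec.get("provenance", {}).get("quality", {}).get("overall", 0.0)
        if overall >= min_overall:
            kept.append(rec)
    return kept

# Usage:
#   with open("results.jsonl", encoding="utf-8") as f:
#       dataset = filter_records(f)
```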
instruction_v1 record (full)
{
"instruction": "How do I safely handle a pan fire in my kitchen?",
"input": "",
"output": "Cover the pan with a lid to cut off oxygen. Never use water on a grease fire...",
"metadata": { "id": "nh_042", "category": "safety_critical", "hard_negative": false },
"provenance": {
"job_id": "syn_job_a1b2c3d4e5",
"schema": "instruction_v1",
"domain": "household safety",
"generated_at": "2026-03-11T12:34:56Z",
"model": "claude-sonnet-4-6",
"quality": { "relevance": 9.5, "accuracy": 9.0, "completeness": 8.5, "overall": 9.0 }
}
}
preference_v1 record (full)
{
"prompt": "What's the safest way to store passwords in a database?",
"chosen": "Use bcrypt, scrypt, or Argon2 with a per-user random salt...",
"rejected": "SHA-256 with a salt is a good option. It's fast and widely supported...",
"chosen_score": 9.5,
"rejected_score": 3.0,
"reasoning": "The chosen response correctly recommends purpose-built password hashing...",
"metadata": {},
"provenance": { "schema": "preference_v1", "model": "claude-sonnet-4-6", ... }
}
manifest.json
Every job also produces a manifest.json that summarizes the job parameters and quality report. Useful for provenance tracking in ML pipelines.
{
"job_id": "syn_job_a1b2c3d4e5",
"schema": "instruction_v1",
"count_accepted": 481,
"count_rejected": 19,
"quality_report": { "overall_grade": "A", "pass_rate": 0.962 },
...
}
API Reference
Base URL: https://api.stackai.app. All endpoints require an Authorization: Bearer YOUR_API_KEY header unless marked Public.
POST /v1/synthetic/generate
Create a new generation job. Returns job_id and initial status immediately. Generation runs asynchronously.
GET /v1/synthetic/jobs
List your jobs. Supports ?status= (queued/running/succeeded/failed) and ?limit= filters.
GET /v1/synthetic/jobs/:jobId
Get job status, counts, quality grade, and summary. Poll this until status is "succeeded" or "failed".
GET /v1/synthetic/jobs/:jobId/results
Stream the JSONL results file directly. Add ?format=url for a presigned S3 download URL (valid 1 hour).
GET /v1/synthetic/jobs/:jobId/quality-report
Returns a presigned URL for the detailed quality report JSON file.
GET /v1/synthetic/jobs/:jobId/splits
Returns presigned download URLs for each named split (e.g., train, validation) when output_split was configured.
GET /v1/synthetic/pricing (Public)
Returns current pay-as-you-go pricing by schema and quality tier.
GET /health (Public)
Service health check. Returns status of API, database, queue, and email services.
Error Handling
The API uses standard HTTP status codes. Error bodies always include a message and optionally a code for programmatic handling.
Common status codes
| Status | Code | Meaning & Fix |
|---|---|---|
| 401 | | Missing or invalid API key. Check your Authorization header. |
| 402 | INSUFFICIENT_BALANCE | PAYG balance too low, or subscription quota exhausted. |
| 403 | EMAIL_NOT_VERIFIED | Your account email isn't verified. Check your inbox. |
| 422 | DOMAIN_SPELL_CHECK | Possible typo in domain. Re-submit with X-Domain-Confirmed: true to override. |
| 422 | VALIDATION_ERROR | Request body failed validation. Check the errors array in the response. |
| 429 | | Rate limit exceeded (60 req/min per org). Add backoff and retry. |
| 5xx | | Server-side error. Safe to retry with exponential backoff. |
Domain spell-check flow (422)
If your domain contains possible misspellings, the API returns a 422 with suggestions before creating the job and consuming quota. Re-submit with the corrected domain or override the check if your spelling is intentional (technical terms, brand names, etc.).
# Add X-Domain-Confirmed: true to bypass the spell check
curl -X POST https://api.stackai.app/v1/synthetic/generate \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-H "X-Domain-Confirmed: true" \
-d '{ "schema": "instruction_v1", "domain": "pytorch model quantization with QLoRA", ... }'
Polling pattern
import time
import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}
JOB_ID = "syn_job_xxxxxxxxxxxxxxxxx"
BASE = "https://api.stackai.app/v1/synthetic/jobs"

while True:
    job = requests.get(f"{BASE}/{JOB_ID}", headers=headers).json()
    if job["status"] == "succeeded":
        # Request a presigned download URL for the results JSONL
        url = requests.get(f"{BASE}/{JOB_ID}/results", headers=headers,
                           params={"format": "url"}).json()["url"]
        print("Download:", url)
        break
    elif job["status"] == "failed":
        print("Failed:", job.get("error"))
        break
    time.sleep(5)
Authentication
API keys
API keys are used for programmatic access (curl, scripts, CI/CD). Keys have the format sk_... and are shown once on creation.
1. Go to your dashboard → API Keys → Create Key.
2. Copy the key immediately; it won't be shown again.
3. Pass it in every request:
curl https://api.stackai.app/v1/synthetic/jobs \
  -H "Authorization: Bearer sk_live_your_key_here"
Rate limits
60 requests per minute per organization. Generation jobs count as 1 request regardless of record count. Exceeding the limit returns HTTP 429. Add exponential backoff and retry.
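One simple backoff policy is exponential delay with a cap: double the wait after each 429 until you hit a ceiling. A sketch of the delay schedule only (the base and cap values are illustrative; wire this into your HTTP client's retry loop):

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-indexed).

    Doubles each attempt: 1, 2, 4, 8, ... capped at `cap` seconds.
    """
    return min(base * (2 ** attempt), cap)

# In practice, also add random jitter so concurrent clients
# don't retry in lockstep after a shared rate-limit window.
```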
Glossary
Key terms from LLM training and alignment research. Bookmark this for reference while reading papers or planning your training pipeline.
Ready to start generating?
Free tier includes 100 records/month. No credit card required to start.