Generate Datasets from a Prompt: The DeepFabric Claude Code Skill
If you've used DeepFabric before, you know the drill: install the package, write a YAML config, figure out the right depth and degree for your knowledge graph, pick a conversation type, deal with provider keys, maybe set up a Spin server for mock files. It's not complicated once you know what you're doing. But there's a real cost to bootstrapping all of that every time you start a new dataset project.
The DeepFabric Claude Code skill removes that friction entirely. You describe your topic. Claude handles the rest.
What is a Claude Code Skill?
Claude Code skills are structured instruction sets that extend Claude Code's capabilities for specific workflows. When a skill is installed, Claude Code loads its instructions and can follow them step-by-step - asking the user targeted questions, running commands, creating files, and validating outputs along the way.
The DeepFabric generator skill is a purpose-built skill that guides you through the entire dataset creation process: from gathering your requirements, to running deepfabric generate, to producing a ready-to-use fine-tuning script.
Installing the Skill
Copy the skill into your Claude Code skills directory. If you already have it as a local folder, it's a single copy (adjust the source path to wherever the folder lives):

```bash
cp -r deepfabric-generator ~/.claude/skills/deepfabric-generator
```
Or if you're setting up a project-level skill (which keeps it scoped to the current repo):

```bash
mkdir -p .claude/skills
cp -r deepfabric-generator .claude/skills/deepfabric-generator
```
That's it!
The directory structure should look like this:
```
deepfabric-generator/
├── SKILL.md
├── README.md
└── assets/
    ├── config_template.yaml
    └── template_train.py
```
Once installed, the skill is automatically available the next time you start Claude Code.
Before You Start: Export Your API Key
Before launching Claude Code, export your API key in the same terminal session. The skill uses the key to call the frontier model during generation, and it needs to be set in the environment before you start.
For Gemini (recommended - has a free tier):

```bash
export GEMINI_API_KEY="your-key-here"
```
For Anthropic:

```bash
export ANTHROPIC_API_KEY="your-key-here"
```
Then launch Claude Code from that same terminal. If you start Claude Code first and export the key after, it won't be picked up. To avoid having to do this every session, add the export to your shell profile (~/.zshrc or ~/.bashrc).
The Workflow
Tell Claude Code you want to generate a dataset:
"Generate a dataset about building REST APIs in Python"
Claude Code picks up the skill and starts walking you through setup.
Step 1: Understanding the Knowledge Graph
Before asking for your settings, the skill explains how DeepFabric structures data generation using a knowledge graph, so you understand what you're actually configuring. For example, if your topic is "building REST APIs in Python", the explanation looks like:
```
REST APIs in Python
├── Authentication -> JWT, OAuth2, API Keys
├── Request Handling -> Routing, Validation, Middleware
└── Data Layer -> ORM, Migrations, Query Optimization
```
Depth is how many levels deep the graph goes. Degree is how many branches come off each node. Together they determine the number of unique subtopics, and the diversity of your dataset. The skill shows you exactly how many topics each combination produces before you choose.
You pick depth and degree from an interactive menu rather than typing raw numbers.
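The menu's numbers are easy to sanity-check yourself. Assuming every node branches fully, a graph of a given depth and degree yields degree + degree^2 + ... + degree^depth subtopics (DeepFabric's reported count may differ slightly, for instance in whether the root node is included):

```python
def topic_count(depth: int, degree: int) -> int:
    # Sum the nodes at each level below the root: degree + degree^2 + ...
    return sum(degree ** level for level in range(1, depth + 1))

# A few combinations, assuming every node branches fully:
for depth, degree in [(2, 3), (3, 3), (3, 4)]:
    print(f"depth={depth}, degree={degree} -> {topic_count(depth, degree)} subtopics")
```

Note how quickly the count grows with depth: bumping depth from 2 to 3 at degree 3 more than triples the subtopic count, which is why the skill shows you the totals before you commit.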
Step 2: Provider and Customization
The skill then asks which API provider to use (Gemini or Anthropic), and whether you want it to pick the best defaults for your topic or walk you through each option manually.
If you choose defaults, Claude Code analyzes your topic and picks sensible values. For a coding topic like REST APIs, it would choose instruction-following conversations with chain-of-thought reasoning, and recommend including mock files so the dataset includes realistic file operations.
The defaults are explained to you before anything is created, so you can push back if something feels off.
If you choose to customize, you're walked through:
- Conversation type - Q&A, instruction-following, or chat
- Reasoning style - chain-of-thought, direct, or step-by-step
- Number of samples - matched to your topic count or scaled up for richer coverage
- Mock files - whether to include simulated file I/O in training conversations
Step 3: The Generated Config
All of this feeds into a YAML config that DeepFabric can read. Here's an example for a REST API dataset with mock files enabled:
```yaml
# Illustrative sketch - exact keys may vary with your DeepFabric version,
# and the mock-file settings are filled in by the skill.
dataset_system_prompt: "You are a senior Python engineer..."

topic_graph:
  topic_prompt: "Building REST APIs in Python"
  provider: gemini
  model: gemini-2.0-flash
  depth: 3
  degree: 3
  save_as: topics.jsonl

data_engine:
  instructions: "Create instruction-following conversations with chain-of-thought reasoning and realistic file operations."
  provider: gemini
  model: gemini-2.0-flash

dataset:
  creation:
    num_steps: 39
    batch_size: 4
  save_as: dataset.jsonl
```
The config is created in a dedicated folder named after your topic. All output files land there too, keeping things clean.
Step 4: Dataset Generation
With the config ready, the skill runs:
```bash
deepfabric generate <topic>_config.yaml --tui simple
```
The --tui simple flag makes progress visible in Claude Code's terminal. You'll see topics being generated, then samples streaming in as DeepFabric distills conversations from the frontier model. If mock files are enabled, you'll see the file tool calls happening in real time.
Generation typically takes 25-30 minutes for a balanced depth/degree config. Once it finishes, the skill checks that dataset.jsonl exists, is non-empty, and has roughly the expected sample count. If something went wrong mid-run, it investigates before moving on.
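Those post-run checks are easy to reproduce if you ever want to validate a finished dataset yourself. Here's a minimal sketch (the function name and the 10% tolerance are illustrative, not the skill's actual code):

```python
import json
from pathlib import Path


def validate_dataset(path: str, expected_samples: int, tolerance: float = 0.1) -> bool:
    """The file exists, is non-empty, and holds roughly the expected sample count."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return False
    lines = [line for line in p.read_text().splitlines() if line.strip()]
    try:
        samples = [json.loads(line) for line in lines]
    except json.JSONDecodeError:  # a truncated run leaves a broken last line
        return False
    return len(samples) >= expected_samples * (1 - tolerance)
```

Run it as `validate_dataset("dataset.jsonl", expected_samples=39)` after a generation finishes.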
Step 5: Training Script
The last thing the skill creates is a fine-tuning script using TRL's SFTTrainer:
```python
# Abridged sketch - the generated script also includes tokenizer setup and
# hyperparameters matched to the dataset size.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="dataset.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="./checkpoints"),
)
trainer.train()
```
The default model is Qwen/Qwen3-0.6B - small enough to train locally, capable enough to verify the dataset works. The skill asks if you want to swap it out before finishing.
What You End Up With
For each run, you get a self-contained project folder:
| File | Description |
|---|---|
| <topic>_config.yaml | DeepFabric configuration |
| topics.jsonl | Generated knowledge graph |
| dataset.jsonl | Synthetic training dataset |
| <topic>_train.py | Fine-tuning script |
Everything is in one place. You can inspect the topics file to understand what subtopics were generated, examine the dataset to sanity-check sample quality, and run the training script directly once you're happy.
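Both JSONL files are one JSON object per line, so inspecting them takes only a few lines of Python (the file names here follow the skill's defaults):

```python
import json


def load_jsonl(path: str) -> list[dict]:
    # Each non-empty line is one JSON object: a topic node or a training sample.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# samples = load_jsonl("dataset.jsonl")
# print(len(samples), "samples; first one:")
# print(json.dumps(samples[0], indent=2)[:500])
```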
Mock Files: What They Are and When to Use Them
When mock files are enabled, each training sample isn't just a Q&A exchange. It's a multi-step agent conversation where the model reads files, reasons about them, writes code, and explains its decisions. The "files" are seeded into a Spin server that the model can actually call during generation.
This produces richer, more realistic training data for models that will operate as coding assistants or agents. The trade-off is you need a running Spin server:
```bash
# Start the mock file server locally (run from the Spin app's directory)
spin up
```
The skill handles the rest: verifying the server is up, setting the right endpoint in the config, and stopping the server when generation is done (if you ran it locally).
For non-coding topics - recipes, history, math concepts - mock files don't add value and you can skip them entirely.
The Bigger Picture
The skill is opinionated in a good way: it makes the easy path the right path. You don't need to read the DeepFabric docs to get sensible outputs. The defaults are chosen based on your topic, explained before they're applied, and easy to override.
For teams using DeepFabric regularly, the skill means new members can generate their first dataset in a single conversation without needing to understand the full config schema. For solo developers, it means less time on boilerplate and more time evaluating and improving the data.
Resources
- DeepFabric GitHub: https://github.com/always-further/deepfabric
- DeepFabric Docs: https://docs.deepfabric.dev/
- Claude Code Skills: https://docs.anthropic.com/en/docs/claude-code/skills