My very first dataset on Hugging Face

June 19, 2026 at 1:17 PM CDTWayne Workman3 min read

Today I did what I have been saying for a long time, I published my WWII synthetic data on HuggingFace.

wayneworkman2012/ww2-synthetic-corpus

This is my synthetic World War II training corpus. It's over 21 million conversational question:answer pairs and spans 29 subsets and is licensed as CC BY-SA 4.0.

Absorbing the moment

In November I was fumbling through my first tries at building a small language model. And I mean really small, I started off in the 20M to 40M parameter range. Failures raging, models that could barely speak english and got every question wrong. This data set grew out of that. I built a whole pipeline for turning raw text into training data. The sheer volume it eventually amounted to led my open-source-mindedness to want to share it. And so I did. Today on my day off from work, I spent the morning getting the data into the proper format that the Hugging Face community actually uses, and built a dataset card I'm proud of. This really feels like a milestone. And now that it's out there, anyone can use it and build on it.

What was included

All of the knowledge came from 301 carefully selected Wikipedia articles, not from a model. The LLMs I used were all permissively licensed open-weights models and were used to transform the Wikipedia article contents into training data of various shapes and sizes. I used Qwen mostly, but some materials were generated by DeepSeek and Kimi.

The variety is a lot. Lots of WH questions - who, what, when, where, how. descriptive naswers, quiz formats, adversarial wrong entity / wrong date questions, a lot of temporal reasoning, negation - i.e. what did not happen, unanswerable questions, decision rationale questions, relationship conversations, full article rewrites, safety QA pairs. Every row of data is ChatML formatted messages conversation. SO it drops straight into modern fine-tuning. Like this:

from datasets import load_dataset

ds = load_dataset("wayneworkman2012/ww2-synthetic-corpus", "qa_pairs", split="train")
print(ds[0]["messages"])

Everything was done right

I really wanted the first dataset I put out there to be clean to build on. So the provenance was taken pretty seriously. Everything that was released was generated by permissively licensed models. Qwen is Apache 2, DeepSeek is MIT, Kimi is MIT. Early experiments that used a more restrictive model were regenerated from scratch using a permissive model. Source articels are cited where possible on each data row, and the source Wikipedia articles used are cited in the data card.

Public !!

It is public, permissive, and the dataset card walks you through what you need to know.

If you do anything with it, I would really love to hear about it. Hopefully this is the first of many.

← Back to Blog