D-Day Quiz: A From-Scratch LLM Passes a Historical Exam
Too long, didn't read
I started building my own language model from scratch on September 15, 2025. Today is May 3, 2026, a little over seven months later, and I now have a 597M-parameter custom GPT-architecture language model with a 32,768-token context and a custom 34K tokenizer, and it's passing short quizzes with 100% accuracy. Below is the video of the light-hearted, fun test.
It's worth noting that the quiz questions do not appear verbatim in the model's training data. This WWII model was trained on 301 carefully selected Wikipedia articles; all the knowledge it has came from those. The model can answer the quiz because it has successfully generalized the knowledge from those articles, having seen the same facts represented in nearly 200 different ways.
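The actual distillation pipeline used Qwen to generate those rephrasings and isn't shown here, but the core idea can be sketched with plain templates: render one fact into many differently-worded question/answer pairs so the model learns the fact rather than the wording. Everything below (the fact, templates, and function names) is a hypothetical illustration, not the real pipeline.

```python
# Hypothetical sketch: one fact expanded into several QA phrasings.
# The real pipeline used an LLM (Qwen) to paraphrase; these are fixed templates.

FACT = {"subject": "Operation Overlord", "value": "June 6, 1944"}

TEMPLATES = [
    "When did {subject} begin?",
    "On what date did {subject} start?",
    "{subject} commenced on which date?",
    "What was the start date of {subject}?",
]

def expand_to_qa(fact, templates):
    """Render one fact into multiple (question, answer) training pairs."""
    return [(t.format(subject=fact["subject"]), fact["value"]) for t in templates]

pairs = expand_to_qa(FACT, TEMPLATES)
for question, answer in pairs:
    print(f"Q: {question}  A: {answer}")
```

Scale that pattern up to ~200 structurally distinct QA types across 17.5M pairs and you get the kind of redundancy that lets a small model generalize.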
Take the quiz yourself; it's on the Department of War's website: https://www.defense.gov/Multimedia/Quizzes/Quiz/article/1859208/whats-your-d-day-iq/
This is not a fine-tune
Going to be upfront here: this isn't a fine-tune of an existing open-weights model. I come from an infrastructure engineering background, and my goal was to learn the entire language model lifecycle, so I skipped all the easy paths. I built a data distillation pipeline that combined those 301 wiki articles, Qwen, and a heaping mountain of Python to produce 2.91B tokens of private training data: 17,575,507 individual questions and answers in total. I iterated on dozens of model architectures, changing layer count, width, number of attention heads, and context sizes, and wrote custom training schedulers that are loss-based instead of cosine-based. I implemented multi-document context training with diagonal masking, made focal loss (with gamma) the norm for surgical per-token gradient updates, and built my own testing harness and monitoring framework, along with a ton of other supporting functions that let all of this survive power outages and support exports, rollbacks, and custom monitoring of validation loss and weight distributions.
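The post doesn't show the masking code, but a common way to implement "diagonal masking" for multi-document context packing is a block-diagonal document mask intersected with the usual causal mask, so tokens can attend backwards only within their own packed document. A minimal pure-Python sketch of that idea (function name is mine, not from the project):

```python
def packed_causal_mask(doc_ids):
    """Build a (T, T) boolean attention mask for a packed sequence.

    doc_ids gives, for each token position, which packed document the
    token belongs to. mask[i][j] is True iff token i may attend to
    token j: j must not be in the future (causal) AND both tokens must
    come from the same document (block-diagonal).
    """
    T = len(doc_ids)
    return [
        [(j <= i) and (doc_ids[i] == doc_ids[j]) for j in range(T)]
        for i in range(T)
    ]

# Two packed documents: tokens 0-2 are doc 0, tokens 3-4 are doc 1.
mask = packed_causal_mask([0, 0, 0, 1, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Token 3 (the first token of the second document) cannot attend to tokens 0-2, so packing documents together doesn't let facts bleed across article boundaries during training.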
A quick breakdown:
- Ground-up GPT architecture
- Custom 34K Tokenizer
- Zero public datasets
- Synthetic distillation using open-weights models, primarily Qwen
- No standard pre-training: the model was never trained on raw text, only on QA pairs
- Custom tooling for everything
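On the focal loss point above: the idea (from the object-detection literature) is to scale each token's cross-entropy by (1 - p)^gamma, where p is the model's probability on the correct token. Tokens the model already predicts confidently contribute almost no gradient, while hard tokens dominate the update, which is what makes the per-token updates "surgical". A minimal sketch with illustrative values (the project's actual gamma and implementation aren't published):

```python
import math

def focal_token_loss(p_correct, gamma=2.0):
    """Focal loss for a single token.

    Standard cross-entropy -log(p) scaled by (1 - p)^gamma, so
    well-predicted tokens are down-weighted and hard tokens dominate.
    gamma=0 recovers plain cross-entropy.
    """
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)

# A token the model already gets right barely moves the weights...
easy = focal_token_loss(0.95)
# ...while a token it gets badly wrong carries almost the full loss.
hard = focal_token_loss(0.10)
print(f"easy token: {easy:.6f}   hard token: {hard:.3f}")
```

With gamma = 2, the easy token's loss is several orders of magnitude smaller than the hard token's, so gradient budget is spent where the model is actually failing.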
What's next?
This model isn't perfect. There are question types it fails on despite having the underlying knowledge. I have nearly 200 different QA types of diverse composition, complexity, and structure; the model can answer questions structured in ways it has seen, but not in ways it hasn't. It needs more QA examples to cover those edge cases, and that could be pursued endlessly. Still, seeing a from-scratch language model produce coherent, factually accurate responses to unseen questions is a massive personal milestone. It's time to move on to a new endeavor, though. While WWII was pivotal in shaping all the world history that followed, this effort has served its true purpose: teaching me how to do all of this, how to train AI to solve specific problems from nothing. I'm moving on to other projects now; some are private, and others I'll make public like this one.
- Book language models for books in the public domain. First up is Black Beauty; this 148-year-old novel still moves me today, and I feel it's a critical piece of literature for showing all of us why we should treat animals far better.
- Multivariate time-series prediction models: a model that takes the numbers from all SEC EDGAR 10-K and 10-Q filings and predicts the next quarter's numbers for every company. Yes, I'm coming for the stock market. This is already underway. Fun fact: SEC EDGAR filings since 2000 total 3.4TB compressed. That's a lot of data, and early test results are positive.
- Linux diagnostic LLM, though it may be a long time before I get around to this.
Please subscribe if you'd like an email notification when a new post is published.