D-Day Quiz: A From-Scratch LLM Passes a Historical Exam

May 03, 2026 at 12:00 PM CDT · Wayne Workman · 3 min read

Too long, didn't read

I started building my own language model from scratch on September 15, 2025. Today is May 3, 2026, a little over seven months later, and I now have a 597M-parameter custom GPT-architecture language model with a 32,768-token context window and a custom 34K-vocabulary tokenizer, and it is passing short quizzes with 100% accuracy. Below is the video of the light-hearted, fun test.
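The post doesn't give the model's layer count or width, but for intuition, here's a back-of-envelope parameter estimate. The configuration below is a hypothetical shape that happens to land near 597M; the real architecture may differ.

```python
def gpt_param_count(n_layers, d_model, vocab_size, tied_embeddings=True):
    """Rough transformer parameter estimate.

    Each decoder block has ~4*d^2 attention weights plus ~8*d^2 MLP
    weights (4x expansion), so ~12*d^2 per layer, ignoring biases and
    layer norms, plus the token embedding matrix.
    """
    block = 12 * d_model * d_model
    embed = vocab_size * d_model
    if not tied_embeddings:
        embed *= 2  # separate input and output projections
    return n_layers * block + embed

# One plausible configuration (assumed, not the author's actual shape):
total = gpt_param_count(n_layers=28, d_model=1280, vocab_size=34_000)
print(f"{total / 1e6:.0f}M parameters")  # prints "594M parameters"
```

With a 34K vocabulary and tied embeddings, 28 layers at width 1280 gives roughly 594M parameters, in the same ballpark as the model described here.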

It's worth noting that the quiz questions do not appear verbatim in the model's training data. This WWII model was trained on 301 carefully selected Wikipedia articles; all of its knowledge comes from those. The model can pass the quiz because it has successfully generalized the knowledge from those articles by seeing the same facts represented in nearly 200 different ways.
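To make the "same fact, many phrasings" idea concrete, here is a minimal sketch of template-based QA expansion. The fact, templates, and function names are illustrative stand-ins; the author's actual Wikipedia-plus-Qwen pipeline is far larger and not shown here.

```python
# Minimal sketch of "same fact, many phrasings" QA augmentation.
# The fact and templates below are illustrative, not from the real pipeline.
FACT = {"event": "the Normandy landings", "answer": "June 6, 1944"}

TEMPLATES = [
    "When did {event} take place?",
    "On what date did {event} begin?",
    "{event} started on which day?",
    "Give the exact date of {event}.",
]

def expand(fact, templates):
    """Render one fact into several differently structured questions,
    all sharing the same answer, so the model learns the fact itself
    rather than a single surface form of the question."""
    return [(t.format(event=fact["event"]), fact["answer"]) for t in templates]

for question, answer in expand(FACT, TEMPLATES):
    print(question, "->", answer)
```

Scaling this idea up to nearly 200 structurally distinct question types per fact is what lets the model answer phrasings it has never literally seen.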

Take the quiz yourself on the Department of War's website: https://www.defense.gov/Multimedia/Quizzes/Quiz/article/1859208/whats-your-d-day-iq/

This is not a fine-tune

To be upfront: this isn't a fine-tune of an existing open-weights model. I come from an infrastructure-engineering background, and my goal was to learn the entire language-model lifecycle, so I skipped all the easy paths.

I built a data distillation pipeline that used those 301 wiki articles, Qwen, and a heaping mountain of Python to produce 2.91B tokens of private training data: 17,575,507 individual questions and answers. I iterated on dozens of model architectures, varying layer count, width, number of attention heads, and context size, and wrote custom training schedulers driven by loss rather than a cosine schedule. I implemented multi-document context training with diagonal masking, focal-loss gamma for surgical per-token gradient updates, and my own testing harness and monitoring framework, along with a pile of supporting functions that let all of this survive power outages and support exports, rollbacks, and custom monitoring of validation loss and weight distributions.
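Of the pieces above, the diagonal-masking idea is the easiest to sketch: when several documents are packed into one context window, the attention mask is made block-diagonal (combined with the usual causal mask) so tokens never attend across document boundaries. The pure-Python version below is a sketch under that assumption; in a real training loop this would be a tensor operation, and the author's exact implementation isn't shown.

```python
def packed_causal_mask(doc_ids):
    """Build an attention mask for a packed multi-document window.

    doc_ids[i] is the document index of token i. Position (i, j) is
    True when token i may attend to token j: same document (the
    block-diagonal part) AND j <= i (the causal part).
    """
    n = len(doc_ids)
    return [[doc_ids[i] == doc_ids[j] and j <= i for j in range(n)]
            for i in range(n)]

# Two 3-token documents packed into one 6-token window:
mask = packed_causal_mask([0, 0, 0, 1, 1, 1])
assert mask[2][0] is True   # within doc 0, earlier token: allowed
assert mask[3][2] is False  # doc 1 must not attend into doc 0
assert mask[1][2] is False  # causal: no attending to future tokens
```

Without this mask, packing unrelated documents into one window would let gradients flow across document boundaries and teach the model spurious cross-document associations.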


What's next?

This model isn't perfect. There are question types it fails on despite having the underlying knowledge. I have nearly 200 different QA types of diverse composition, complexity, and structure; the model can answer questions structured in ways it has seen, but not in ways it hasn't. Covering those edge cases would take more QA examples, and that could be pursued endlessly. Still, seeing a from-scratch language model output coherent, factually accurate responses to unseen questions is a massive personal milestone. It's time to move on, though. While WWII was pivotal in shaping all the world history that followed, this effort has served its true purpose: teaching me how to do all of this, how to train AI to solve specific problems from nothing. I'm moving on to other projects now; some are private, and others I'll make public like this one.

Please subscribe if you want an email notice when a new blog is posted.
