YAML vs JSON: The Hidden Token Tax That's Costing You Money
So I took an existing project I had and completely overhauled it after Mike (@satwareAG-ironMike) caught a critical mistake in my original implementation. I was pretty-printing the JSON before sending it to the models, which was completely wrong. Mike rightly pointed out that JSON should always be sent compact/minified to AI models, not pretty-printed with all that extra whitespace. When you're sending JSON to a model or requesting JSON output, it needs to be compact to get accurate token counts and fair comparisons. I'm genuinely thankful Mike caught this; it would have completely invalidated the results. After fixing that and overhauling the tool, I decided to settle a different question: which format costs more when feeding data to AI models, JSON or YAML?
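If you're wondering what that fix looks like in practice, here's a minimal sketch (not the tool's actual code) of the difference between the two serializations:

```python
import json

# Hypothetical GitHub-style event, just to show the formatting difference
event = {"type": "PushEvent", "repo": {"name": "wayneworkman/example"}, "payload": {"size": 3}}

# Pretty-printed: indentation and newlines inflate the token count
pretty = json.dumps(event, indent=2)

# Compact: no spaces after separators -- this is what should go to the model
compact = json.dumps(event, separators=(",", ":"))

print(len(pretty), len(compact))  # compact is noticeably shorter for identical data
```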
Spoiler alert: YAML is costing you 6-10% more in token usage. That's real money if you're processing millions of requests.
Why This Matters More Than You Think
Look, when you're paying per token for AI model usage, every character counts. AWS Bedrock, OpenAI, Anthropic: they all charge by the token. And what most people don't realize is that YAML's human-friendly formatting comes with a hidden cost. It uses more tokens than JSON for the exact same data, which means you're literally paying extra for those nice indentations and lack of brackets.
After the corrections, I overhauled the tool to check token-in counts specifically for AWS Bedrock and added some light comprehension testing to see how models handle the two different data formats. The tool analyzes GitHub user events (90 days worth) and processes them through various AI models in both JSON and YAML formats. The results were... well, let's just say they were eye-opening.
You can find the whole project here: github.com/wayneworkman/yaml_json_tokenization_analyzer
What The Tool Actually Does
The analyzer pulls your public GitHub events from the last 90 days and then does a bunch of interesting things:
- Converts the data between JSON and YAML formats
- Counts tokens using either AWS Bedrock's Converse API or tiktoken (there's a sketch of this step right after the list)
- Tests multiple AI models on their ability to answer questions about the data
- Compares accuracy between formats (this is where things get weird)
- Shows token counts and percentage differences between JSON and YAML
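To give a feel for the counting step, here's a minimal sketch using tiktoken on some made-up GitHub-style events. It's not the tool's actual code, and Bedrock models use their own tokenizers, so treat the numbers as illustrative:

```python
import json

import tiktoken
import yaml

# Made-up events standing in for the 90 days of public GitHub activity the tool pulls
events = [
    {"type": "PushEvent", "repo": {"name": "wayneworkman/example"}, "payload": {"size": 3}},
    {"type": "WatchEvent", "repo": {"name": "wayneworkman/other"}, "payload": {}},
]

# Same data, two serializations: compact JSON vs. default YAML
as_json = json.dumps(events, separators=(",", ":"))
as_yaml = yaml.safe_dump(events)

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer; Bedrock models count differently
json_tokens = len(enc.encode(as_json))
yaml_tokens = len(enc.encode(as_yaml))

print(f"JSON: {json_tokens} tokens, YAML: {yaml_tokens} tokens "
      f"({(yaml_tokens - json_tokens) / json_tokens:+.1%})")
```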
Oh and speaking of the Converse API, if you haven't switched to it yet for AWS Bedrock, you're making your life harder than it needs to be. I've blogged about this before but seriously, model swapping becomes trivial with the Converse API.
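Here's roughly what that path looks like. This is a hedged sketch rather than the analyzer's code, and the model IDs are illustrative (some regions require inference-profile prefixes):

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def count_input_tokens(model_id: str, text: str) -> int:
    """Send text through the Converse API and read back the reported input token count."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": text}]}],
        inferenceConfig={"maxTokens": 1},  # we only care about usage, so keep output minimal
    )
    return response["usage"]["inputTokens"]

# Swapping models is just a different modelId string; the request shape stays the same
for model_id in ["amazon.nova-lite-v1:0", "anthropic.claude-sonnet-4-20250514-v1:0"]:
    print(model_id, count_input_tokens(model_id, '{"type":"PushEvent"}'))
```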
The Surprising Results
Running this against my own GitHub activity revealed some fascinating patterns. First, the token counts tell an interesting story: YAML consistently uses 6-10% more tokens than JSON for identical data. If you're processing thousands or millions of API calls through AI models, that token tax adds up fast, and I mean really fast at enterprise scale.
Token Count Comparison Across Models
| Model | Raw JSON | Raw YAML | Reduced JSON | Reduced YAML | JSON vs YAML (Raw) | JSON vs YAML (Reduced) |
|---|---|---|---|---|---|---|
| nova-lite | 43,250 | 45,915 | 3,540 | 3,834 | +6.2% | +8.3% |
| nova-micro | 43,762 | 46,427 | 3,540 | 3,834 | +6.1% | +8.3% |
| nova-premier | 43,336 | 46,679 | 3,567 | 3,943 | +7.7% | +10.5% |
| nova-pro | 43,250 | 45,915 | 3,540 | 3,834 | +6.2% | +8.3% |
| claude-sonnet-4 | 37,911 | 40,385 | 3,579 | 3,815 | +6.5% | +6.6% |
| claude-opus-4.1 | 37,911 | 40,385 | 3,579 | 3,815 | +6.5% | +6.6% |
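(The "Reduced" columns are the same events after the metadata-stripping I get into below.) To make the percentages concrete, here's a back-of-the-envelope sketch; the per-token price and request volume are placeholders, so swap in your model's real numbers:

```python
# Placeholder numbers -- substitute your model's actual input-token price and your volume
price_per_1k_input_tokens = 0.0008   # USD, hypothetical
requests_per_month = 5_000_000
json_tokens_per_request = 3_540      # "Reduced JSON" row from the table above
yaml_tokens_per_request = 3_834      # "Reduced YAML" row from the table above

def monthly_cost(tokens_per_request: int) -> float:
    """Input-token cost per month at the assumed price and volume."""
    return tokens_per_request / 1000 * price_per_1k_input_tokens * requests_per_month

json_cost = monthly_cost(json_tokens_per_request)
yaml_cost = monthly_cost(yaml_tokens_per_request)
print(f"JSON: ${json_cost:,.0f}/mo  YAML: ${yaml_cost:,.0f}/mo  "
      f"difference: ${yaml_cost - json_cost:,.0f}/mo")
```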
But here's where it gets really interesting. Model accuracy varies wildly between formats. Some models actually perform better with YAML despite the higher token count. Nova models in particular showed this weird preference. Meanwhile, Claude models generally performed better with JSON.
And get this: Claude Sonnet 4 outperformed Claude Opus 4.1 in both formats. Sonnet 4 scored 93.3% with JSON and 76.7% with YAML, while Opus 4.1 only managed 73.3% with JSON and 66.7% with YAML. The cheaper model beat the more expensive one. Makes you wonder what you're really paying for sometimes, doesn't it?
Comprehension Test Results
| Model | JSON Parse | JSON Accuracy | YAML Parse | YAML Accuracy | JSON Tokens (In/Out) | YAML Tokens (In/Out) |
|---|---|---|---|---|---|---|
| nova-lite | ✓ | 40.0% | ✓ | 56.7% | 3,779/102 | 4,074/135 |
| nova-micro | ✓ | 43.3% | ✓ | 43.3% | 3,779/125 | 4,074/113 |
| nova-premier | ✓ | 63.3% | ✓ | 63.3% | 3,818/111 | 4,197/123 |
| nova-pro | ✓ | 56.7% | ✓ | 80.0% | 3,779/126 | 4,074/113 |
| claude-sonnet-4 | ✓ | 93.3% | ✓ | 76.7% | 3,830/136 | 4,068/115 |
| claude-opus-4.1 | ✓ | 73.3% | ✓ | 66.7% | 3,830/136 | 4,068/115 |
The 80% Token Reduction Opportunity
Something interesting I noticed while analyzing the data: by stripping out unnecessary GitHub metadata (stuff like URLs, IDs, and fields you'll never use), you could reduce your token count by up to 80%. That's not a typo. EIGHTY PERCENT.
Think about what that means for your AWS bill. Or your OpenAI bill. Whatever you're using. You could be processing 5x more data for the same cost, or saving 80% on your current AI processing costs.
The token count comparisons really drive this home. Most of the time it's metadata you don't even need for your use case. URLs alone can account for 30-40% of your token usage in GitHub event data.
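The stripping itself is only a few lines. Here's a minimal sketch; the fields I keep are my guess at what a typical use case needs, not the tool's exact reduction logic:

```python
# Hypothetical raw event, already heavily trimmed; real GitHub events carry far more metadata
raw_event = {
    "id": "34567890123",
    "type": "PushEvent",
    "actor": {"id": 1, "login": "wayneworkman", "avatar_url": "https://avatars.githubusercontent.com/u/1"},
    "repo": {"id": 2, "name": "wayneworkman/example", "url": "https://api.github.com/repos/wayneworkman/example"},
    "payload": {"push_id": 3, "commits": [{"sha": "abc123", "url": "https://api.github.com/..."}]},
    "created_at": "2025-01-01T00:00:00Z",
}

def reduce_event(event: dict) -> dict:
    """Keep only the fields the questions need; drop IDs, URLs, avatars, and other metadata."""
    reduced = {
        "type": event.get("type"),
        "repo": event.get("repo", {}).get("name"),
        "created_at": event.get("created_at"),
    }
    if event.get("type") == "PushEvent":
        # Keep just the commit count instead of the full commit objects
        reduced["commit_count"] = len(event.get("payload", {}).get("commits", []))
    return reduced

print(reduce_event(raw_event))
```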
Model Performance: The Accuracy Problem
Here's something that genuinely surprised me. These AI models, as smart as they are, struggle with seemingly simple counting tasks. I asked them five basic questions about the GitHub event data (the sketch after the list shows how trivially ordinary code can compute the ground-truth answers):
- How many total events are in this dataset?
- What is the most common event type and how many times does it occur?
- How many PushEvents contain more than 2 commits?
- How many different event types are there in total?
- List all event types and their counts in descending order by count
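For reference, computing the ground-truth answers takes only a few lines of plain Python. This sketch assumes the reduced event shape from the earlier example, with a `commit_count` field on PushEvents:

```python
from collections import Counter

def ground_truth(events: list[dict]) -> dict:
    """Answer the five questions directly from the event list."""
    counts = Counter(e["type"] for e in events)
    return {
        "total_events": len(events),                                   # Q1
        "most_common_event_type": counts.most_common(1)[0],            # Q2: (type, count)
        "pushes_over_2_commits": sum(
            1 for e in events
            if e["type"] == "PushEvent" and e.get("commit_count", 0) > 2
        ),                                                             # Q3
        "distinct_event_types": len(counts),                           # Q4
        "distribution": counts.most_common(),                          # Q5: descending by count
    }
```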
You'd think these would be gimmes for AI models that can write code and explain quantum physics. But no. Accuracy varied wildly. Nova Pro got 80% right with YAML but only 56.7% with JSON. Claude Sonnet 4 hit 93.3% accuracy with JSON but dropped to 76.7% with YAML.
Question 1, just counting total events? Every single model failed it except Claude Sonnet 4. That's right, Nova Lite, Nova Micro, Nova Premier, Nova Pro, even Claude Opus 4.1, they all got the simple "count the total events" question wrong. Only Sonnet 4 managed to count correctly, and it got it right in both JSON and YAML formats.
Question 3 about PushEvents with more than 2 commits? Also brutal. Most models completely whiffed on this one. And question 5 asking for a sorted list? The scores ranged from 1.0 to 1.8 out of 2, meaning almost nobody got it completely right.
Detailed Accuracy Breakdown
| Model | Format | Q1: Total | Q2: Event Type | Q3: Push >2 | Q4: Event Types | Q5: Distribution (0-2) | Overall |
|---|---|---|---|---|---|---|---|
| nova-lite | JSON | ✗ | ✓ | ✗ | ✗ | 1.4 | 40.0% |
| nova-lite | YAML | ✗ | ✓ | ✗ | ✓ | 1.4 | 56.7% |
| nova-micro | JSON | ✗ | ✓ | ✗ | ✗ | 1.6 | 43.3% |
| nova-micro | YAML | ✗ | ✓ | ✗ | ✗ | 1.6 | 43.3% |
| nova-premier | JSON | ✗ | ✓ | ✓ | ✗ | 1.8 | 63.3% |
| nova-premier | YAML | ✗ | ✓ | ✗ | ✓ | 1.8 | 63.3% |
| nova-pro | JSON | ✗ | ✓ | ✗ | ✓ | 1.4 | 56.7% |
| nova-pro | YAML | ✗ | ✓ | ✓ | ✓ | 1.8 | 80.0% |
| claude-sonnet-4 | JSON | ✓ | ✓ | ✓ | ✓ | 1.6 | 93.3% |
| claude-sonnet-4 | YAML | ✓ | ✓ | ✗ | ✓ | 1.6 | 76.7% |
| claude-opus-4.1 | JSON | ✗ | ✓ | ✓ | ✓ | 1.4 | 73.3% |
| claude-opus-4.1 | YAML | ✗ | ✓ | ✓ | ✓ | 1.0 | 66.7% |
Real-World Impact
So what's the takeaway for engineering teams and, more importantly, for leadership making technology decisions?
First, if you're using YAML for AI model inputs, you're paying a premium. Maybe that's worth it for the readability, maybe not. But at least now you know the cost.
Second, model selection matters more than most people realize. That expensive model might not be giving you better results. The tool I built lets you test this with your own data.
Third, and this is huge for cost optimization, look at your data preprocessing. Stripping unnecessary fields before sending to AI models is basically free money. It's the easiest optimization you'll ever make.
Try It Yourself
The tool is open source and pretty easy to run. Just clone the repo, set up your AWS credentials (if using Bedrock), and point it at any GitHub username:
```bash
python analyzer.py wayneworkman
```
It'll generate detailed ASCII tables showing token counts, accuracy scores, and performance breakdowns. Perfect for dropping into a Slack channel when someone asks why the AI bill is so high.
You can even use it with OpenAI's tiktoken if you don't have AWS Bedrock access. Though honestly, if you're not using Bedrock's Converse API yet, you're missing out on the easiest model swapping experience available.
The Bottom Line
Data format matters more than we thought. A 6-10% token tax might not sound like much, but when you're processing millions of requests, that's real money. And the accuracy differences between formats? That could mean the difference between useful AI features and frustrated users.
The good news is these are solvable problems. Test your models with both formats. Strip unnecessary data. Use the Converse API for easy model swapping. Small optimizations that add up to big savings.
Check out the project and run it against your own GitHub data. I'd be curious to hear if you see similar patterns. And if you find any other models that actually nail that simple total-event-count question, definitely let me know, because that one's been driving me crazy.
Project link: github.com/wayneworkman/yaml_json_tokenization_analyzer