Claude Sonnet 4.5 on AWS Bedrock: The New Default for CloudWatch Alarm Triage
Yesterday, September 29th, Anthropic released Claude Sonnet 4.5. The same day, it showed up on AWS Bedrock as a foundation model. Not a week or a month later, like these cloud provider releases used to be, but the same day. That gives everyone with AWS access instant availability to what's probably the world's most advanced coding agent, at Sonnet pricing, which is far cheaper than Opus while actually performing better.
Naturally, I wanted to update my terraform-aws-module-cloudwatch-alarm-triage module to use it. In v0.2.1 the default model ID pointed at the US cross-region inference profile for Claude Opus 4.1, which until now was the best code-generating model out there. Sonnet 4.5 changes all that, and I couldn't wait to test it.
Quick refresher on how the triage module works
The module uses all-serverless AWS components to automatically investigate CloudWatch alarms with a Bedrock model. When a CloudWatch alarm goes into the ALARM state, its alarm actions can send the event to various destinations, Lambda being one of the supported ones.
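To make that concrete, here's a minimal sketch of the wiring, written with boto3 rather than my actual Terraform, and with placeholder names and a made-up account ID. The alarm's ALARM action points straight at the module's orchestrator Lambda (described next):

```
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: fire when the ErrorKeywordCount metric goes above 0,
# and invoke the triage orchestrator Lambda directly as the alarm action.
cloudwatch.put_metric_alarm(
    AlarmName="contact-api-error-keyword",
    Namespace="ContactApi",
    MetricName="ErrorKeywordCount",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[
        "arn:aws:lambda:us-east-1:123456789012:function:triage-orchestrator",
    ],
)
```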
The module has two Lambdas: an orchestrator, which receives the alarm event from CloudWatch, and a tool Lambda that safely executes Python code under strict read-only IAM permissions (no access to secrets, S3 objects, or anything else sensitive). The orchestrator invokes a Bedrock model via the Converse API, instructing it to investigate the alarm by writing Python and requesting execution through the Python tool.
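Here's a rough idea of what a tool Lambda like that can look like. This isn't the module's actual code, just a sketch of the concept: take a string of Python from the model, run it under the function's read-only execution role, and return whatever it printed.

```
import contextlib
import io
import traceback

def handler(event, context):
    # Sketch of a tool Lambda: execute model-supplied Python and return its output.
    # Safety comes from the function's read-only IAM role (no secrets, no S3 data),
    # not from anything clever inside the handler.
    code = event.get("code", "")
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, {"__name__": "__tool__"})
        return {"status": "ok", "output": stdout.getvalue()}
    except Exception:
        return {"status": "error", "output": stdout.getvalue() + traceback.format_exc()}
```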
Back and forth this goes until the model figures out the underlying issue and, typically, provides recommendations on how to fix it. It works in any AWS account with no knowledge base required, because the transparency of the AWS APIs lets the model simply follow the trail. You can configure the IAM permissions to be as restrictive or as permissive as you're comfortable with, keeping in mind that fewer permissions mean less investigation capability.
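The loop itself is simple in shape. Below is a simplified sketch of an orchestrator loop against the bedrock-runtime Converse API; it's an illustration rather than the module's real orchestrator, and the model ID, tool name, and Lambda name are examples.

```
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
lambda_client = boto3.client("lambda")

# Illustrative values; check Bedrock for the exact Sonnet 4.5 inference profile ID.
MODEL_ID = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"
TOOL_LAMBDA = "triage-python-tool"

TOOL_CONFIG = {
    "tools": [{
        "toolSpec": {
            "name": "execute_python",
            "description": "Run read-only Python (boto3 available) and return its stdout.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {"code": {"type": "string"}},
                "required": ["code"],
            }},
        },
    }],
}

def investigate(alarm_event, max_turns=15):
    messages = [{"role": "user", "content": [{"text": json.dumps(alarm_event)}]}]
    for _ in range(max_turns):
        response = bedrock.converse(
            modelId=MODEL_ID,
            system=[{"text": "Investigate this CloudWatch alarm using the execute_python tool."}],
            messages=messages,
            toolConfig=TOOL_CONFIG,
        )
        message = response["output"]["message"]
        messages.append(message)
        if response["stopReason"] != "tool_use":
            # The model is done; its final text is the triage report.
            return "".join(block.get("text", "") for block in message["content"])
        # Execute every requested tool call and feed the results back.
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                tool_use = block["toolUse"]
                payload = lambda_client.invoke(
                    FunctionName=TOOL_LAMBDA,
                    Payload=json.dumps({"code": tool_use["input"]["code"]}),
                )["Payload"].read().decode()
                results.append({"toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"text": payload}],
                }})
        messages.append({"role": "user", "content": results})
    return "Investigation hit the turn limit before reaching a conclusion."
```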
For example, a log error might lead the model to check a specific Lambda, then look at that Lambda's logs, find another error about permissions, check the Lambda's IAM role and see it's missing a permission, then look at CloudTrail to determine when that permission was removed, by whom, and through what method, and finally conclude something like: "The xyz Lambda's IAM role was changed at 14:32:18 today during a deployment. It lost the necessary abc permission and is now failing. You can correct it by updating the role to include the abc permission again, and you may want to check your IaC to see why the permission was removed." That's a generalized example, but it's genuinely how this works: it investigates like a person would, only much faster, and the report gets sent to an SNS topic which forwards it to whatever email addresses or distribution lists you have subscribed.
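That CloudTrail step, for instance, is only a few lines of boto3, which is exactly the sort of snippet the model writes and asks the tool Lambda to run (the role name here is a stand-in):

```
import boto3

cloudtrail = boto3.client("cloudtrail")

# Hypothetical: list recent API calls that touched a given IAM role to see
# when a permission changed and who changed it.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "ResourceName", "AttributeValue": "xyz-lambda-role"}],
    MaxResults=20,
)
for event in events["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "unknown"))
```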
The result is that when an engineer gets paged at 3 AM on a Saturday, they open their laptop and there's already a triage report sitting in their inbox, giving them a solid indication of what's going on from a neutral, outside viewpoint. They immediately know what to look into or what they might do to fix the issue, which has a huge impact on Mean Time To Recovery.
Testing Sonnet 4.5 by breaking my blog
So I changed the module's default model to the global cross-region inference profile for Sonnet 4.5, deployed it to my personal AWS account (which is where this blog runs), and then went and purposely broke my blog's Contact API. Yes, it was fun.
What I did was change the IAM role on my Contact API Lambda. It needs permission to read my Google reCAPTCHA v2 secret from AWS Secrets Manager, so in the IAM policy I changed secretsmanager:GetSecretValue to secretsmanager:GrabSecretValue, a completely made-up action that doesn't exist.
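For context, the secret retrieval inside the Contact API Lambda is roughly this shape (a sketch, not my actual handler, and the secret name is a placeholder). With the action renamed, get_secret_value fails with an AccessDeniedException, and the logged "Error retrieving secret" message is what the metric filter picks up:

```
import boto3
from botocore.exceptions import ClientError

secrets = boto3.client("secretsmanager")

def get_recaptcha_secret():
    try:
        # Raises AccessDeniedException if the execution role lacks
        # secretsmanager:GetSecretValue on this secret.
        return secrets.get_secret_value(SecretId="contact-api/recaptcha")["SecretString"]
    except ClientError as err:
        print(f"Error retrieving secret: {err}")
        raise
```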
Then I tried my contact form. As expected, when I clicked submit, I got a generic error message that didn't reveal much, just that the request couldn't be completed. In the background, though, a CloudWatch alarm triggered: a Log Group Metric Filter I have configured caught the error string, published a metric to CloudWatch Metrics, and that tripped the alarm watching the metric (one of dozens of alarms watching other things).
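My setup is Terraform, but the metric filter piece amounts to something like this (sketched with boto3, names matching the placeholder alarm earlier):

```
import boto3

logs = boto3.client("logs")

# Hypothetical metric filter: any log line containing "Error" increments
# the ErrorKeywordCount metric the alarm watches.
logs.put_metric_filter(
    logGroupName="/aws/lambda/contact-api",
    filterName="error-keyword",
    filterPattern='"Error"',
    metricTransformations=[{
        "metricName": "ErrorKeywordCount",
        "metricNamespace": "ContactApi",
        "metricValue": "1",
        "defaultValue": 0.0,
    }],
)
```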
When the alarm triggered, it invoked the triage orchestrator Lambda, the investigation kicked off, and about 14 minutes later I got an email with a complete root cause analysis and recommendations.
The verdict on Sonnet 4.5
Claude Sonnet 4.5 clearly triages these issues as well as or better than Opus 4.1 did, at a fraction of the cost, and it runs faster too. All amazing things. The investigation was thorough, the root cause analysis was spot-on, and the recommendations were actionable. This is exactly the kind of capability that can transform how teams handle production incidents.
Instead of waking up to just an alarm that says something broke, you wake up to an alarm plus a detailed investigation report that tells you exactly what broke, why it broke, when it broke, and how to fix it, all before you've even had your first cup of coffee at 3 AM after getting paged. The time savings and reduced stress alone make this valuable, but the real business impact is how much faster you can restore service when issues occur.
Claude Sonnet 4.5 gets two thumbs up from me. I'm pleased to release version v0.2.2 of the terraform-aws-module-cloudwatch-alarm-triage project, which now defaults to Anthropic's Claude Sonnet 4.5 available through AWS Bedrock. If you haven't tried automated alarm triage yet, now's a great time to start: the combination of Bedrock's accessibility and Sonnet 4.5's capabilities makes it easier and more effective than ever.
The full triage report
One thing I noticed about Sonnet 4.5 is that it's extremely verbose. I mean really verbose, probably too verbose for my taste honestly, but I suspect this is intentional, to combat the so-called model "laziness" people complain about. When the model is told to investigate something, it goes all out on the details, the recommendations, and the prevention measures. Below is the actual email I received from the triage, with sensitive values redacted. Fair warning: it's long.
CloudWatch Alarm Investigation Results
======================================
Alarm: [ALARM_NAME]
State: ALARM
Model: [MODEL_ID]
Model Calls: 6
Tool Calls: 5
Region: [AWS_REGION]
Account: [ACCOUNT_ID]
Console: [CONSOLE_URL]
Investigation Files:
• Report: [S3_URI_REPORT]
• Full Context: [S3_URI_CONTEXT]
--------------------------------------
Perfect! Now I have the complete picture with all the details. Let me provide the final comprehensive analysis.
### 🚨 EXECUTIVE SUMMARY
The CloudWatch alarm "[ALARM_NAME]" triggered at 04:50:37 UTC on October 1, 2025, due to an **IAM permission error** in the [LAMBDA_FUNCTION] Lambda function. The function attempted to retrieve email configuration from AWS Secrets Manager (secret: `[SECRET_NAME]`) but was denied access because the IAM role `[IAM_ROLE]` lacks the `secretsmanager:GetSecretValue` permission. This is a **CRITICAL configuration issue** causing 100% failure of all contact form submissions. **IMMEDIATE ACTION REQUIRED**: Add Secrets Manager permission to the Lambda execution role within the next 5 minutes to restore customer contact functionality.
### 🔍 INVESTIGATION DETAILS
#### Commands Executed:
- Retrieved CloudWatch alarm configuration and state
- Located metric filter: `[METRIC_FILTER]` on log group `[LOG_GROUP]`
- Extracted error event from log stream at 04:49:35.279 UTC
- Analyzed complete Lambda invocation context (START, ERROR, END, REPORT)
- Retrieved Lambda function configuration ([LAMBDA_FUNCTION])
- Examined IAM role policies for `[IAM_ROLE]`
- Verified Secrets Manager secret exists: `[SECRET_NAME]`
- Reviewed CloudTrail for recent IAM/Secrets Manager changes
- Analyzed Lambda metrics for invocation patterns
#### Key Findings:
- **Root Cause**: IAM AccessDeniedException - missing `secretsmanager:GetSecretValue` permission
- **Error Time**: 04:49:35.279 UTC (242ms after function start)
- **Request ID**: [REQUEST_ID]
- **Duration**: 281.40ms (function failed quickly)
- **Memory Usage**: 88 MB / 128 MB allocated
- **IAM Role**: `[IAM_ROLE]`
- **Missing Permission**: `secretsmanager:GetSecretValue`
- **Target Secret**: `[SECRET_ARN]`
- **Secret Status**: Exists and is properly configured
- **Impact**: Complete failure - function cannot retrieve email configuration needed to send contact form emails
### 📊 ROOT CAUSE ANALYSIS
**Detailed Timeline:**
```
04:49:35.037 UTC - Lambda function invoked (contact form submission)
04:49:35.279 UTC - Error: AccessDeniedException from Secrets Manager (242ms into execution)
04:49:35.319 UTC - Function completed with error (total duration: 281ms)
04:50:37.782 UTC - CloudWatch alarm triggered (1 minute 2 seconds after error)
```
**What Happened:**
1. **User Action**: A user submitted a contact form on the website
2. **Lambda Invocation**: The [LAMBDA_FUNCTION] function was triggered via API Gateway or direct invocation
3. **Secret Retrieval Attempt**: Function code attempted to retrieve email configuration from Secrets Manager secret `[SECRET_NAME]`
4. **Permission Denied**: AWS Secrets Manager rejected the request with AccessDeniedException
5. **Error Logged**: Function logged the error message with "Error retrieving secret" prefix
6. **Function Terminated**: Function ended without sending email (complete failure)
7. **Metric Filter Triggered**: CloudWatch Logs metric filter detected "Error" keyword
8. **Alarm Activated**: ErrorKeywordCount metric exceeded threshold (0.0), triggering alarm
**Root Cause:**
The Lambda function's IAM execution role (`[IAM_ROLE]`) is missing the required Secrets Manager permission. The function code is designed to securely retrieve email configuration (SMTP credentials, sender/recipient addresses, API keys) from AWS Secrets Manager, but the IAM role was never granted—or had removed—the `secretsmanager:GetSecretValue` action for the specific secret.
[... rest continues for 892 lines total ...]
Yeah. It's thorough, alright. The report correctly identified that I had broken the IAM permission on purpose, traced through all the logs and metrics, provided step-by-step fix instructions, and even included prevention measures and monitoring recommendations. Honestly, it's probably more thorough than most engineers would be at 3 AM, which is kind of the whole point.
One thing I do need to address, though, is that the Lambda execution took 14 minutes to complete, dangerously close to the 15-minute maximum runtime for AWS Lambda. The verbosity I mentioned earlier comes at a cost in execution time, and all those back-and-forth tool calls add up. I'll need to work on optimizing this, maybe by being more explicit about brevity in the system prompt, or by streaming partial results instead of waiting for the full investigation to complete. Small steps, but it's something that needs addressing before this hits a timeout in production.
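One stopgap I'm considering (sketched here as an idea, not something in the module yet) is to watch the orchestrator's remaining execution time and ask the model to wrap up before Lambda kills the function:

```
# Sketch: cut the investigation short before the orchestrator Lambda times out.
# `lambda_context` is the standard context object Lambda passes to the handler;
# get_remaining_time_in_millis() is part of that object.

WRAP_UP_BUFFER_MS = 2 * 60 * 1000  # ask for the final report with ~2 minutes left

WRAP_UP_PROMPT = (
    "Time is nearly up. Stop investigating and write your final report "
    "with what you have found so far."
)

def should_wrap_up(lambda_context):
    return lambda_context.get_remaining_time_in_millis() < WRAP_UP_BUFFER_MS
```

The investigation loop would check should_wrap_up() before each Converse call and, once it returns true, send WRAP_UP_PROMPT as the next user message instead of more tool results.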
Interestingly, Sonnet 4.5's conversational style also required some code changes to the module itself. Unlike Opus 4.1, which would just call tools directly, Sonnet 4.5 tends to add conversational text before making tool calls, saying things like "Let me investigate that" before actually requesting a tool execution. So I had to update the tool call detection logic to search the entire response instead of just the first part of it. A small adjustment, but necessary to work with Sonnet 4.5's chattier nature.
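In Converse API terms, the model's reply is a list of content blocks, and with Sonnet 4.5 a text block often precedes the toolUse block. The fix amounts to scanning all of the blocks rather than assuming the first one is the tool call; a simplified version of that check (not the module's exact code) looks like this:

```
def extract_tool_uses(message):
    # Return every toolUse block in a Converse API response message.
    # Sonnet 4.5 often emits a text block ("Let me investigate that...")
    # before the toolUse block, so scan the whole content list instead of
    # only the first block.
    return [block["toolUse"] for block in message.get("content", []) if "toolUse" in block]
```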