TL;DR — Dead-letter queues already have everything an AI agent needs: structured errors, message context, retry infrastructure. I gave Claude access to Seq and GitHub, and it triaged 47 dead-lettered messages into 3 root causes and opened a PR. The messaging world solved agentic failure handling decades ago.
Everyone building agentic systems right now is reinventing dead-letter queues. They just don’t know it yet.
The pattern is always the same: an AI agent tries to do something, fails, and the failure needs to go somewhere it can be inspected, retried with different parameters, or escalated to a human. Congratulations, you just described a dead-letter queue. We solved this in the early 2000s, and some of us were actually there.
The pattern
In a message-based system, when a message can’t be processed after repeated attempts, it moves to a dead-letter queue (DLQ). From there, you can:
- Inspect the failure to figure out what the hell went wrong
- Retry by sending it back for reprocessing, ideally after fixing whatever broke
- Redirect it to a handler that actually knows how to deal with the edge case
- Escalate to a human, because sometimes that’s just the only option
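The whole lifecycle fits in a page of code. Here's a minimal in-memory sketch (the names and the retry limit are mine, not any particular broker's):

```python
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # poison message limit; real brokers make this configurable


@dataclass
class Message:
    body: dict
    attempts: int = 0
    error: str = ""


@dataclass
class Queue:
    main: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)

    def process(self, msg: Message, handler) -> None:
        """Try the handler; on failure retry until the limit, then dead-letter."""
        while msg.attempts < MAX_ATTEMPTS:
            msg.attempts += 1
            try:
                handler(msg.body)
                return
            except Exception as exc:
                msg.error = f"{type(exc).__name__}: {exc}"
        # Attempts exhausted: give the failure a destination instead of dropping it.
        self.dead_letter.append(msg)

    def retry_dead_letters(self, handler) -> None:
        """Send dead-lettered messages back for reprocessing, e.g. after a fix."""
        retryable, self.dead_letter = self.dead_letter, []
        for msg in retryable:
            msg.attempts = 0
            self.process(msg, handler)
```

A handler that chokes on a malformed body dead-letters the message with the exception attached; once you ship a fix, `retry_dead_letters` replays everything that was parked.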
This is exactly what every serious agentic framework ends up building. An agent calls a tool, the tool blows up, the agent retries, and eventually the failure ends up in some holding area: a retry buffer, a human-in-the-loop queue, an exception log. Call it whatever sounds good in your architecture diagram. It’s a DLQ.
Why messaging got it right first
Messaging systems had to deal with unreliable operations long before LLMs existed. Networks drop, services crash mid-transaction, someone deploys a schema change on a Friday afternoon. The insight was simple: failure is not exceptional, it’s a normal part of the system, so treat it like one.
Dead-letter queues make this explicit by giving failures a destination, a lifecycle, and visibility instead of burying them. Compare that to most agentic systems today, where a failed tool call either disappears into a log file nobody reads or triggers an infinite retry loop that burns through your API budget until someone gets a billing alert.
What agentic systems can learn
If you’re building agents that interact with the real world (calling APIs, writing files, sending emails), stop inventing your own failure handling and just steal from messaging systems:
- Give failures a destination. Don’t just `log.Error()` and move on. Route failed operations to a place where they can actually be acted on, because a log line is where failures go to die.
- Preserve context. A dead-lettered message carries its headers, body, and exception details, so your failed agent actions should carry the full prompt, tool call parameters, and error response. If you can’t reproduce the failure from the data you saved, you didn’t save enough.
- Make retry trivial. It should take zero effort to pick up a failed action and re-execute it. If your retry strategy involves someone copy-pasting from Kibana into a terminal, you don’t have a retry strategy.
- Set poison message limits. After n failures, stop and escalate instead of letting an agent hammer a broken endpoint until your API bill looks like a mortgage payment.
- Monitor the queue depth. A growing dead-letter queue is a smoke detector, because something systemic is going wrong, whether that’s a changed API, a broken prompt template, or a permissions issue.
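Here's a sketch of a failure record that follows those rules. The field names and thresholds are illustrative, not from any framework:

```python
import json
import time
from dataclasses import asdict, dataclass, field

POISON_LIMIT = 5        # stop hammering a broken endpoint after this many failures
QUEUE_DEPTH_ALARM = 50  # a growing DLQ is a smoke detector


@dataclass
class FailedAction:
    # Enough context to reproduce the failure without the original process.
    prompt: str
    tool: str
    arguments: dict
    error: str
    attempts: int
    failed_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def route_failure(action: FailedAction, dlq: list, escalations: list) -> None:
    """Give the failure a destination: DLQ if retryable, escalation past the limit."""
    if action.attempts >= POISON_LIMIT:
        escalations.append(action)  # a human (or a smarter agent) takes over
    else:
        dlq.append(action)
    if len(dlq) >= QUEUE_DEPTH_ALARM:
        print(f"ALERT: dead-letter depth {len(dlq)}, something systemic is wrong")
```

The serialized record is the test: if `to_json()` doesn't contain enough to re-run the action, you didn't save enough.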
The feedback loop
In traditional messaging, dead-lettered messages are mostly a human problem: someone investigates, fixes the bug, replays the messages, and goes back to pretending the system is reliable. But with agents, the DLQ itself can become the input.
An agent can inspect its own failures, reason about patterns, adjust its approach, and retry without waiting for a human to context-switch out of whatever meeting they’re in. That’s what “agentic” actually means: not just calling tools in a loop, but learning from failures structurally. The infrastructure for this has existed for decades, we just need to actually connect it to agents instead of reinventing it badly.
Putting it to the test
Enough theory. I ran an experiment where I gave Claude access to three things: our Seq structured logging server (where all dead-lettered message events end up with full exception details, message types, and endpoint names), the ability to create GitHub issues, and the ability to read the codebase, create branches, and open PRs with fixes.
Seq already captures everything NServiceBus logs when a message gets dead-lettered: the exception stack trace, the message type, the originating endpoint, the number of failed processing attempts, and the message headers. All structured data, and exactly the kind of input an LLM can work with.
This is something people overlook about messaging middleware. Your application code doesn’t have to do anything special. NServiceBus attaches exception type, stack trace, the time of each retry attempt, the source queue, and the original message body as headers on the dead-lettered message. That’s a complete, machine-readable incident report generated automatically, without a single line of application-level error handling. You get it for free just by using the middleware.
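For illustration, here's roughly what consuming that report looks like. The header names below follow NServiceBus's `NServiceBus.ExceptionInfo.*` convention, but check the exact keys against the version you're running:

```python
def incident_report(headers: dict) -> dict:
    """Turn the error headers on a dead-lettered message into a triage-ready
    report. Header names follow NServiceBus's documented convention; verify
    them against your version before relying on this."""
    return {
        "exception_type": headers.get("NServiceBus.ExceptionInfo.ExceptionType"),
        "message": headers.get("NServiceBus.ExceptionInfo.Message"),
        "stack_trace": headers.get("NServiceBus.ExceptionInfo.StackTrace"),
        "failed_queue": headers.get("NServiceBus.FailedQ"),
        "message_type": headers.get("NServiceBus.EnclosedMessageTypes"),
        "time_of_failure": headers.get("NServiceBus.TimeOfFailure"),
    }
```

Notice there is no application code on the producing side: the middleware stamped all of this on the message for you.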
And this isn’t unique to NServiceBus. MassTransit, Rebus, Wolverine, and Brighter all attach similar error metadata to failed messages, as do the broker-level DLQ implementations in RabbitMQ, Azure Service Bus, and Amazon SQS. The details differ (headers vs. message properties vs. dead-letter reason codes) but the principle is the same: the middleware gives you structured failure context for free, and any of these would work as input for an agent.
What Claude did
I pointed Claude at the Seq query for recent dead-letter events and told it to investigate.
It immediately grouped the failures by exception type and message type. Out of 47 dead-lettered messages, it identified three distinct root causes, not 47 individual problems. Already better than most on-call rotations.
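The grouping step is trivial once the failure data is structured. A sketch of the same triage on made-up event data:

```python
from collections import defaultdict


def triage(events: list) -> dict:
    """Group dead-letter events by (exception type, message type): each group
    is one candidate root cause, not one incident per message."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["exception_type"], e["message_type"])].append(e)
    # Most frequent group first: fix the biggest bleeder first.
    return dict(sorted(groups.items(), key=lambda kv: -len(kv[1])))


# Hypothetical data mirroring the experiment: 47 messages, 3 root causes.
events = (
    [{"exception_type": "NullReferenceException", "message_type": "OrderPlaced"}] * 31
    + [{"exception_type": "TimeoutException", "message_type": "PaymentRequested"}] * 12
    + [{"exception_type": "SerializationException", "message_type": "InvoiceIssued"}] * 4
)
root_causes = triage(events)
```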
For each group, it queried Seq for surrounding log entries from the same correlation ID, tracing the message journey from initial send through each retry attempt to the final dead-letter event. It read the handler code in the repo to understand what the code was supposed to do, then compared that to what actually happened. You know, the thing we wish every developer did before saying “works on my machine.”
It created three GitHub issues, each with exception details, affected message types, frequency, and a suggested fix. The issues linked back to the specific Seq query so developers could verify the analysis instead of just trusting the robot.
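The issue itself is just a payload you POST to GitHub's create-issue endpoint (`/repos/{owner}/{repo}/issues`). A sketch of assembling one per root-cause group; the body template and the Seq URL shape are made up:

```python
def build_issue(exception_type: str, message_type: str, count: int,
                suggested_fix: str, seq_query_url: str) -> dict:
    """Assemble a GitHub issue payload for one root-cause group.
    POST this to /repos/{owner}/{repo}/issues with an authenticated client."""
    title = f"{count} dead-lettered {message_type} messages: {exception_type}"
    body = "\n".join([
        f"**Exception:** `{exception_type}`",
        f"**Affected message type:** `{message_type}`",
        f"**Frequency:** {count} messages",
        f"**Suggested fix:** {suggested_fix}",
        f"**Verify the analysis:** {seq_query_url}",  # link back to the evidence
    ])
    return {"title": title, "body": body, "labels": ["dead-letter", "auto-triaged"]}
```

The Seq link is the important part: the developer can re-run the query and check the robot's work instead of trusting it.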
For the simplest root cause, a serialization issue caused by a missing null check, it opened a PR with the fix, referencing the issue and including the Seq query link in the description.
A word of caution
Claude’s first instinct was completely wrong. When it saw NullReferenceExceptions and timeout errors, it wanted to wrap everything in try-catch blocks and add null checks, effectively swallowing the failures. “Problem solved!” No. Problem hidden. That’s the classic developer reflex: the code threw an exception, so let’s make it stop throwing. Great idea. Let’s also fix a leaking pipe by turning off the water meter.
In a messaging system, that’s exactly backwards. A NullReferenceException during message handling often means a dependent service returned garbage or there’s a race condition between handlers. A timeout usually means an external resource was temporarily unavailable. In both cases, the correct response is almost always to retry the message, not to catch the exception and pretend everything is fine. The message represents real business intent: an order, a payment, a notification. Silently discarding failures means silently losing someone’s money or data. Good luck explaining that one.
And whatever you do: never, ever purge messages. Not from the error queue. Not from the dead-letter queue. Not “just the old ones.” Not “just to clean things up.” Every dead-lettered message represents a business operation that someone or something initiated. Purging is data loss, and it’s irrecoverable. Even if the messages look like noise, they might be symptoms of a systemic issue you haven’t diagnosed yet. The whole point of a dead-letter queue is that messages survive until you understand what went wrong. The moment you purge, you’ve destroyed the evidence, and now you get to explain to your stakeholders why 300 orders disappeared.
Once I got this through to Claude, it shifted its approach: investigate the root cause, fix the underlying bug, then let the existing retry infrastructure replay the messages. That’s the right mental model.
What actually mattered
Any developer can query Seq, read a stack trace, and open a PR. Most just don’t, because it’s tedious and there’s always something more urgent. The interesting bit was the closed loop: the dead-letter queue was the trigger, the structured logs were the context, and the issues and PRs were the output. No human had to copy-paste an exception from a log viewer into a Jira ticket, and no one had to manually correlate failure patterns across dozens of messages while fighting the urge to just retry-all and hope for the best.
The dead-letter queue went from being a place where messages go to quietly rot to being the input of an automated investigation pipeline.
The full workflow
That experiment was a single pass, but the real value is in making it continuous:
- Transient errors: tune the retry policy. When the agent sees timeouts, connection resets, or HTTP 503s, the fix isn’t in application code but in the infrastructure configuration. The agent opens a PR that adjusts immediate and delayed retry intervals, bumps concurrency limits, or tightens circuit breaker thresholds, giving the system room to absorb transient failures before messages ever hit the dead-letter queue. Most of these are just configuration knobs that nobody ever bothered to tune properly.
- Actual bugs: fix the code. When the failure is a real bug (a deserialization error, a missing mapping, a logic flaw in a handler) the agent creates an issue with the root cause analysis and, when the fix is straightforward, opens a PR. For the non-straightforward ones, at least the developer picking it up doesn’t have to start from zero.
- Unclear failures: add diagnostics. Sometimes the error and the surrounding logs aren’t enough to tell you what happened. Instead of shrugging and moving on, the agent opens a PR that adds targeted diagnostic logging to the handler, capturing the specific state that would make the failure clear on the next occurrence. Next time it blows up, the logs will actually tell the full story instead of being useless.
- Deployment-triggered retry: replay the messages. When a new version is deployed that includes a fix for a known dead-letter cause, the agent detects the deployment event and triggers a retry of the affected messages. If they succeed, the loop is closed. If they fail again, the cycle continues with fresh diagnostic data. No human needed to remember “oh right, we should retry those 47 messages from last Tuesday.”
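The routing decision at the top of that loop is a small classifier. A sketch, with illustrative exception names and category labels:

```python
TRANSIENT = {"TimeoutException", "SocketException", "HttpRequestException"}


def plan_action(exception_type: str, has_known_root_cause: bool) -> str:
    """Decide which branch of the workflow a dead-letter group goes down."""
    if exception_type in TRANSIENT:
        return "tune-retry-policy"  # infrastructure config PR, not code
    if has_known_root_cause:
        return "fix-code"           # issue + PR with the root cause analysis
    return "add-diagnostics"        # PR that adds targeted logging for next time


def on_deployment(deployed_fixes: set, dead_letter_groups: dict) -> list:
    """Deployment-triggered retry: replay the groups whose fix just shipped."""
    return [group for group, issue in dead_letter_groups.items()
            if issue in deployed_fixes]
```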
The whole thing is event-driven. A message hits the dead-letter queue, the agent picks it up, analyzes the error, reads the code, and opens a PR. From DLQ to bug fix in minutes, not days. No standup required. No Jira ticket sitting in a backlog while someone “gets to it next sprint.”
The only human step is the PR review, but even that boundary can move. If the agent can classify a fix as non-destructive (say, a null check on an optional field, or a retry policy tweak) there’s no reason it couldn’t trigger an auto-deployment through CI. The fix ships, the deployment event fires, the agent retries the affected messages, they succeed, done. From dead-letter to resolved, fully automated, while you were getting coffee.
This entire workflow assumes two things that any well-designed messaging system should already guarantee: idempotent message processing and the ability to handle at-least-once delivery. If retrying a message causes duplicate side effects, you have a far bigger problem than dead-letter queue management. Idempotency is not optional, it’s what makes the retry safe, whether it’s triggered by the infrastructure, a human, or an agent. If your handlers aren’t idempotent, fix that first. Everything else is built on sand.
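Idempotency is often just a processed-message check. A minimal sketch; in production the seen-set must be durable and committed in the same transaction as the side effect:

```python
class PaymentHandler:
    """At-least-once delivery means duplicates WILL arrive; process each once."""

    def __init__(self):
        self.processed = set()  # in production: durable, transactional store
        self.charges = []

    def handle(self, message_id: str, amount: int) -> None:
        if message_id in self.processed:
            return                        # duplicate delivery: safe to ignore
        self.charges.append(amount)       # the real side effect
        self.processed.add(message_id)    # must commit atomically with it
```

Deliver the same message twice and the customer is charged once. That's what makes retry-all safe, whoever triggers it.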
Learning from humans
The agent’s first PR for a given failure type is going to be mediocre, and that’s fine. What matters is what happens after a human reviews it.
When a developer takes the agent’s auto-generated PR, rewrites half of it, and merges their version, that’s training data. Not in the ML fine-tuning sense, but practically: the agent can diff its proposed fix against what actually got merged and see what the human changed and why. Did they use a different error handling pattern? Did they add a retry policy instead of a null check? Did they refactor the handler to avoid the problem entirely instead of patching around it?
Over time, the agent builds up a history of “here’s what I suggested, here’s what the team actually shipped.” That’s a feedback loop on the feedback loop. After a few dozen resolved incidents, the agent has seen your team’s coding style, your preferred error handling patterns, your conventions for retry policies and circuit breakers. Its PRs start looking less like generic LLM output and more like something your senior dev would write on a good day.
The trick is simple: before generating a fix, the agent queries GitHub for past merged PRs that touched the same handler or the same message type. It reads the diff, the review comments, the linked issue. Now it has context that no amount of prompt engineering can replace because it knows how your team fixes things in your codebase.
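A sketch of that lookup over already-fetched PR metadata; the record shapes here are simplified stand-ins for what GitHub's API returns:

```python
def precedents(merged_prs: list, handler_path: str, message_type: str) -> list:
    """Find past merged PRs that touched the same handler or mention the same
    message type, so the agent can mimic how the team actually fixes things."""
    hits = []
    for pr in merged_prs:
        touches_handler = handler_path in pr["files"]
        mentions_type = message_type in pr["title"] or message_type in pr["body"]
        if touches_handler or mentions_type:
            hits.append(pr)
    # Most recent precedent first: conventions drift, recent style wins.
    return sorted(hits, key=lambda pr: pr["merged_at"], reverse=True)
```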
Correlating with infrastructure telemetry
Seq logs tell you what failed but they don’t always tell you why. A TimeoutException in a handler could mean the code is slow, the database is overloaded, or the entire node is getting hammered, and the stack trace looks the same in all three cases.
This is where OpenTelemetry and Prometheus come in. Give the agent access to your metrics (CPU utilization, memory pressure, request latency percentiles, database connection pool saturation, message throughput per endpoint) and suddenly those ambiguous timeouts have context. The agent can correlate the dead-letter timestamp with a CPU spike on the processing node, or a connection pool exhaustion event, or a garbage collection pause. Now it knows the handler isn’t broken, the infrastructure was having a bad moment, and the fix isn’t a code change but scaling the node or tuning the connection pool.
The agent can also spot patterns that humans miss because nobody has time to cross-reference five dashboards. “Every Thursday between 14:00 and 14:30, this handler starts timing out. CPU on node 3 spikes to 95%. That correlates with the weekly reporting job that runs on the same cluster.” Good luck finding that in a Grafana dashboard while you’re also dealing with three other issues.
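The correlation itself is cheap. A sketch, assuming you've already pulled metric samples as (timestamp, value) pairs from your metrics store:

```python
def correlates_with_spike(failure_ts: float, metric: list,
                          threshold: float = 90.0, window: float = 60.0) -> bool:
    """True if any metric sample within `window` seconds of the failure
    exceeded `threshold` (e.g. CPU %). Points at infrastructure, not code."""
    return any(
        abs(ts - failure_ts) <= window and value >= threshold
        for ts, value in metric
    )
```

If the dead-letter timestamp lines up with a spike, the agent reaches for scaling or pool tuning instead of a code PR.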
Detecting regressions
The agent already knows which deployments went out and which PRs they contained, and it knows the historical dead-letter rate per handler. So when a new deployment goes out and the dead-letter rate for HandlePaymentCompleted jumps from zero to fifty messages in ten minutes, it doesn’t just notice, it correlates.
It can diff the current deployment against the previous one, identify which commits touched the affected handler, and cross-reference with past issues. “This handler was last dead-lettering in v2.3.8 due to issue #127, which was fixed in PR #131. The current deployment includes PR #156 which modified the same handler. The exception type matches issue #127.” That’s a regression, and the agent didn’t just detect it, it traced it back to the specific PR that reintroduced the problem.
Combined with infrastructure metrics, the agent can also distinguish between a code regression and an infrastructure regression. Did the dead-letter rate spike because someone shipped a bug, or because the database failover kicked in and half the in-flight messages timed out? These are very different problems with very different fixes, and an agent with access to both your deployment history and your Prometheus metrics can tell them apart immediately.
The endgame is an agent that, upon detecting a regression, automatically opens an issue that says: “Dead-letter rate for HandlePaymentCompleted increased 50x after deployment v2.5.0. Likely caused by PR #156 (cc @developer). Exception pattern matches previously resolved issue #127. Infrastructure metrics are normal, this is a code regression, not a capacity issue. Suggested action: revert PR #156.” And if you’ve set up the auto-deployment pipeline from earlier, it could even do the revert itself, before anyone’s pager goes off.
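The verdict is mostly set intersection over data the agent already has. A sketch with simplified shapes (the handler path is illustrative; the PR and issue numbers are the ones from the example above):

```python
def regression_verdict(exception_sig: str, handler: str, deployment_prs: list,
                       past_issues: list, infra_metrics_normal: bool) -> str:
    """Classify a dead-letter spike that follows a deployment."""
    if not infra_metrics_normal:
        return "infrastructure: capacity or failover problem, not a code bug"
    suspects = [pr for pr in deployment_prs if handler in pr["files"]]
    known = [i for i in past_issues
             if i["exception"] == exception_sig and i["handler"] == handler]
    if suspects and known:
        return (f"code regression: likely {suspects[0]['id']}, pattern matches "
                f"resolved issue {known[0]['id']}; suggest revert")
    if suspects:
        return f"code regression: likely {suspects[0]['id']}"
    return "unclear: add diagnostics and keep watching"
```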
At that point you’ve built an SRE that doesn’t sleep, doesn’t get context-switching fatigue, and has perfect memory of every incident your system has ever had. Not a replacement for your team, but a hell of a first responder.
Stop reinventing this
If you’re building agentic systems, stop reinventing failure handling from scratch. The messaging world figured this out ages ago: dead-letter queues, poison message detection, retry policies, error queues. These patterns map directly onto the problems agents face today. You don’t need a PhD in “agentic architecture” to implement this, just a message broker, some common sense, and a reasonable API budget.
And if you already have dead-letter queues and structured logging, you’re closer to an agentic feedback loop than you think. Give an agent read access to your logs and write access to your issue tracker, and the dead-letter queue stops being a graveyard and becomes a backlog that investigates itself.