We open-sourced the AWS AI Agent Bus. People use it. We used it. We get messages from teams running it who need help.
This is an honest post about how it went and what we learned.
The Part That Worked
The MCP server is good. We stand behind the work and the capabilities.
The goal: give AI assistants a standardized and usable internal interface to AWS. DynamoDB for persistent key-value storage. S3 for artifacts. EventBridge for event coordination. A timeline system for logging what agents do and when. Two transport options, stdio and HTTP, depending on how you want to connect.
MCP is the right tool for this job. It does what a tool protocol should do: expose a capability, accept a call, return a result. The contract is clean. Claude knows how to use it. When an agent needs to write to DynamoDB or fire an event to EventBridge, the MCP server handles it. That layer works.
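That contract, expose a capability, accept a call, return a result, can be sketched as a minimal tool registry. The tool names and handler shapes below are illustrative, not the Agent Bus's actual tool surface.

```python
# Minimal sketch of the MCP-style tool contract: expose a capability,
# accept a call, return a result. Names are illustrative, not the
# Agent Bus's real API.
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}

def tool(name: str):
    """Register a handler under a capability name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register

@tool("kv.put")
def kv_put(store: dict, key: str, value: str) -> dict:
    """Hypothetical KV-write capability (a dict stands in for DynamoDB)."""
    store[key] = value
    return {"ok": True, "key": key}

def call_tool(name: str, **kwargs: Any) -> dict:
    """Accept a call by name and return the handler's result."""
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return TOOLS[name](**kwargs)

store: dict = {}
result = call_tool("kv.put", store=store, key="run-1", value="started")
```

The point of the shape is that the caller never sees the handler, only the capability name and the result, which is why the layer composes so cleanly under an assistant like Claude.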
Even knowing MCP far more deeply now than we did then, we'd build that layer the same way again.
The Part That Was a Mess
The orchestration layer is where we went wrong. The MCP context overflowed during model runs. Adding retry queues, concurrency management, and similar patterns became a job of its own. When we built it, MCP had just taken off and lots of people were still learning it, including us (and, apparently, Anthropic).
We built a multi-agent system on top of the MCP server. A Conductor that set the plan. Critics that checked for safety and approval. Specialists with domain knowledge in React, Django, and Terraform. A memory system combining KV storage, timeline history, and vector embeddings. It looked good in the README.
In practice, it was brittle.
The roles were fiction. Conductor, Critic, Specialist — these were prompt engineering dressed up as architecture. Nothing structural separated what the Critic could access from what the Specialist could access. We had one agent telling another what to do through natural language, and the “other agent” was just Claude with a different system prompt. The coordination was vibes.
Shared memory created a coherence problem. We gave agents access to the same memory system so they’d have context. When agents share memory without scope constraints, they act on each other’s context in ways we didn’t intend. An agent would pull timeline history from a previous workflow step and factor it into a decision it had no business making. Debugging that was a nightmare.
There was no trust boundary. The assumption was that the agent bus is an internal tool, so security was simply a matter of using what AWS provided. Any agent in the system could write to the KV store, fire events to EventBridge, or update the timeline. IAM policies did some of the work, but nothing at the application layer enforced which agent could take which action at which point in the workflow. We were one prompt injection away from a bad day.
Human-in-the-loop was an afterthought. We said the system had human-in-the-loop capabilities. Technically true. We implemented it as a pause-and-poll pattern: an agent would write a pending-approval record to DynamoDB, and something would eventually check it. That works in calm conditions. It is not a control plane. A human couldn’t see workflow state. They couldn’t intervene mid-step. They could approve or reject a completed artifact. That’s not the same thing.
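The pause-and-poll pattern we shipped amounts to something like the sketch below, with an in-memory dict standing in for the DynamoDB table. The field names are hypothetical, not the Agent Bus schema.

```python
# Sketch of the pause-and-poll approval pattern: an agent writes a
# pending-approval record, and something eventually checks it. A dict
# stands in for the DynamoDB table; field names are hypothetical.
import time

approvals: dict[str, dict] = {}

def request_approval(workflow_id: str, artifact: str) -> str:
    """Agent writes a pending-approval record and pauses."""
    approvals[workflow_id] = {
        "artifact": artifact,
        "status": "pending",
        "requested_at": time.time(),
    }
    return workflow_id

def poll(workflow_id: str) -> str:
    """Something, eventually, checks the record."""
    return approvals[workflow_id]["status"]

def decide(workflow_id: str, approved: bool) -> None:
    """The human sees only a finished artifact: approve or reject."""
    approvals[workflow_id]["status"] = "approved" if approved else "rejected"

wid = request_approval("wf-42", "terraform plan output")
decide(wid, approved=True)
```

Notice what the sketch cannot express, which is the point: there is no view of intermediate workflow state and no way to intervene mid-step, only a yes/no on a completed artifact.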
What We Learned
The Agent Bus forced us to get specific about what multi-agent coordination actually requires. Not features that sound good in a README. Structural properties that make agents safe to run together.
Context must be scoped. The orchestrator needs the full picture. Downstream agents don’t. Each agent operates on a mandate: here is your task, here is the information you need, here is nothing else. When we violated this, we got coherence problems. When we enforced it manually, things worked.
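The manual enforcement we ended up doing looked roughly like this: the orchestrator holds the full context and hands each downstream agent only the keys its mandate names. The key names here are hypothetical.

```python
# Sketch of context scoping: the orchestrator keeps the full picture,
# downstream agents get a filtered view. Key names are hypothetical.
full_context = {
    "task": "add a health-check endpoint",
    "repo": "acme/api",
    "aws_credentials": "…",                 # downstream agents never see this
    "timeline_history": ["step-1", "step-2"],  # nor context from prior steps
}

def scope_context(context: dict, allowed_keys: set[str]) -> dict:
    """Here is your task, here is what you need, here is nothing else."""
    return {k: v for k, v in context.items() if k in allowed_keys}

specialist_view = scope_context(full_context, {"task", "repo"})
```

Filtering like this is exactly what prevents the coherence problem described earlier: an agent can't factor in timeline history it was never handed.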
Authority must be time-bound. An agent authorized to act an hour ago should not still be acting on that instruction. We had no TTL concept on a capability. Stale agents taking actions on stale context is a real production problem.
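The TTL concept we lacked is small to state in code: a capability carries an issued-at timestamp and a lifetime, and every use checks it. This is an illustrative sketch, not PAP's actual capability format.

```python
# Sketch of time-bound authority: a capability that expires. The shape
# is illustrative, not PAP's actual capability format.
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Capability:
    action: str
    issued_at: float
    ttl_seconds: float

    def is_live(self, now: Optional[float] = None) -> bool:
        """Authority decays: a stale grant must fail this check."""
        now = time.time() if now is None else now
        return (now - self.issued_at) < self.ttl_seconds

fresh = Capability(action="events.put", issued_at=time.time(), ttl_seconds=300)
stale = Capability(action="events.put", issued_at=time.time() - 3600, ttl_seconds=300)
```

An agent authorized an hour ago holds `stale`, and the check refuses it before it acts on stale context.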
EventBridge is infrastructure, not a protocol. We kept reaching for it as a coordination backbone. It delivers your event. It won’t tell you who was authorized to send it, whether the receiving agent has the mandate to act on it, or whether the action has an audit trail that lives outside the platform.
Receipts belong to principals. When an agent takes an action, a step completes, or a human approves, that record should live with the organization that initiated the workflow. We stored it in our DynamoDB table. We had observability. We didn’t have accountability.
Where This Led
The problems kept pointing at the same gap: no governance layer between the MCP server and the application logic. Every team building multi-agent systems fills that gap differently, in their application code, inconsistently.
That gap is what Principal Agent Protocol is built to fill.
PAP puts governance at the protocol level. The orchestrator holds full context. Downstream agents operate on scoped, cryptographically bound mandates with TTL-based decay. Capability tokens are single-use. Receipts are co-signed and held by the principal. Trust is structural.
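Two of those properties, single-use capability tokens and co-signed receipts, can be illustrated in a few lines. The HMAC-with-shared-keys scheme below is a stand-in for illustration, not PAP's actual cryptography.

```python
# Sketch of single-use tokens and co-signed receipts. HMAC with shared
# keys is a stand-in here; it is not PAP's actual signing scheme.
import hashlib
import hmac
import secrets

consumed: set[str] = set()

def use_token(token: str) -> bool:
    """A capability token spends exactly once."""
    if token in consumed:
        return False
    consumed.add(token)
    return True

def sign(key: bytes, payload: bytes) -> str:
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

agent_key = secrets.token_bytes(32)
principal_key = secrets.token_bytes(32)

payload = b"step=deploy status=ok"
receipt = {
    "payload": payload,
    "agent_sig": sign(agent_key, payload),          # the agent attests it acted
    "principal_sig": sign(principal_key, payload),  # the principal holds the record
}

token = secrets.token_hex(16)
first_use = use_token(token)
second_use = use_token(token)
```

Because the principal co-signs and keeps the receipt, the audit trail lives with the organization that initiated the workflow rather than inside the operator's DynamoDB table.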
The MCP server from the Agent Bus sits cleanly below PAP. It does what it does well. PAP handles what the Agent Bus orchestration layer failed to handle, without requiring every team to solve the same problems in their own application code.
What We Build Now
Everything we build runs on PAP. Every client engagement. Every internal tool. The MCP server still handles what it handles well. PAP handles everything above it.
We’re not waiting for the ecosystem to catch up. The problems the Agent Bus exposed are real, and every team building multi-agent systems hits them. Most solve them in application code, inconsistently, until something breaks in production.
You don’t have to do that. PAP is open source. The specification is published. The Rust implementation is available. Start with the same foundation we use.