Distributed System
A system where multiple services work together, communicating over a network.
A distributed system is a group of services that coordinate over a network to achieve a common goal. Instead of one process doing everything, the work is split across multiple services that communicate through HTTP calls, gRPC, or messages.
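For a concrete picture, here is a minimal sketch of one service calling another over HTTP in Go. The orders service address, route, and order ID are hypothetical stand-ins, not from any real system:

```go
// One service calling another over the network.
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

func fetchOrder(ctx context.Context, orderID string) ([]byte, error) {
	// Always bound the call: the other service may be slow or down.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	url := "http://orders:8080/orders/" + orderID // hypothetical service address
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("calling orders service: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("orders service returned %d", resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetchOrder(context.Background(), "order-123")
	if err != nil {
		fmt.Println("error:", err) // on a laptop this fails fast: the network is part of the system now
		return
	}
	fmt.Println(string(body))
}
```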
Splitting work this way lets you scale parts independently, deploy one service without taking down the whole system, and have teams own separate services. But the network between services introduces problems you don't have in a monolith: calls can fail, messages can arrive out of order or more than once, and there's no shared transaction to keep everything consistent.
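As a sketch of what handling those failed calls looks like, here is a simple retry with exponential backoff. The attempt count and base delay are arbitrary choices, and retrying is only safe when the call is idempotent:

```go
// A minimal retry with exponential backoff. Suitable only for idempotent
// calls: a retried non-idempotent call may execute twice on the other side.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func withRetry(ctx context.Context, attempts int, call func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(ctx); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
			backoff *= 2 // double the wait between attempts
		case <-ctx.Done():
			return ctx.Err() // caller gave up; stop retrying
		}
	}
	return fmt.Errorf("all %d attempts failed, last error: %w", attempts, err)
}

func main() {
	err := withRetry(context.Background(), 3, func(ctx context.Context) error {
		return errors.New("connection refused") // stand-in for a real network call
	})
	fmt.Println(err)
}
```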
These challenges are not edge cases, but the default state. Your code needs to handle at-least-once delivery, eventual consistency, partial failures, and retries. You need tracing to understand what's happening across services, and correlation IDs to follow a request through the system. Patterns like the saga, the outbox, and circuit breakers exist because distributed systems fail in ways that monoliths don't.
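At-least-once delivery, for instance, means a handler can receive the same message twice, so it has to be idempotent. A minimal sketch, with an in-memory map standing in for a durable store of processed message IDs:

```go
// An idempotent message handler: at-least-once delivery means the broker
// may redeliver, so duplicates must be detected and skipped. In production
// the processed-ID check would live in a durable store, ideally updated in
// the same transaction as the business logic.
package main

import (
	"fmt"
	"sync"
)

type Message struct {
	ID   string // unique per logical message, stable across redeliveries
	Body string
}

type Handler struct {
	mu        sync.Mutex
	processed map[string]bool
}

func NewHandler() *Handler {
	return &Handler{processed: make(map[string]bool)}
}

func (h *Handler) Handle(msg Message) error {
	h.mu.Lock()
	defer h.mu.Unlock()

	if h.processed[msg.ID] {
		return nil // duplicate delivery: already handled, safe to ack again
	}

	// Business logic goes here; it runs at most once per message ID.
	fmt.Println("processing", msg.ID, msg.Body)

	h.processed[msg.ID] = true
	return nil
}

func main() {
	h := NewHandler()
	msg := Message{ID: "msg-1", Body: "order placed"}
	_ = h.Handle(msg)
	_ = h.Handle(msg) // redelivery: detected as a duplicate and skipped
}
```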
Before splitting into multiple services, consider whether you need to. A well-structured monolith with clear bounded contexts handles a lot of complexity with much less operational overhead. If two things always change together, they probably belong in the same service. Start with a monolith and extract services when you have a real reason: independent scaling, separate team ownership, or different deployment cadences.
References
- Distributed Transactions in Go: Read Before You Try — Discusses how to simplify distributed systems using events and eventual consistency instead of distributed transactions. Shows that working with facts (events) removes the need for rollbacks and distributed locks.
- The Over-Engineering Pendulum — Warns against swinging between extremes: sticking with a massive monolith forever is as extreme as starting with a distributed system. Advocates for modular monoliths with the option to extract services when needed.
- Shipping an AI Agent that Lies to Production: Lessons Learned — Draws a parallel between the microservices hype and the agentic AI hype. Many teams learned the hard way that distributed systems are difficult to get right, even with perfectly predictable code.
- Event-Driven Architecture: The Hard Parts — Shares hard-learned lessons from building distributed systems with events, covering pitfalls like dropped messages, debugging, eventual consistency, and designing events.
- The Distributed Monolith Trap (And How to Escape It) — Discusses the complexity of distributed systems and whether all that complexity was really needed. Reflects on jumping on the hype train and how some parts could have been simplified.
- AMA #1: Clean Architecture, Learning, Event-Driven, Go — Covers distributed system timeouts and async communication. When dealing with chains of service calls, consider using messages instead of increasing timeouts.