Data Lake
A central store where all domain events are kept in raw form for replay, analytics, and read model rebuilding.
A data lake is a central place where you store all events from your system in raw, read-only form. Events are facts that happened. You don't modify them.
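To make that concrete, here's a minimal sketch of an append-only event store in Python. It uses SQLite so the example is self-contained; the same schema carries over to PostgreSQL, which the storage notes below suggest as a starting point. The table and function names (`events`, `append_event`) are illustrative, not a standard API.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Append-only event storage. SQLite keeps the sketch self-contained;
# the same schema works in PostgreSQL.
conn = sqlite3.connect("data_lake.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type  TEXT NOT NULL,
        payload     TEXT NOT NULL,   -- the raw event, serialized as JSON
        recorded_at TEXT NOT NULL
    )
""")

def append_event(event_type: str, payload: dict) -> None:
    # Insert only: events are facts, so there is no update or delete path.
    conn.execute(
        "INSERT INTO events (event_type, payload, recorded_at) VALUES (?, ?, ?)",
        (event_type, json.dumps(payload), datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

append_event("OrderPlaced", {"order_id": "42", "customer_id": "alice", "total": 99.90})
```

Note what's missing: there is no update or delete path. Append-only storage is what makes the lake safe to share later.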
A common reason to build a data lake is read model migration. When you need to add data to a read model that wasn't there before, fix a bug in how a read model was built, or create a new read model from old events, you need access to the full event history. Without a data lake, that history is gone once messages have left the Pub/Sub.
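Here's a sketch of what a rebuild could look like, reusing the hypothetical `events` table from the sketch above: replay the full history in insertion order and fold each event into a fresh read model. The read model and event shape are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect("data_lake.db")

def rebuild_orders_read_model() -> dict:
    # Start from scratch and fold every event, oldest first, into the model.
    orders_per_customer: dict[str, int] = {}
    for event_type, payload in conn.execute(
        "SELECT event_type, payload FROM events ORDER BY id"
    ):
        event = json.loads(payload)
        if event_type == "OrderPlaced":
            customer = event.get("customer_id", "unknown")
            orders_per_customer[customer] = orders_per_customer.get(customer, 0) + 1
    return orders_per_customer

print(rebuild_orders_read_model())  # e.g. {'alice': 1}
```

The same loop covers all three cases: a new field just means a new handler, a bug fix means dropping the old model and replaying, and a brand-new read model starts from the same history.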
Data lakes are also useful outside your team. Data teams can query events for reports and dashboards. Data scientists can use them for machine learning. Product managers can explore what happened in the system without asking developers. Since the data is read-only, there's no risk of breaking anything.
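As a rough example of the kind of read-only query a data team might run, here's a count of events per type against the same hypothetical `events` table. On PostgreSQL or BigQuery this would typically be plain SQL in a reporting tool rather than a script.

```python
import sqlite3

# Read-only reporting over the raw events: nothing here can mutate the lake.
conn = sqlite3.connect("data_lake.db")
report = conn.execute(
    "SELECT event_type, COUNT(*) FROM events GROUP BY event_type ORDER BY 2 DESC"
).fetchall()
for event_type, count in report:
    print(f"{event_type}: {count}")
```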
A data lake is one of those places where integrating through a shared database is acceptable. Multiple systems can read from it. The trade-off is that your event schema becomes part of the contract. You can't change it without checking that nothing depends on it.
For storage, the choice depends on scale. If you're not expecting more than a terabyte of data soon, PostgreSQL works fine as a starting point. At larger scale, options like BigQuery or Redshift handle huge datasets efficiently. Watch out for premature optimization: it's easy to spend too much time setting up a complicated data lake at the beginning of a project.
References
- Event-Driven Architecture: The Hard Parts — Regarding component tests, it's important to use the same configuration you use in production, especially if you have something like a single event topic that routes events to a data lake or event log. Do it the same way in your tests; otherwise you can't be sure you're testing the right thing.