Event Sourcing for SaaS applications
and other interesting conversations in the SaaS Developer Community
There were a lot of good conversations in the SaaS Developer Slack in the last few weeks. If you haven’t joined yet, you are missing out on learning from the best SaaS developers and founders.
And if you are in the SF Bay Area, don’t miss our first in-person meetup. We have great presentations planned about adding generative AI features to SaaS apps. Frank Greco, Head of AI at Superblocks, will share how Superblocks uses AI to transform the Developer Experience. Antje Barth and Chris Fregly from AWS will share stories and advice based on their experience working with hundreds of developers building AI products.
I summarized one particularly interesting conversation from the SaaS Developer Slack this week. In addition, I linked to many other interesting Slack conversations. Let me know what you think of this new format.
Event Sourcing for SaaS Applications
Moshe Eshel kicked off the discussion:
I've been seeing event sourcing all around for years but have never seen an actual production implementation. It seems like a neat idea, but I can't figure out how a business application actually functions this way.
Lucas Stephens explained the use case but warned that it ain’t for everyone:
Event sourcing is typically paired with CQRS - you have two data stores: one that can be optimized for writes and one that can be optimized for reads.
As an example, at my current company, we offer a "proxy-workflow engine" as a service: we proxy a significant portion of our customers' traffic and need to operate on that traffic quickly without impacting latency. We have two separate data stores as a result: the write-optimized store, which is very difficult to query but generally has 10-20ms inserts, and the read-optimized store, which is treated as eventually consistent, is populated by a data pipeline, is much more usable for analytical queries, and is what the UI of our application points to.
With event sourcing, your event store essentially becomes the write-optimized store in this system. Then you build read-optimized projections of that data in separate data stores - this typically means writing a stream processor to populate those new stores. The obvious trade-off here is that you do double the work of maintaining two databases. I don't recommend CQRS or event sourcing unless you absolutely need it. Personally, I think the main benefit of event sourcing (auditability) can still be achieved with your traditional RDBMS model.
(We don't do event sourcing at my company, only CQRS - the concepts are closely related but separate.) The hard part of event sourcing is that you're putting all of your business logic into a stream of events, which is cognitively harder to think about and change - I've seen event sourcing fail more than it succeeds, so I'm a bit biased.
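To make the write/read split concrete, here is a minimal sketch of the CQRS pattern Lucas describes: an append-only log on the write side and a projection computed from it on the read side. The toy account domain and all names here are my own illustration, not Lucas's actual system.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Write side: an append-only event log. In production this would be a
# durable store (Kafka, Postgres, a dedicated event store); a list here.
@dataclass
class EventLog:
    events: list = field(default_factory=list)

    def append(self, event: dict) -> None:
        # Write-optimized: a blind append, no indexes or joins to maintain.
        self.events.append(event)

# Read side: a projection computed from the log. A real system would run
# this as a stream processor and persist the result, making the read
# store eventually consistent with the write store.
def project_balances(log: EventLog) -> dict:
    balances: dict = defaultdict(int)
    for event in log.events:
        if event["type"] == "Deposited":
            balances[event["account"]] += event["amount"]
        elif event["type"] == "Withdrawn":
            balances[event["account"]] -= event["amount"]
    return dict(balances)

log = EventLog()
log.append({"type": "Deposited", "account": "a1", "amount": 100})
log.append({"type": "Withdrawn", "account": "a1", "amount": 30})
print(project_balances(log))  # {'a1': 70} -- the read-optimized view
```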
But Moshe was looking for something more specific:
I'm looking specifically for a working example of event sourcing as the root of the architecture (exactly as you explain in the second part). It's not only consuming streams and processing to populate a data store but also the entire application running on top of a stream - constantly scanning it from start to ... (endless?)
So Lucas went into more details:
You could model pretty much any system as a stream of events; you just have to decide when and if it's appropriate.
A specific case where I've seen it done (and it ultimately failed) was when I worked at a life insurance tech startup. We modeled the entire flow of initiating & managing a policy as a stream of events, so every single creation/modification/deletion of the policy itself became something like
PolicyHolderAddressChanged or PolicyBeneficiaryAdded instead of calls to a generic UpdatePolicy endpoint. To query for the current policy, services would either have to get all events from the event store, order them, and compute the state in memory, or query a "projection" showing the current state. We tended to opt for the projection route for performance. With a few exceptions, mostly everything is communicated via publishing an event to a queue. We chose event sourcing because we knew we needed audited history for the policies. We built & modeled that entire system using protobufs for the event definitions.
Why I ultimately consider it a failure (this startup recently shut down at the beginning of the year) is that it simply took way, way, way too much time to engineer the system like this. And from a compliance perspective, we would have been okay simply creating triggers on our database tables that inserted versions of rows into some "history" table. We had real problems when we needed to add new events to the domain or update the schema of existing events - which are inevitable changes in any system. This meant that every single projection needed to be replayed & rebuilt, and this was especially expensive for the projections we built for our analytics team, as they needed the most generic view of the data.
As our data grew, this only became more difficult. I do remember we ran several event-storming sessions to model the business before we started eng work, so if you're thinking about it, it might be useful to do this exercise.
Going into the project, I wasn't against the idea, but now that I've seen how the sausage is made, I'll always be a bit scarred from it. What is really ironic to me is that event sourcing seems most appropriate for systems with infrequent writes (fewer events, less need for projections), yet the added complexity and extra work are especially not worth it in those situations. It can be argued that an unstructured event store (NoSQL) is easier to scale than your traditional RDBMS - but there's so much exciting stuff happening in more "horizontally scalable" SQL databases like Vitess, Cockroach, and ClickHouse that I never feel compelled to choose event sourcing.
Not to mention, event sourcing is done so sparingly that there's not a lot of open-source tooling or experience amongst devs that can help you out with things like replaying events, building projections, etc., whereas there is a ton of tooling for the traditional RDBMS model.
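To illustrate what Lucas is describing, here is a minimal sketch of replaying policy events to compute current state. The event names come from his story, but the field names and fold logic are my guesses; his team defined the real events in protobufs.

```python
from dataclasses import dataclass

# Event definitions. Field names are illustrative, not the real schema.
@dataclass
class PolicyHolderAddressChanged:
    policy_id: str
    address: str

@dataclass
class PolicyBeneficiaryAdded:
    policy_id: str
    beneficiary: str

def replay(events) -> dict:
    """Fold over the full event history to compute current policy state.
    A 'projection' does the same fold incrementally and persists the
    result, so readers don't pay this cost on every query."""
    state = {"address": None, "beneficiaries": []}
    for e in events:
        if isinstance(e, PolicyHolderAddressChanged):
            state["address"] = e.address
        elif isinstance(e, PolicyBeneficiaryAdded):
            state["beneficiaries"].append(e.beneficiary)
    return state

history = [
    PolicyHolderAddressChanged("p-1", "12 Main St"),
    PolicyBeneficiaryAdded("p-1", "Alice"),
    PolicyHolderAddressChanged("p-1", "34 Oak Ave"),
]
print(replay(history))  # {'address': '34 Oak Ave', 'beneficiaries': ['Alice']}
```

Note how adding a new event type means touching every fold like this one - which is exactly the schema-evolution pain Lucas describes.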
Worth noting that from my (Gwen) conversations with many startups, it is very common to start with exactly the simple pattern that Lucas described - triggers and a history table. It scales surprisingly well, especially if the history table is designed correctly.
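For comparison, here is a minimal sketch of that triggers-and-history-table pattern, using SQLite so it runs anywhere; Postgres and MySQL have their own trigger syntax, and the table layout here is just an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE policy (
    id      INTEGER PRIMARY KEY,
    address TEXT
);
CREATE TABLE policy_history (
    policy_id  INTEGER,
    address    TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Every update snapshots the old row into the history table.
CREATE TRIGGER policy_audit AFTER UPDATE ON policy
BEGIN
    INSERT INTO policy_history (policy_id, address)
    VALUES (OLD.id, OLD.address);
END;
""")
conn.execute("INSERT INTO policy (id, address) VALUES (1, '12 Main St')")
conn.execute("UPDATE policy SET address = '34 Oak Ave' WHERE id = 1")
print(conn.execute("SELECT policy_id, address FROM policy_history").fetchall())
# [(1, '12 Main St')] -- an audit trail, no event sourcing required
```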
Lucas’ experience convinced Moshe, and there was agreement all around.
But then Daniel Chaffelson entered the conversation:
I've seen event sourcing done in payments processing - specifically servicing the 'instant payments' standard, where latency must be under 2s. It makes sense for this to be an event-driven system because it's high volume, transactions have a limited number of pathways through the process and generally live for a very short amount of time, and tables of the various outputs can be materialized over the streams as they happen.
Ultimately though, this is just a core feature of a larger banking system, not the entire system itself, so I don't know if it really answers your question. I guess you could ask what % of a system needs to be event sourced for it to count as the main architecture.
Also, I would agree that if you think you need event sourcing because of some kind of scale problem, technology like ClickHouse (and others, as mentioned) is solving a lot of this kind of thing without making your architecture structurally complex. We operate ClickHouse in a use case over billions of daily rows, with latency that stays under 500ms at a couple of hundred QPS on average and can spike 5-20x on a busy day. It requires operational finesse to maintain, but the architecture isn't complex and development on top of it is quick and easy.
Moshe pointed out the distinction between event sourcing and event-driven architectures. He also highlighted how both of Daniel's examples cleverly worked around the key disadvantages of event sourcing while enjoying many of the benefits:
It sounds like two separate cases.
The first case you described sounds like event streaming, but on a small time scale (so there is no long-term persistence of events; rather, events have a short TTL). Indeed, the operation and state are always processed as a stream: queries aggregate the events, there is no other data in the system, and views are temporary - either aggregated in place or recreated. This is a very valid use case. Thinking of the tradeoffs, it throws away the biggest problem with event-based systems (data accumulation and how to deal with it) and enjoys all the benefits (for example, a stream processor can use time windows and the like).
The second part sounds to me like event-driven architecture. Communication is done via events, but the events just drive the state, which is managed internally by each component/service (in a DB?) according to its preference. It isn't event streaming. I like this architecture because it again preserves the advantages and "throws away" the problem: events are easy to pass and provide a rich, contextful interface, but we don't have to aggregate many small pieces every time we want an answer. Instead, we use SQL (or whatever) on a strong data store.
You can still store the events aside and replay them if needed - for example, when you want to correct something. This is a nice backup strategy that can help deal with data corruption caused by a bug in handling events. I built such a system at a previous company. Sometimes our processing logic changed, due to either a bug or a client request to use a different aggregation rule, and we could delete the stored data, replay the original events, and recalculate. It was heavy, but our support team loved it!
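Here is a rough sketch of that archive-and-replay backup strategy. Moshe didn't share code, so the shape of the service and all names are my own invention:

```python
# Derived state lives in a normal data store; raw events are archived
# separately (in production: append-only cold storage such as S3).
archive: list = []
state: dict = {}

def handle(event: dict, aggregate) -> None:
    """Normal operation: archive the raw event, then update derived state."""
    archive.append(event)
    key = event["customer"]
    state[key] = aggregate(state.get(key, 0), event["value"])

def replay(aggregate) -> None:
    """Recovery: wipe derived state and recompute it from the archive,
    using (possibly corrected) aggregation logic."""
    state.clear()
    for event in archive:
        key = event["customer"]
        state[key] = aggregate(state.get(key, 0), event["value"])

def sum_rule(acc: int, v: int) -> int:
    return acc + v

handle({"customer": "c1", "value": 10}, sum_rule)
handle({"customer": "c1", "value": 5}, sum_rule)
print(state)  # {'c1': 15} under the original sum rule

# A client asks for a different rule (or a bug is fixed): replay.
replay(lambda acc, v: max(acc, v))
print(state)  # {'c1': 10} recomputed from the same raw events
```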
Daniel agreed and added interesting details:
I agree. It goes back to your original point that there are a lot of streaming use cases around, some on a truly massive scale, but very few actual event-sourcing architectures.
Also, perhaps it's useful to observe that the first case was building a payments system from the ground up to be a streaming ledger, whereas typically they are migrating from a mainframe or other database-as-ledger approach. And the second is taking a problem that the existing RDBMS couldn't scale to, and reworking it into a Kafka + OLAP solution.
Replay and archiving to cold storage, amongst other things, are part of what makes the managed service so valuable, as you note. We find customers get a lot of value from our data engineers optimizing their queries specifically for ClickHouse performance - doing it themselves would require a larger and more expensive internal team for every customer, and it would probably be an order of magnitude more expensive if you had to hire the unholy trinity of Kafka + Flink + Cloud developers to service it.
Ultimately, I think if event sourcing were such a great solution to particular problems, the 'streaming is the answer, what was the question' companies like Confluent and StreamNative would have more use cases for it on their homepages.
Moshe circled back with good news about the project that started the discussion: Event sourcing was proposed as a high-level futuristic idea, but didn’t seem to be a likely future direction for their product architecture.
Recommended reading / watching from the community:
Robin Moffatt shared a rather amusing blog post about the various stages SaaS companies go through when it comes to deploying and running their applications.
Colt shared an epic blog post about the many ways SaaS products can be delivered. He went into more detail and nuance than one typically sees when discussing the topic. Highly recommended.
And I talked about transaction isolation levels on the SaaS Developer YouTube channel:
More good Slack conversations:
Unfortunately, I don’t have the bandwidth to summarize every great discussion here (ping me if you want to volunteer?). So here are links to other great conversations. They will be gone by September, so grab them while you can.
Colt started a conversation about the incentives of cloud vendors to contribute to K8s and other OSS projects. Mitch, Aaron Kimball, Lucas Stephens, Buchi Reddy, and Moshe Eshel all shared interesting viewpoints.
Lucas shared the Kubernetes documentary, which I haven’t watched yet but which looks like a must-see:
This forked off into two conversations about (what else?) cloud pricing:
Aaron Kimball made the case that cloud vendors sell mostly RAM and networking since CPU and disk space are rarely the bottlenecks. Shikhar’s workloads turned out to be CPU-bound. And in another thread, Moshe Eshel made a convincing case that cloud vendors don’t sell resources at all. They charge for resources, but they sell elastic capacity - the ability to get 1000s of machines in a click.
Our first strategy was to test out services on AWS and, once we learned the workload pattern, to cost-optimize by moving to our DC. However, the cloud's capacity and stability eventually led to moving everything there...
Jeffrey Sherman asked: Do you see yourself more as someone who provides a service or someone who writes software? Moshe Eshel, Colt, Daniel Chaffelson, and Aaron Kimball all had great stories that illustrate the difference. The community is in wild agreement that developers solve problems and deliver value. Writing code is just a part of the job.
Moshe Eshel asked whether anyone has experience with Yugabyte. Rauan Mayemir shared what he learned while investigating their solution, Daniel Chaffelson shared thoughts on the space, and Colt added internal architecture information (and admitted that his friendship with Yugabyte leadership makes him biased).
Also, check out questions and conversations about alerting, EDR, customer dashboards, cloud marketplaces, and push vs. pull in control plane architectures.
Did I mention that if you haven’t joined the SaaS Developer Slack, you are missing out?