Hacking SaaS #16 - Hard problems with Simple Solutions

Collection of stories from engineering teams who solved difficult SaaS problems brilliantly - with clean and cohesive architectures.

Mar 30, 2023

Or shall we go with base 16 and call it Hacking SaaS 0x10? This makes it a nice round number worthy of a small celebration.

Hacking SaaS, the SaaS Developer Slack, and the SaaS Developer podcast are all community projects by Nile, the database for modern SaaS applications.

Multi-tenant Ingest at Datadog

Datadog wrote about how they built large-scale, multi-tenant, exactly-once data ingestion for Husky, their adorably named event store.

The blog goes into a lot of technical detail, but at a high level, they pulled off a really cool trick: they took three big and challenging requirements for their ingestion pipeline (exactly-once ingestion, multi-tenancy, and low latency) and found a solution that optimizes for all of them at once. I love these brilliant engineering insights that lead to cohesive architectures.

Husky’s storage engine is almost completely optimized around serving large scan and aggregation queries [….] This design posed a challenge on the ingestion side: how can we guarantee that data is ingested into Husky exactly once, ensuring that there are never duplicate events? At the same time, Datadog is a massive-scale, multi-tenant platform. Our solution would have to work with our existing multi-tenant ingestion pipelines while keeping ingestion latency reasonable and without blowing up costs. It turns out that these challenges are more related than they first appear. In this post, we describe how we overcome these problems to create auto-scaling, multi-tenant data ingestion pipelines that guarantee exactly once ingestion of every event into Husky’s storage engine.

Airtable - Migrating a Multitenant Architecture to MySQL 8.0

Airtable’s storage team wrote about their experience migrating 100+ shards and millions of tenants to MySQL 8.0. The blog includes a lot of fun details about their planning, testing, tooling and some extra fun bugs that they encountered.

The way they handled the “Upsert” locking bug is especially brilliant - completely avoiding the locking by using a unique property of their application:

Fortunately, we were able to exploit a unique property of our base shard workload. Base shards are responsible for storing base-scoped data, and each base’s operations are serialized through a single NodeJS server process. Conceptually, this means we should never have multiple concurrent read or write operations for base data, so we don’t need the stronger snapshot isolation properties of REPEATABLE READ.

It was so interesting that I invited Andrew Wang, the team leader and blog author, to the SaaS Developer YouTube channel to discuss it in more detail. In the video he shared their architecture and discussed different isolation levels and some of their future scalability plans.

Do collaborative applications require CRDTs?

Collaborative apps are one of the hottest trends in modern SaaS, driven in part by the success of Figma. Reading papers on collaborative apps, you may come to believe that you have to use CRDTs. CRDTs are great, but they can be a serious engineering challenge - especially when it comes to troubleshooting.

This blog by Paul Butler explains CRDTs in depth and then shows that most SaaS products can use a simpler and cleaner solution.

So far, our assumption has been that we have a reliable but unordered broadcast channel [….] CRDTs are complex, in both the runtime overhead and cognitive load senses, but in a peer-to-peer environment, this is a necessary cost.
In contrast, browsers are inherently not peer-to-peer. To run an application from the web, you connect to a server. Since the server is centralized anyway, we can have it enforce a global ordering over the events. That is, every replica receives events in the same order. With this, we can sidestep the need to pay the CRDT complexity tax.
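
To make the global-ordering idea concrete, here is a minimal sketch in Python. It is not code from Paul Butler's post: the Server and Replica classes and the last-write-wins rule for each key are assumptions made up for illustration. The server appends every event to a single log, and each replica applies the log in that order, so replicas converge no matter when they sync.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical central server: the append-only log defines the global order."""
    log: list = field(default_factory=list)

    def submit(self, client_id, key, value):
        # The index of the entry in the log is its global sequence number.
        self.log.append((client_id, key, value))
        return len(self.log) - 1


@dataclass
class Replica:
    """Hypothetical client replica that applies events strictly in log order."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # how many log entries this replica has applied so far

    def sync(self, server):
        # Apply only the entries not seen yet, in the server's order.
        for _client, key, value in server.log[self.applied:]:
            self.state[key] = value  # the later event in the global order wins
        self.applied = len(server.log)


server = Server()
a, b = Replica(), Replica()

server.submit("alice", "title", "Draft")
b.sync(server)  # b sees the first edit early
server.submit("bob", "title", "Final")
server.submit("alice", "color", "blue")
a.sync(server)  # a catches up on everything at once
b.sync(server)  # b applies only the two entries it missed

print(a.state == b.state, a.state)
```

The two replicas sync at different times, yet they end in the same state because both replay the same sequence. A real collaborative app would also need networking, persistence, and handling for edits made while offline, none of which this sketch covers.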

Economy of Scale for Multi-Tenant Applications

Marc Brooker is quickly becoming a community favorite blogger. This week he wrote about how multi-tenant applications (and nearly all SaaS applications are multi-tenant) can have surprising cost advantages once they reach a large enough scale, thanks to the magic of fixed overhead costs and amortization. Multi-tenancy does require infrastructure investment up front, but it will pay for itself as the business grows.

If you are curious about the secrets of operating services like S3 and Lambda at scale, this blog will shed some light on the matter.

Roughly speaking, the cost of a system scales with its (short-term¹) peak traffic, but for most applications the value the system generates scales with the (long-term) average traffic.
[….] multi-tenancy (i.e. running a lot of different workloads on the same system) very effectively reduces the peak-to-average ratio that the overall system sees. This is highly beneficial for two reasons. The first-order reason is that it improves the economics of the underlying system, by bringing costs (proportional to peak) closer to value (proportional to average). The second-order benefit, and the one that is most directly beneficial to cloud customers, is that it allows individual workloads to have higher peaks without breaking the economics of the system.
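
A quick back-of-the-envelope sketch of the peak-to-average effect, in Python. The tenant names and hourly request counts below are made-up numbers chosen for illustration, not data from Marc's post. Each tenant is spiky on its own, but because the spikes fall in different hours, the combined workload is much flatter.

```python
def peak_to_average(traffic):
    """Ratio of the peak value to the mean value of a traffic series."""
    return max(traffic) / (sum(traffic) / len(traffic))


# Hypothetical requests per hour for three tenants with non-overlapping spikes.
tenants = {
    "tenant_a": [10, 10, 100, 10, 10, 10],
    "tenant_b": [10, 100, 10, 10, 10, 10],
    "tenant_c": [10, 10, 10, 10, 100, 10],
}

# Ratio each tenant would see if it ran on its own dedicated system.
individual = {name: peak_to_average(t) for name, t in tenants.items()}

# Traffic seen by one shared system: sum the tenants hour by hour.
combined = [sum(hour) for hour in zip(*tenants.values())]
combined_ratio = peak_to_average(combined)

print(individual)      # each tenant alone: 4.0
print(combined_ratio)  # all tenants together: 1.6
```

With these numbers, three dedicated systems would each be provisioned for a peak of 100 (300 in total), while the shared system only has to handle a peak of 120 to serve the same traffic. The benefit shrinks if tenant peaks are correlated, for example when every tenant spikes at the same hour.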

Building ClickHouse Cloud

ClickHouse wrote about the architecture of their recently launched ClickHouse Cloud. It is very much a classic control-plane/data-plane architecture, and their SaaS services (auth, billing, notifications) are part of the control plane.

The blog post starts with their product development process and requirements. After this introduction, it describes their architecture decisions across the entire control plane surface: from k8s networking to authentication and product analytics. As a result of this broad focus, anyone who reads it will learn something new.

Take an hour or two and a beverage of your choice and just read the whole thing. It will be an hour well spent with good ROI.

StackOverflow Annual Developer Survey

If you haven’t yet, start by checking out StackOverflow’s annual developer survey 2022. For me the most surprising insight this year was the gap between technologies popular among those learning to code vs professional developers.

Professionals use Postgres, new coders use MySQL. Professionals use AWS while beginners use Heroku.

Professionals use TypeScript a lot more than beginners, and beginners use Python a lot more. Kinda makes sense - you need a bit of experience to realize that the extra effort around type safety is more than worth it.

Let us know which insights you discovered in our Slack community.
