Hacking SaaS #12 - Collection of Data Architectures
In which I share interesting blogs, papers, and podcasts about solid data architectures, both real-time and multi-tenant.
Welcome to 2023, the year in which we’ll all build and scale SaaS products! And let’s kick it off with some data architectures.
Real-Time Data Apps
Last year Uber published a paper on Real-time Data Infrastructure describing 8 years of work building and scaling the real-time data platform that powers all of Uber (including UberEats). The paper focuses on the many use-cases that fall under the “real-time data applications” umbrella and how their requirements are often conflicting. For example:
For instance, dynamic pricing for a given Uber product (such as rides or eats) is a highly complex real-time workflow... This system is designed for favoring freshness and availability over data consistency, and it’s implemented entirely by engineers. On the other hand, monitoring real-time business metrics around orders and sales requires a SQL-like interface used by data scientists with more emphasis given to data completeness.
Uber’s actual architecture is probably overly complex for almost anyone who isn’t Uber, but their map of use-cases and requirements is a must-read for anyone thinking about real-time data.
The use-cases Uber analyzes are external-facing - products that drivers, riders, couriers and restaurant managers use. Compare this to an older paper on Facebook’s real-time data pipelines. The Facebook paper focuses on internal use-cases and has fewer requirements - namely, Freshness and Cost are missing. I think it shows how much our use of real-time data matured over 5 years.
Most of us are not Uber-scale, but many of us still want real-time data in our applications, so we need a plan that starts with a simple use-case and a simple architecture and evolves as we grow. For example: counting how many messages are sent from each IP every day.
With Redis and Kafka we quickly built a system that could count hundreds of millions of unique keys. More importantly, this architecture let us “throw money at the problem” for a while as we scaled other parts of our infrastructure.
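As a sketch of the kind of thing I mean (the topic name, event fields, and client libraries - kafka-python and redis-py - are my assumptions here, not the actual system we ran):

```python
# Minimal sketch of a Kafka -> Redis counting pipeline: consume message
# events and keep one atomic counter per (day, source IP) in Redis.
import json

import redis
from kafka import KafkaConsumer  # kafka-python

r = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "messages",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v),
)

for record in consumer:
    event = record.value  # e.g. {"date": "2023-01-09", "source_ip": "203.0.113.7"}
    key = f"msg_count:{event['date']}:{event['source_ip']}"
    pipe = r.pipeline()
    pipe.incr(key)                      # INCR is atomic, so no read-modify-write races
    pipe.expire(key, 7 * 24 * 60 * 60)  # keep a week of daily counters
    pipe.execute()
```

Redis keeps every counter in RAM, which is exactly what makes this quick to build and easy to scale by paying more - and exactly where the cost problem comes from.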
Eventually, when costs became an issue, we migrated from Redis to RocksDB and added Kubernetes. We saved thousands of dollars a month and reduced the RAM requirements of our application by about 1 TB.
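The embedded-store version of the same counter might look roughly like this - using the rocksdict bindings purely for illustration; any RocksDB binding would do:

```python
# Sketch: the same per-IP daily counter on RocksDB instead of Redis.
# Counters now live on local disk (with RocksDB's block cache in front),
# trading some latency for a much smaller RAM footprint.
from rocksdict import Rdict  # one of several Python RocksDB bindings

db = Rdict("./ip_counters")

def increment(source_ip: str, day: str) -> None:
    key = f"msg_count:{day}:{source_ip}"
    current = db.get(key) or 0  # read-modify-write; fine with a single writer
    db[key] = current + 1

increment("203.0.113.7", "2023-01-09")
```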
Multi-tenant Data Architectures
Another oldie-but-relevant bit is this podcast, with full transcript, in which a Salesforce engineer talks about their multi-tenant architecture. Salesforce is the original SaaS, so their journey is interesting, and they also compared the Salesforce model to Heroku - which is a very different type of SaaS product. So you get some Infra SaaS (Heroku) and some Application SaaS (traditional Salesforce). They talk about isolation, security, utilization, operability, cost, scaling up and down… all the good stuff.
If you're talking about ephemeral compute, you can scale that up and scale that down pretty easily, right? You can just even all the way to using like a serverless approach where somebody else worries about scaling it up and down for you. But from a storage perspective, databases and things like that, if you've got separate resources actually physically spun up for all of those, it gets really expensive really fast.
So just the impact of that on the cost structure of the service but then, I mean, think about also just even on the environment and things like that, there's just a ton of waste there. That's why for the majority of services that Salesforce runs, that's why we run it in that shared resource mode.
Now of course, it takes a lot more work to build the software in such a way that it's going to work. But then once you've done that, you have that as an option.
All SaaS products have tenants, but there are many different ways to manage multi-tenancy. As you’ve just seen, the model can differ for compute vs storage, for startups vs enterprise customers, and across the control plane, the data plane and the application plane. AWS wrote a great paper that summarizes some of the fundamental decisions and concepts in SaaS architectures:
The terms multi-tenancy and SaaS are often tightly connected. In some instances, organizations describe SaaS and multi-tenancy as the same thing. While this might seem natural, equating SaaS and multi-tenancy tends to lead teams to take a purely technical view of SaaS when, in reality, SaaS is more of a business model than an architecture strategy.
This is both funny and encouraging to read! I’m often worried that too many teams take a purely business approach, to the point of not having engineers on the team at all. This was one of the reasons we created the SaaS Developer community - because no one talked about the technical problems in SaaS, just about the business problems.
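Before we get to scaling, it helps to make “pooled” concrete: in a pooled data layer all tenants share the same tables and every tenant-owned row carries a tenant ID, while a siloed layer gives each tenant its own database or schema. A toy sketch of the pooled shape (the table and column names are made up):

```python
# Toy illustration of a pooled (shared-table) data layer: rows from all
# tenants live in one table, and every data-plane query filters on tenant_id.
import sqlite3  # stand-in for whatever database you actually use

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoices (
        tenant_id    TEXT NOT NULL,  -- the pooling key
        invoice_id   TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        PRIMARY KEY (tenant_id, invoice_id)
    )
    """
)

def invoices_for_tenant(tenant_id: str):
    # Scoping every query to one tenant is the whole game here; a missing
    # tenant_id filter is the classic pooled-tenancy isolation bug.
    return conn.execute(
        "SELECT invoice_id, amount_cents FROM invoices WHERE tenant_id = ?",
        (tenant_id,),
    ).fetchall()
```

A siloed layer swaps the tenant_id column for a connection-per-tenant lookup: easier to reason about for isolation but, as the Salesforce quote above notes, it gets really expensive really fast.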
Assuming your SaaS application is successful, you’ll need to scale it. If you opted for a multi-tenant or “pooled” data layer, scaling will likely involve some sharding. Notion published a truly amazing blog on their migration to a sharded database.
They shared what drove their decision to shard:
For us, the inflection point arrived when the Postgres VACUUM process began to stall consistently, preventing the database from reclaiming disk space from dead tuples. While disk capacity can be increased, more worrying was the prospect of transaction ID (TXID) wraparound, a safety mechanism in which Postgres would stop processing all writes to avoid clobbering existing data. Realizing that TXID wraparound would pose an existential threat to the product, our infrastructure team doubled down and got to work.
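If you run Postgres yourself, wraparound risk is cheap to watch for. A sketch, assuming the psycopg2 client and a placeholder connection string (the query itself is standard Postgres):

```python
# Sketch: check how far each Postgres database is from TXID wraparound.
# age(datfrozenxid) is the number of transactions since the last aggressive
# freeze; autovacuum starts forced freezing at autovacuum_freeze_max_age
# (200 million by default), and Postgres stops accepting writes as the age
# approaches ~2 billion.
import psycopg2  # assumed client library

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC")
    for datname, xid_age in cur.fetchall():
        print(f"{datname}: {xid_age:,} transactions since last freeze")
```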
The choice of partition key is relevant because, the way I read this, a Notion workspace is basically a tenant. Sharding by tenant is a standard model for scaling SaaS:
Each workspace is assigned a UUID upon creation, so we can partition the UUID space into uniform buckets. Because each row in a sharded table is either a block or related to one, and each block belongs to exactly one workspace, we used the workspace ID as the partition key. Since users typically query data within a single workspace at a time, we avoid most cross-shard joins.
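In code, tenant-based routing can be tiny. A sketch with 480 shards per the blog - though the mod-480 bucketing below is my simplification, not necessarily Notion’s exact scheme:

```python
# Sketch of tenant-based shard routing: map a workspace UUID to one of 480
# logical shards by partitioning the UUID space into uniform buckets.
import uuid

NUM_SHARDS = 480

def shard_for_workspace(workspace_id: str) -> int:
    # UUIDs are uniformly distributed, so mod gives roughly uniform buckets.
    return uuid.UUID(workspace_id).int % NUM_SHARDS

# Every block belongs to exactly one workspace, so a single-workspace
# query (the common case) touches exactly one shard.
print(shard_for_workspace("5f2b6a9e-3c1d-4e8f-9a7b-123456789abc"))
```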
Go read the entire blog though - they share many important details, and it is pretty funny too:
You may be wondering, "Why 480 shards? I thought all computer science was done in powers of 2, and that's not a drive size I recognize!"
There were many factors that led to the choice of 480: 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 30, 32, 40, 48, 60, 80, 96, 120, 160, 240!
I’m still giggling every time I look at this admittedly silly joke.
And finally, over on the SaaS Developer channel I explained what people mean when they talk about “separating compute and storage” in database systems. You can also read the explanation on Nile’s blog, “Compute-Storage Separation Explained”.
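The one-paragraph version, as a toy sketch: compute nodes hold no durable state and pull immutable data from shared object storage, so you can scale compute and storage independently (the bucket and key layout below are invented for the example):

```python
# Toy illustration of compute-storage separation: any stateless query node
# can serve any query, because the data lives in shared object storage
# rather than on the node's local disk.
import boto3  # assumed S3 client; any shared object store works the same way

s3 = boto3.client("s3")

def read_segment(table: str, segment_id: str) -> bytes:
    # Scaling reads = adding compute nodes; scaling data = adding objects.
    obj = s3.get_object(Bucket="analytics-segments", Key=f"{table}/{segment_id}")
    return obj["Body"].read()
```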
That’s it for the week!