Hacking SaaS #9 - Few Good Control Planes
Snowflake published a paper with their control plane architecture, Clumio published a 30s video, Nile blogged an example, LinkedIn open-sourced Venice, and we found the perfect blog for learning CORS.
18 months ago, when Ram and I started the SaaS Developer Slack, our goal was to create a space for learning. We wrote: Our community is a space to discuss problems and share learnings about building SaaS companies and products.
Architecture papers from companies with a strong track record can massively accelerate the learning of everyone in the space. Think how many companies created more scalable architectures after Amazon published the legendary DynamoDB paper.
✨ This background may help explain my excitement about Elastic Cloud Services: Scaling Snowflake’s Control Plane. The paper is readable but very dense with interesting ideas, so you can’t really skim it. If your time is short, you can get a lot of value from reading just a single section - on replication for instance, or throttling. Just the abstract shows how much ground this paper covers:
In this paper, we describe the design and operation of Snowflake’s Elastic Cloud Services (ECS) layer that manages cloud resources at global scale to meet the needs of the Snowflake Data Cloud. It provides the control plane to enable elasticity, availability, fault tolerance and efficient execution of customer workloads. ECS runs on multiple cloud service providers and provides capabilities such as cluster management, safe code rollout and rollback, management of pre-started pools of running VMs, horizontal and vertical autoscaling, throttling of incoming requests, VM placement, load-balancing across availability zones and cross-cloud and cross-region replication. We showcase the effect of these capabilities through empirical results on systems that execute millions of queries over petabytes of data on a daily basis.
And an example of a good idea that is literally a single sentence. Blink and you missed it:
The replication mesh is using specific cloud service provider abstractions to achieve data replication for each cloud provider. For example, in AWS deployments the replication mesh operates over Replication S3 buckets distributed across Snowflake accounts.
✨ Snowflake’s paper links to a 2019 paper from UC Berkeley that lays out a vision for Serverless cloud infrastructure and the challenges our industry is facing. If you want to know the future and don’t mind reading about yet unsolved problems, it is highly recommended.
✨ And since we are on the topic of papers, I was super excited to discover that the motivating example in Google’s F1 paper is a SaaS product - Google AdWords. Re-reading the paper with SaaS challenges and data models in mind was an eye-opening experience.
✨ If you want to hear about other people’s control planes - but quickly, Clumio and AWS published a 30s video snippet where Clumio explains their architecture. It has a 3-tier architecture with a multi-tenant SaaS control plane, a customer-specific control plane and a data plane on the customer account. Bill Tarr from AWS’s SaaS partnership team shared this video as a teaser for his upcoming Re:Invent talk.
✨ Nile published a walkthrough of an MVP control plane for a fictional Infra SaaS company. The blog goes over the features needed to launch an Infra SaaS product and the alternatives for implementing them.
Walking through the infrastructure SaaS workflow of provisioning an ETL pipeline highlighted some complex problems that need to be solved, how to:
provide a database as a source of truth with built-in multi-tenancy
give developers an event service to reconcile with the data plane
serve up metrics for consumption-based billing, experimentation, and other business operations
authorize users with a flexible access control model
provide great UIs and API along with a slick frontend with web components customized to the backend
✨ SaaS requires full-stack knowledge, which includes web protocols, standards and browser-fu. One of the most interesting things I learned recently is CORS - a 2014 standard that allows JS code in the browser to make requests to 3rd party services. Without CORS, the API economy wouldn’t be the same. If you are a platform engineer, APIs are your product, and understanding CORS is key to making sure your APIs are both secure and usable from frontend code. The "What is CORS" blog does a good job of both explaining the basics and covering every single gotcha that I’ve seen teams run into.
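To make the CORS idea a bit more concrete, here is a minimal sketch of the server-side decision: given the Origin header of a browser request, which CORS response headers should the API send back? This is an illustration under assumptions, not taken from the linked blog: the function name `cors_headers`, the allowlist `ALLOWED_ORIGINS`, the origin `https://app.example.com`, and the specific allowed methods and headers are all hypothetical. It also simplifies preflight detection to "any OPTIONS request"; a real preflight additionally carries an Access-Control-Request-Method header.

```python
from typing import Dict, Optional, Set

# Hypothetical allowlist of frontend origins that may call the API.
ALLOWED_ORIGINS: Set[str] = {"https://app.example.com"}


def cors_headers(origin: Optional[str], method: str,
                 allowed: Set[str] = ALLOWED_ORIGINS) -> Dict[str, str]:
    """Return CORS response headers for a request, or {} if the origin is not allowed."""
    # No Origin header (not a cross-origin browser request) or unknown origin:
    # send no CORS headers, so the browser blocks the calling script from
    # reading the response.
    if origin is None or origin not in allowed:
        return {}

    headers = {
        # Echo the specific origin rather than "*"; a wildcard cannot be
        # combined with credentialed requests.
        "Access-Control-Allow-Origin": origin,
        # Tell caches that the response differs per Origin.
        "Vary": "Origin",
    }

    # Simplified preflight handling: the browser sends OPTIONS first to ask
    # which methods and headers the real request may use.
    if method == "OPTIONS":
        headers["Access-Control-Allow-Methods"] = "GET, POST, OPTIONS"
        headers["Access-Control-Allow-Headers"] = "Authorization, Content-Type"
        headers["Access-Control-Max-Age"] = "600"  # cache the preflight for 10 minutes

    return headers
```

The design choice worth noticing is that CORS is enforced by the browser, not the server: the server only declares which origins it trusts, so these headers complement authentication on the API and do not replace it.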
✨ And… new on the SaaS developer channel! Our nerdiest episode ever. 🤓 LinkedIn open-sourced Venice, and I had a great time talking to Felix GV, the lead architect, about the use-cases and all the interesting architecture choices that he made. Highly recommended to watch this and maybe a few more Venice talks - while Venice is still a bit new as a project, the wide range of use-cases it covers makes it a DB-to-watch in my book.
Disclosure: I’m a co-founder of Nile