Hacking SaaS #8 - Uptime, Tenants and Security
Lessons from DynamoDB, Workday, a Scandinavian media company and more.
This newsletter/blog now has over 1,500 subscribers. This is pretty exciting - I hope you all find it useful and entertaining. Let me know if there’s anything I can do to make it better for you. And now, on to the content:
DynamoDB has been around for 10 years. I’m old enough to still think of it as that new DB, but I have to admit that it has grown up to become one of the most impressive managed databases. A 99.999% SLA. Their last significant outage happened in 2015. Multi-tenant, with consistent single-digit millisecond latency at p9999. Global replication. Auto-scaling. In our DB Survey, most people say they use DynamoDB because it is completely ops-free.
Truly the gold standard for Infra SaaS. I’m sure many of us want to know how they did it! Luckily, they published a paper about the lessons they learned in the ten years they’ve been operating the service at scale. They also published a short summary in a blog post.
The paper captures the following lessons that we have learnt over the years:
• Adapting to customers’ traffic patterns to reshape the physical partitioning scheme of the database tables improves customer experience.
• Performing continuous verification of data-at-rest is a reliable way to protect against both hardware failures and software bugs in order to meet high durability goals.
• Maintaining high availability as a system evolves requires careful operational discipline and tooling. Mechanisms such as formal proofs of complex algorithms, game days (chaos and load tests), upgrade/downgrade tests, and deployment safety provides the freedom to safely adjust and experiment with the code without the fear of compromising correctness.
• Designing systems for predictability over absolute efficiency improves system stability. While components such as caches can improve performance, do not allow them to hide the work that would be performed in their absence, ensuring that the system is always provisioned to handle the unexpected.
Reaching a 99.999% SLA is a matter of continuous investment and improvement over many years. Having an availability goal and working toward it helps align engineering teams and the business on requirements, efforts and expectations. SaaS Developer Community member Ken Finnigan leads the SLO practice at Workday. His job is to implement industry best practices across engineering teams, and he shared his advice with us. He is happy to discuss the topic more on the SaaS Community Slack, so join us!
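To make an availability goal concrete, it helps to translate the percentage into a downtime budget. The snippet below is a minimal sketch of that arithmetic, not anything taken from the DynamoDB paper or from Workday's practice; the function name and the 30-day and 365-day windows are assumptions chosen for illustration.

```python
def allowed_downtime_seconds(availability_pct: float, window_days: float) -> float:
    """Return the seconds of downtime permitted within the window.

    availability_pct is the target as a percentage, e.g. 99.999.
    window_days is the length of the measurement window in days.
    Both the name and the windows used below are illustrative assumptions.
    """
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common targets over a 30-day and a 365-day window.
    for target in (99.9, 99.99, 99.999):
        monthly = allowed_downtime_seconds(target, 30)
        yearly = allowed_downtime_seconds(target, 365)
        print(f"{target}%: {monthly:.1f} s per 30 days, {yearly / 60:.1f} min per year")
```

At five nines the budget works out to roughly 26 seconds per 30 days, or a little over 5 minutes per year, which is why a goal like this has to be agreed with the business rather than assumed.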
A short, to-the-point blog post from someone who ran two services, each with a different multi-tenant model, and compared the experiences. Many papers have been written about selecting a multi-tenant model and the pros and cons of each, but there’s something special about a blog post that describes a specific experience in concrete terms. And some problems are truly challenging, no matter which route you choose:
Since we are talking about taking services down, let’s talk a little bit about recovery. How many of you have truly fully automated recovery procedures in place? Many things can go wrong when deploying a new version of a service: There can be a bug in the application code, a mistake in the database schema migration, or a mismatch in the DNS configuration. Quite often something unexpected happens, which you are not fully prepared for. In such a scenario, you need to take manual action to fix the problem. Would you like to revert the change on one database or on a number of tenant databases?
I recently discovered the Cloud Native Security guide from Kubernetes. I love their defense-in-depth approach, but I am bothered that they left out any mention of a data store. Do they assume that Cloud Native applications don’t store data? Isn’t the data store one of the most important places to apply security?
Each layer of the Cloud Native security model builds upon the next outermost layer. The Code layer benefits from strong base (Cloud, Cluster, Container) security layers. You cannot safeguard against poor security standards in the base layers by addressing security at the Code level.
I recently participated in a panel of SaaS Experts on the topics of… you guessed it - multi-tenancy and security. It was a great discussion that covered technical topics as well as business implications. The tenant model often needs to take into account various service tiers and provide different guarantees at different tiers.
We all enjoy a good detective story, even if it isn’t directly related to SaaS. So grab a cup of tea and enjoy this story of an engineer who tried to track down a connection leak. If you know TCP well, you’ll enjoy trying to guess the culprit. If you don’t know TCP well, you will know a lot more after reading this. And the conclusions apply to almost everyone who has ever run a service:
The Linux networking subsystem is complex. There is no way to fully understand this simply by reading books. You have to get your hands dirty with bugs like these to sharpen your understanding.