Hacking SaaS #33: Much AI about Something
Catching up with AI content is difficult, but here are a few things I think you shouldn't miss. And some non-AI content as well.
As you may have noticed, AI is exciting again. I got caught up in the excitement, and published a few examples (code assistant blog and video, smart todo list video, sales insights video). As often happens, once you start getting into a new area of interest, it becomes a rabbit hole - you keep digging deeper and deeper, learning new things all the while. AI has been like that, except it is also moving fast - as if the rabbit hole is trying to escape and we have to chase it.
So here are some new things I’ve learned this week that I think you should know as well. And if you are not an AI fan, skip all the way to the end, where I share general SaaS content that you may enjoy.
Quantization
I learned about it a few weeks back, and I thought I was the last person to find out about quantization and how important it is. But I keep running into other people who don’t know about it, so I keep spreading the word.
Simply put - LLMs (and neural networks in general) are essentially collections of very, very large vectors and matrices. Typically, the numbers in these vectors and matrices are represented as 32-bit floating point numbers. Quantization is the process of replacing these 32-bit floats with 16-bit or 8-bit types in a way that leaves model performance unaffected (or with minimal impact).
Sounds simple, but it matters a ton - you can run a model using a quarter of the memory, a fraction of the CPUs (or GPUs), and a fraction of the latency, with minimal degradation. That is huge in time and money saved, and it may let you run models you simply couldn’t use before.
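To make this concrete, here is a minimal sketch of symmetric 8-bit quantization in NumPy - an illustration of the principle, not how production libraries actually implement it:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights to int8."""
    scale = float(np.abs(weights).max()) / 127.0   # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # a 64 MB float32 matrix
q, scale = quantize_int8(w)                          # the same weights in ~16 MB
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

Real schemes (per-channel scales, FP8, GPTQ, AWQ) are more sophisticated, but the core trade is exactly this: fewer bits per number, tiny reconstruction error.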
Hugging Face has a good summary about quantization. If you browse models on Hugging Face, you’ll discover that popular OSS models also have quantized variants. When I built the Sales Insights example and deployed it on Modal, Charles Frye suggested that I replace the Llama 3.1 8B model I’d been using with an FP8 equivalent from Neural Magic. It was a great suggestion - my app became faster and my Modal bill went down.
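If you want to try the same swap, it is typically a one-line change. Here’s a hedged sketch using vLLM - the repo names below are my best guess at the checkpoints involved, so check Hugging Face for the exact ones:

```python
from vllm import LLM, SamplingParams

# Swap the full-precision checkpoint for the FP8 one - same inference code,
# less GPU memory. Model names are assumptions; verify the exact repos.
# llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")      # BF16 baseline
llm = LLM(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8")   # FP8 variant

outputs = llm.generate(
    ["Summarize this sales call in three bullet points: ..."],
    SamplingParams(max_tokens=200, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```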
Quantization isn’t just useful for models - it is also great for vector embeddings. If you store vectors using quantized types, you can fit more dimensions in less space, again speeding up your vector similarity search. In pgvector (the Postgres vector store), the halfvec type lets you index vectors with up to 4,000 dimensions (instead of the previous 2,000 limit). And it turns out that in most cases, the 16 least significant bits don’t have much impact on finding nearest neighbors.
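Here’s a hedged sketch of what that looks like with pgvector from Python - the table, connection string, and dimension count (3,072, matching e.g. OpenAI’s text-embedding-3-large) are illustrative:

```python
import psycopg  # psycopg 3

with psycopg.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # halfvec stores 16-bit floats, so 3072-dim embeddings fit under the
    # 4,000-dimension index limit (full-precision vector caps at 2,000).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            embedding halfvec(3072)
        )
    """)
    cur.execute(
        "CREATE INDEX IF NOT EXISTS docs_embedding_idx "
        "ON docs USING hnsw (embedding halfvec_cosine_ops)"
    )
    conn.commit()
```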
Jonathan Katz has a wonderful blog with benchmarks that show the power of quantized types in pgvector. I was so surprised by how few people know about such a powerful feature that I included it in my own pgvector blog.
Matryoshka Learning
Another way to get smaller vectors is to use fewer dimensions. Reducing the number of dimensions in a vector is known as dimensionality reduction, and there are several well-known techniques for it (PCA, t-SNE, UMAP). But these aren’t always trivial to apply.
It turns out that you can train a model to output useful vectors at multiple sizes, maximizing the information captured at each size. This technique is called Matryoshka Representation Learning (after the Russian dolls that nest inside each other).
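The payoff is that at query time you can simply truncate the embedding to the size you need and renormalize. A minimal sketch - the model name and prefix convention are my reading of Nomic’s writeup, so treat them as assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model: nomic-ai/nomic-embed-text-v1.5, trained with the Matryoshka
# objective (it expects a task prefix like "search_query: " on inputs).
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
full = model.encode(["search_query: what is quantization?"])[0]  # full-size vector

def truncate(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

small = truncate(full, 256)  # ~3x smaller, keeps most of the retrieval quality
```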
The original paper is worth reading, or if you are in a hurry, Nomic (who implemented such models) wrote a short summary.
ColPali - document similarity search
Searching for relevant PDFs is both a basic example of RAG (“Chat with PDF” is a common beginner example that can be done in 5 lines of code with some frameworks) and annoyingly complex and challenging to do well. It gets even harder if the PDF has images, tables, etc.
To do it well, you need to extract the text, extract the images, tie images to text in a meaningful way, split all of this into meaningful chunks, and then generate vector embeddings and store them somewhere. There are OSS libraries for some of this, but it still isn’t simple, and the quality isn’t always there.
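For contrast, here is a bare-bones sketch of the classic text-only pipeline - pypdf and sentence-transformers are my choices for illustration, and real pipelines also need image extraction, table handling, and smarter chunking:

```python
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

reader = PdfReader("report.pdf")  # hypothetical file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naive fixed-size chunking with overlap - real pipelines split on structure.
size, overlap = 1000, 200
chunks = [text[i : i + size] for i in range(0, len(text), size - overlap)]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)  # one vector per chunk
# ...then store the (chunk, embedding) pairs in your vector store of choice.
```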
ColPali is a new approach - instead of extracting text, chunking it, and embedding it, what if we could embed images of the pages directly? It sounds like something that should never work, but reports say it works quite well (caveat: I haven’t tried it myself).
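Under the hood, ColPali embeds each page image into many patch-level vectors and scores queries ColBERT-style with “MaxSim” late interaction: each query token vector picks its best-matching patch, and the maxima are summed. A sketch of that scoring rule, with illustrative shapes:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token embedding, take the max
    dot product over all page-patch embeddings, then sum over query tokens."""
    sims = query_vecs @ page_vecs.T  # (n_query_tokens, n_patches)
    return float(sims.max(axis=1).sum())

query = np.random.randn(12, 128)   # e.g. 12 query tokens, 128-dim vectors
page = np.random.randn(1024, 128)  # e.g. 1024 image-patch vectors per page
print(maxsim_score(query, page))   # rank pages by this score
```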
You can read the announcement, a slightly longer writeup by the author on how it works, and a good blog post about the RAG use case.
Not-AI but worth reading
The QCon presentation titled “Building SaaS from Scratch using Cloud Native Patterns” is bound to be interesting to my readers. The Control Plane / Data Plane pattern should be familiar by now. Resource Service / API may be a new name for a familiar pattern, or perhaps it will be brand new to you. Overall, well presented and quite interesting.
As you may know, I’m a big fan of embedding real-time analytics in SaaS products. I also happen to believe that the analytical data has to live in a separate data store from the OLTP data that drives other parts of the site. But what do you do when you need some of the OLTP data for analytics? This is more or less inevitable, and it can require some annoying real-time ETLing. Tinybird made this a bit easier with their Postgres integration. Worth checking out if that’s your stack.