Magic's Miracle, and why RAG's warded against it.
The attention mechanism's time complexity has dropped, but RAG isn't going anywhere.
Magic recently announced a breakthrough in reducing the computational complexity of LLMs’ response generation. They’ve made it possible to work with context windows as long as 100 million tokens, the equivalent of 650 novels. This is a significant leap with far-reaching implications. We’re looking at faster inference times and lower costs, sometimes dropping by several orders of magnitude, especially for ultra-long context windows.
With advancements like these, combined with Groq-like hardware (think of what ASICs did for bitcoin mining), the cost of inference could drop to a point where "human in a box"-level intelligence becomes cheap. The timing is perfect to build inference-hungry, multi-modal applications that will thrive on such advances.
But let’s not forget: RAG (Retrieval-Augmented Generation) isn’t going anywhere. Let’s compare the cost of LLM generation before and after Magic’s breakthrough.
Please keep in mind that when I talk about the cost of LLM generation, I’m referring to time, dollar cost, and compute (i.e., FLOPs). These are all largely proportional to one another and can be used interchangeably here.
Pre-Magic Era: The Case for RAG
Let’s break it down. Suppose you have a knowledge base of length m and instructions of length n. Typically, m ≫ n. Now, if we could process LLM inputs that are infinitely long, the time complexity of LLM generation would be O(m² + n²), since self-attention over a prompt of m + n tokens costs O((m + n)²).
Enter RAG. With a RAG-like solution, complexity drops dramatically to O(log m + n²). Here, lookup in a vector database is O(log m), and assuming retrieval yields a constant number of chunks, each of constant length, the generation step only has to attend over O(n) tokens. This is such a radical drop in complexity that it’s hard to imagine anyone forgoing this tool, especially if the benchmark numbers on correctness remain largely unaffected.
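To make the comparison concrete, here is a minimal Python sketch of the cost model above. Everything in it (the function names full_context_cost and rag_cost, the chunk count k, the chunk_tokens size, and the unit costs) is an illustrative assumption for comparing growth rates, not a measurement of any real LLM or vector database.

```python
import math

# Illustrative cost model only: quadratic attention over the whole prompt
# versus a logarithmic index lookup plus attention over a constant-size
# retrieved context. Constants are placeholders, not benchmarks.

def full_context_cost(m: int, n: int) -> float:
    """Quadratic self-attention over the whole prompt: O((m + n)^2) = O(m^2 + n^2)."""
    return float(m + n) ** 2

def rag_cost(m: int, n: int, k: int = 5, chunk_tokens: int = 512) -> float:
    """Tree-style vector index lookup ~ O(log m), then quadratic attention over
    k constant-size retrieved chunks plus the n-token instruction: O(log m + n^2)."""
    retrieval = math.log2(max(m, 2))               # index lookup
    generation = float(k * chunk_tokens + n) ** 2  # attention over retrieved context
    return retrieval + generation

if __name__ == "__main__":
    m = 100_000_000  # knowledge base: ~100M tokens (roughly 650 novels)
    n = 1_000        # instruction length
    print(f"full context: {full_context_cost(m, n):.3e} units")
    print(f"RAG:          {rag_cost(m, n):.3e} units")
```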
Post-Magic Era: Why RAG Still Makes Sense
Now, imagine the cost of LLM generation dropping to O(m + n) due to Magic’s breakthrough. Even in this scenario, RAG-like solutions further improve the time complexity to O(log m + n). Given that m is still much greater than n, this reduction in computational complexity remains highly attractive.
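Plugging the same illustrative numbers into a linear-cost version of the sketch shows why this still matters. Again, this is a back-of-the-envelope assumption of perfectly linear generation cost; the names and unit costs are made up for comparison and say nothing about Magic’s actual system.

```python
import math

# Same caveats as the previous sketch: hypothetical unit costs chosen only to
# compare growth rates once generation becomes linear in prompt length.

def linear_full_context_cost(m: int, n: int) -> float:
    """Post-breakthrough full-context generation: O(m + n)."""
    return float(m + n)

def linear_rag_cost(m: int, n: int, k: int = 5, chunk_tokens: int = 512) -> float:
    """Vector index lookup ~ O(log m), then linear generation over a
    constant-size retrieved context plus the instruction: O(log m + n)."""
    return math.log2(max(m, 2)) + k * chunk_tokens + n

if __name__ == "__main__":
    m, n = 100_000_000, 1_000
    print(f"full context (linear): {linear_full_context_cost(m, n):,.0f} units")
    print(f"RAG (linear):          {linear_rag_cost(m, n):,.0f} units")
    # Even with linear generation, stuffing the whole 100M-token knowledge base
    # costs ~100 million units per request, while RAG stays in the low thousands
    # because m only enters through log m.
```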
TL;DR
The next generation of human-computer interfaces—think voice bots, video chats, or humanoids capable of natural communication—will demand sub-second latency and ultra-low costs to deliver a seamless user experience. Many scenarios will also require searching through web-scale databases to find solutions. To achieve this, RAG-like solutions will remain essential to making such technology both feasible and marketable in consumer applications.