Groq CEO on "chip architecture": GPUs are "heavy-duty trucks," while LPUs are "last-mile delivery"

wallstreetcn ·  Jun 12 17:15

Jonathan Ross believes that GPUs are better suited to the highly parallel, latency-insensitive prefill phase, whereas LPUs, thanks to their SRAM architecture and static scheduling, deliver better speed and cost efficiency in low-latency, small-batch decoding (for example, in Mixture-of-Experts models). A hybrid deployment of both architectures can balance speed and cost. Under the Jevons Paradox, demand rises as computing power becomes cheaper, so GPUs and LPUs will jointly expand the market.

Groq founder and CEO Jonathan Ross compared NVIDIA GPUs to "18-wheelers" and positioned Groq's own LPUs (Language Processing Units) as "last-mile delivery vans," arguing that combining the two achieves the optimal balance of cost and speed in large language model inference.

In a recent interview, Jonathan Ross elaborated on this architectural division of labor: the prefill phase—reading input text—is highly parallel and insensitive to per-token latency, making it well-suited for exclusive handling by GPUs; the decode phase, however, can be flexibly configured based on users’ sensitivity to speed versus cost, ranging from GPU-only, GPU-plus-LPU hybrid, to LPU-only deployments. He noted that thanks to its all-on-chip SRAM architecture and static scheduling mechanism, the LPU offers significant advantages in low-latency, small-batch decoding scenarios and is particularly well-suited for today’s mainstream Mixture-of-Experts (MoE) models.

Against the backdrop of the rapid rise of agentic AI applications, task decomposition patterns involving multiple AI models calling each other are driving compute demand to expand exponentially rather than linearly. Citing Jevons Paradox, Jonathan Ross pointed out that declining unit costs of compute capacity do not shrink market size; instead, they continuously stimulate growth in total demand—meaning the market opportunities for GPUs and LPUs are fundamentally co-expanding, not zero-sum.

This also helps explain the strategic rationale behind Groq and NVIDIA’s recently announced $20 billion cooperation agreement: in inference workloads, the two companies’ products play complementary roles, and co-deployment yields better results than using either solution alone.

LPUs and GPUs: Complementary Positions Along the Pareto Frontier

Jonathan Ross noted that the per-token cost curves of GPUs and LPUs have markedly different shapes, indicating that the two are not in direct competition but instead serve distinct performance regimes.

“If you’re solely optimizing for the lowest per-token cost, you can just use GPUs with very large batch sizes—it’ll be slower, but cheaper,” he said. “The LPU’s advantage lies in its ability to scale across multiple chips and rely entirely on high-speed SRAM instead of external memory, significantly boosting token generation speed without substantially increasing costs.”

He stated that at the high-speed end of the Pareto frontier, LPUs offer better economics than GPUs; combining the two makes it possible to reach the optimal per-token cost and maximum compute capacity at any target speed.

LPUs are especially well-suited for Mixture-of-Experts (MoE) models. Jonathan Ross explained that GPUs require batch sizes in the hundreds to achieve cost efficiency when reading data from DRAM, whereas LPUs can operate efficiently with batch sizes as small as around 10—resulting in lower queuing latency and higher execution efficiency. “LPUs are almost tailor-made for expert models.”
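The batch-size trade-off described above can be sketched with a toy model. It assumes a decode step takes as long as the slower of two things: streaming the weights once (shared by the whole batch) or doing the per-sequence arithmetic. The device parameters, the shared price per second, and the function names are all hypothetical; none of them are measurements of a real GPU or LPU.

```python
# Toy model of one decode step; every number here is a made-up illustration,
# not a measurement of any real GPU or LPU.

def decode_step_seconds(batch_size, weight_read_s, compute_s_per_seq):
    """A step takes as long as the slower of: streaming the weights once
    (shared by the whole batch) or doing the per-sequence math."""
    return max(weight_read_s, batch_size * compute_s_per_seq)

def cost_per_token(batch_size, weight_read_s, compute_s_per_seq, dollars_per_s):
    """Each step emits one token per sequence, so the step's cost is split
    across batch_size tokens."""
    step = decode_step_seconds(batch_size, weight_read_s, compute_s_per_seq)
    return dollars_per_s * step / batch_size

def tokens_per_second_per_user(batch_size, weight_read_s, compute_s_per_seq):
    """Speed seen by a single user: one token per decode step."""
    return 1.0 / decode_step_seconds(batch_size, weight_read_s, compute_s_per_seq)

# Hypothetical device whose weights sit in slower external memory.
slow_mem = dict(weight_read_s=0.020, compute_s_per_seq=0.0001)
# Hypothetical device whose weights sit in fast on-chip memory.
fast_mem = dict(weight_read_s=0.001, compute_s_per_seq=0.0001)

for name, dev in (("slow_mem", slow_mem), ("fast_mem", fast_mem)):
    for batch in (1, 10, 200):
        c = cost_per_token(batch, dollars_per_s=1.0, **dev)
        s = tokens_per_second_per_user(batch, **dev)
        print(f"{name} batch={batch:>3} cost/token={c:.5f} tok/s/user={s:.0f}")
```

With these invented parameters, the fast-memory device reaches its lowest cost per token at a batch of about 10, while the slow-memory device needs a batch of about 200 to reach the same figure. That outcome follows from the numbers chosen here and is only meant to mirror the shape of the argument.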

Static Scheduling and MoE: The Inference Advantage of Deterministic Architectures

Another core differentiator of Groq lies in static scheduling—the order of operations is predetermined at compile time rather than dynamically allocated at runtime.

Jonathan Ross used a calendar scheduling analogy: short meetings must be precisely booked, while long meetings allow more flexibility. "In inference scenarios, you're performing ultra-low-latency, small-batch computations. You must pre-schedule all operations so each computation segment completes quickly and releases hardware resources promptly. This isn’t as critical during training, but it’s absolutely essential for inference."

He also clarified that static scheduling does not mean incompatibility with dynamic routing. In MoE (Mixture-of-Experts) architectures, the LPU’s time slots are fixed, but 'who you meet with'—i.e., which expert’s weights are activated—can vary at runtime through 'scatter and gather' capabilities to enable flexible routing.

Collaboration with NVIDIA: Prefill on GPU, Decoding Based on Use Case

Following a $20 billion strategic partnership with NVIDIA, Jonathan Ross outlined their specific division of labor within the inference pipeline.

"The prefill stage—where input text is read—should run entirely on the GPU, as this phase is highly parallelizable and GPUs excel at it," he said. For the decoding stage, deployment is tiered based on user needs: cost-sensitive users decode entirely on the GPU; paying professional users adopt a GPU-plus-LPU hybrid approach; and extreme-performance scenarios may consider pure LPU decoding.

He anticipates that the market will increasingly see hybrid deployments combining LPUs and GPUs, rather than standalone sales of Groq chips. "Combining the two is like pairing an 18-wheeler truck with delivery vans—you can build a more efficient network."

Jevons Paradox: Cheaper Compute Power Drives Greater Demand

On the long-term trajectory of the AI compute market, Jonathan Ross invoked the 19th-century economic concept known as the 'Jevons Paradox,' arguing that declining per-unit compute costs will not suppress total demand but instead stimulate even greater demand.

"The Jevons Paradox originated from a treatise on coal: whenever steam engine efficiency improved, total coal consumption actually increased," he said. "When the cost of an activity falls, previously unprofitable activities become viable, and people are willing to run more experiments. As AI becomes increasingly affordable, demand for AI will only continue to grow."

He also pointed out that the agent architecture will further amplify this effect. AI breaks tasks down into parallel subtasks, enabling multiple agents to work simultaneously, and employs a multi-layer nested model where AI calls upon other AI systems—this will cause computational demand to expand exponentially. 'AI using AI, which in turn uses more AI, leads to an exponential explosion in usage.'

Jonathan Ross concluded that a 'success disaster' is inevitable—the more computing power Groq and NVIDIA provide to the market, the more the market demands.

Below is the full transcript of the interview:

Host: Jonathan, we’re actually both alumni of Google. When I was at Google, there was a running joke in my team—if you ran out of your daily quota for training models on TPUs, you might as well just take the rest of the day off. I know you were one of the pioneers behind TPUs and later left Google to start your own chip company. What did you observe at Google that inspired you to build something different?

Jonathan: There simply wasn’t enough computing power. At the time, the speech recognition team trained a model that surpassed human performance on transcription tasks—it was the first time they had achieved that. The problem was, they couldn’t deploy it at scale. They ended up limiting its deployment to Nexus phones—you probably remember those; they were older Android devices.

Host: Yes, I used one.

Jonathan: They restricted it to Nexus not so much as a feature choice, but because there was so little compute available that it could only support the user base of Nexus devices. Around that time, I happened to be in New York having lunch with the speech recognition team, and they mentioned this issue. So I started working on it as a 20% project—porting their model onto FPGAs and designing a general-purpose architecture. We quickly realized there was urgent demand on the inference side, which eventually evolved into developing a dedicated chip. Later, Jeff Dean did an analysis and said, 'Given the amount of money and compute we’re going to invest in this, we might as well just build an ASIC.' My initial reaction was: how hard could it be? It turned out to be extremely difficult—but we didn’t know that at the time, so we just jumped in.

Host: I’ve heard you use the term 'success disaster' before, and I think it’s incredibly apt—I’ve experienced that phenomenon several times myself at Google.

LPU vs. GPU: The Pareto Curve and Cost per Token

Host: NVIDIA GPUs perform exceptionally well in training, but face memory bottlenecks during inference. What changes has Groq made to its memory architecture to address this issue?

Jonathan: First, you need to be clear about trade-offs—there’s no free lunch. What you’re optimizing for is the lowest cost per token, because cost determines your compute capacity. Everyone is competing on this metric—if I spend the same amount of money but get only half the capacity, what really matters to me is how many tokens I can get per dollar.

Of course, you also need speed. The trade-off is this: if you pursue only the lowest cost per token, you use GPUs with very large batch sizes, which results in slower speed. What we do with LPUs is scale across multiple chips without relying on any external memory—distributing the model across these chips to leverage much faster SRAM, thereby accelerating token generation without increasing cost.

If you’re familiar with the Pareto frontier, the curves for GPUs and LPUs have quite different shapes. In certain regions of the curve, GPUs offer better economics; in others—particularly at the higher-speed end—LPUs are more economical. Combining the two fills in the gap in between. Together, pure GPU, hybrid GPU+LPU, and pure LPU configurations enable optimal cost per token and maximum compute capacity at any desired speed.
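For readers less familiar with the term, the sketch below shows one way a Pareto frontier over speed and cost could be computed. The configuration labels and all numbers are invented for illustration, and `pareto_frontier` and `cheapest_at_speed` are hypothetical helper names, not anything from Groq or NVIDIA.

```python
# Hypothetical operating points: (label, tokens/s per user, dollars per
# million tokens). All values are invented for illustration only.
points = [
    ("gpu_large_batch", 30, 0.5),
    ("gpu_small_batch", 80, 2.0),
    ("hybrid", 300, 2.5),
    ("lpu_only", 800, 6.0),
    ("dominated_example", 70, 3.0),
]

def pareto_frontier(points):
    """Keep points for which no other point is at least as fast AND at least
    as cheap, with a strict improvement in one of the two."""
    frontier = []
    for name, speed, cost in points:
        dominated = any(
            s >= speed and c <= cost and (s > speed or c < cost)
            for _, s, c in points
        )
        if not dominated:
            frontier.append((name, speed, cost))
    return sorted(frontier, key=lambda p: p[1])

def cheapest_at_speed(points, target_speed):
    """Cheapest configuration that meets a target speed, or None."""
    ok = [p for p in points if p[1] >= target_speed]
    return min(ok, key=lambda p: p[2]) if ok else None

print([name for name, _, _ in pareto_frontier(points)])
print(cheapest_at_speed(points, 200))
```

The second helper reflects the "optimal cost per token at any desired speed" framing: pick a target speed first, then take the cheapest configuration that still meets it.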

Static Scheduling and Mixture-of-Experts Models

Host: Another differentiator for Groq is static scheduling—the order of operations is predetermined at compile time. What advantages does this offer for large language model inference?

Jonathan: I’ll use a calendar analogy. If I have a series of 15-minute meetings, I must schedule them precisely in advance because the other party needs to show up exactly on time. But if it’s a five-hour meeting, precision matters far less—you just start talking when they arrive, and even a 30-minute delay is negligible within a five-hour window.

In inference scenarios, you’re performing ultra-low-latency, small-batch computations, so you need to pre-schedule all operations to ensure each computation segment completes quickly and releases hardware resources promptly for the next step—otherwise, all subsequent work ends up waiting. This isn’t as critical during training, but it’s absolutely essential for inference.
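The idea of fixing the order and timing of operations before anything runs can be illustrated with a small scheduler. This is a minimal sketch, assuming each operation has a known cycle count and runs on a named functional unit; the operation names, units, and cycle counts are invented, and this is not Groq's compiler.

```python
# Toy "compile-time" scheduler. Op names, units, and cycle counts are invented.

def compile_schedule(ops):
    """ops: list of (name, unit, cycles, deps) in dependency order.
    Returns {name: (start_cycle, end_cycle)}, fixed before anything runs."""
    unit_free_at = {}   # next free cycle for each functional unit
    schedule = {}
    for name, unit, cycles, deps in ops:
        # An op can start once its inputs are done and its unit is free.
        ready = max((schedule[d][1] for d in deps), default=0)
        start = max(ready, unit_free_at.get(unit, 0))
        schedule[name] = (start, start + cycles)
        unit_free_at[unit] = start + cycles
    return schedule

ops = [
    ("load_x",  "mem",    2, []),
    ("matmul1", "matrix", 4, ["load_x"]),
    ("load_w2", "mem",    2, []),
    ("matmul2", "matrix", 4, ["matmul1", "load_w2"]),
]
schedule = compile_schedule(ops)
total_cycles = max(end for _, end in schedule.values())

for name, (start, end) in schedule.items():
    print(f"{name}: cycles {start}..{end}")
print("total cycles:", total_cycles)
```

Because every start cycle is known in advance, the total latency is known in advance too, which is the calendar analogy in miniature: each "meeting" has a reserved slot.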

Host: Today’s state-of-the-art large language models mostly adopt mixture-of-experts architectures, where each query may activate a different subset of experts during inference. How does this work on a chip that uses static scheduling?

Jonathan: The key lies in what is statically scheduled. On an LPU, I’ve already reserved that 15-minute slot, but who I meet with can vary. The LPU has 'scatter and gather' capabilities, meaning that depending on which expert needs to be activated, we fetch different expert weights. The runtime remains the same—we simply switch to a different expert. If experts differ in size, we can even route to another chip; although this introduces brief bubbles in the pipeline, the determinism gives you stronger predictability in timing without restricting what you can run.

Moreover, the LPU architecture is especially well-suited for mixture-of-experts models, because smaller batch sizes are preferred—and mixture-of-experts models are inherently disadvantaged by batch size requirements: when reading data from DRAM, you typically need very large batches (possibly hundreds) to achieve cost efficiency. On an LPU, however, you can operate effectively with batch sizes as small as around 10, meaning you don’t have to wait for many queries to accumulate before execution—thus reducing latency and improving efficiency. The LPU is almost tailor-made for expert models.
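The point about a fixed time slot with a runtime choice of expert can be sketched in a few lines. The example assumes every expert has the same shape, so the work per slot is identical whichever expert is picked. The router rule, the weights, and the names `route` and `moe_slot` are hypothetical and chosen only to keep the example small.

```python
# Toy mixture-of-experts step in pure Python. Shapes and weights are invented.

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Every expert has the same shape, so the amount of work per slot is the same
# no matter which expert the router picks.
experts = [
    [[1, 0], [0, 1]],    # expert 0: identity
    [[2, 0], [0, 2]],    # expert 1: doubles the input
    [[0, 1], [1, 0]],    # expert 2: swaps the two entries
]

def route(token_vec):
    """Hypothetical router: chooses an expert index at runtime."""
    return int(sum(token_vec)) % len(experts)

def moe_slot(token_vec):
    expert_id = route(token_vec)     # decided at runtime
    weights = experts[expert_id]     # "gather" the chosen expert's weights
    return expert_id, matvec(weights, token_vec)   # same-shaped op every time

print(moe_slot([1, 2]))
print(moe_slot([1, 3]))
print(moe_slot([2, 3]))
```

The schedule only needs to know that one same-sized matrix-vector product happens in the slot; which weights are fetched is data, not control flow.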

Autoregressive and Diffusion Models

Host: Speaking of architecture, when Transformers are replaced by a next-generation architecture, would the LPU need to be completely redesigned, or is it orthogonal to the current form of large language models?

Jonathan: That’s a classic question. The LPU was designed before the paper 'Attention Is All You Need' was published. The attention mechanism shares many similarities with existing architectures at the time—such as convolution—even though they are quite different; ultimately, they all boil down to linear algebra. If you build an optimal chip for linear algebra, you’ve built an optimal chip for most of these architectures.

You can choose to optimize for specific matrix multiplication sizes, which may differ across architectures. I’ve seen some people attempt extreme specialization, but flexibility almost always wins out in the end. Consider this analogy: if I told you I could make your runtime ten times faster, but at the cost that you could never change your model again, would you accept? Probably not—because the algorithm itself might improve by tenfold. Recently, there was even a breakthrough that changed how attention works and reduced its scale by a factor of ten. Algorithmic improvements are happening very quickly, and flexibility often matters more than optimization alone.

The LPU architecture was specifically designed with ease of programming in mind, enabling rapid adoption of new architectures and allowing the latest algorithms to be deployed and run quickly.

Host: The 'L' in LPU stands for 'Language'. Does this mean vision and audio models cannot benefit from the same acceleration?

Jonathan: One of the largest user groups on Groq Cloud today consists of speech-to-text users. We’ve also worked on text-to-speech for a while, precisely because these tasks are extremely sensitive to real-time performance. Many speech models still incorporate components like convolutional layers, which is exactly where a general-purpose architecture adds value—without it, these speech tasks simply couldn’t run on the hardware.

Even more interestingly, higher speed can actually improve quality, which is somewhat counterintuitive. In audio processing, you can split audio into very small segments for processing, but if you only hear a tiny snippet at a time, you lack full context, making word prediction much harder. With slower chips, to meet real-time constraints, you’re forced to use even smaller segments, which increases error rates—similar to having two people transcribe a speech simultaneously, but each can only listen to five seconds at a time, drastically raising the error rate. The LPU can perform speech transcription at hundreds of times real-time speed, enabling it to process much larger segments and thereby reduce error rates in these models.

Host: The application scenarios we’ve discussed, language reasoning and audio, are mostly autoregressive; yet many current vision models are diffusion-based, and some large language models are now adopting diffusion architectures. Diffusion-based LLMs run significantly faster than autoregressive LLMs on GPUs. Does this performance ranking still hold on Groq chips?

Jonathan: Diffusion models benefit from total computational throughput. Let me first explain what autoregressive means—simply put, autoregressive generation predicts the first token, then the next, much like playing chess: you decide on your next move based on the current state, rather than planning all moves in advance. In language, to determine what the 100th word is, you usually need to know what the 99th word is first.
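The sequential dependency described here can be shown with a toy generation loop. The "model" below is a stand-in lookup table, not a real language model, and `NEXT_TOKEN` and `generate` are hypothetical names.

```python
# Toy autoregressive generator. The "model" is a stand-in lookup table.

NEXT_TOKEN = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<end>"}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        nxt = NEXT_TOKEN[tokens[-1]]   # step t needs the token from step t-1
        if nxt == "<end>":
            break
        tokens.append(nxt)
    return tokens[1:]

print(generate())
```

Each iteration reads the previous iteration's output, so the loop cannot be run in parallel across positions; that is the chess-move dependency in code form.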

Of course, you can apply some decomposition strategies: certain tokens are more important than others, so you predict the important ones first and then fill in the surrounding tokens.

I’ve seen many people trying to use diffusion models for language generation, but the results aren’t great. The reason is that it’s hard to determine what to say in one part before you’ve decided what to say in another. This is similar to the audio slicing issue mentioned earlier—imagine 100 people writing a speech simultaneously, each unable to see what the others are writing. Diffusion is called 'diffusion' because information spreads out over time and space, with influence weakening as distance increases.

From a quality perspective: if you generate music using autoregressive versus diffusion models, the autoregressive version will have more soul and depth—you’ll like it better—but it might include one or two minor artifacts. A pure diffusion-generated piece, by contrast, will sound like the cleanest elevator music you’ve ever heard, utterly devoid of soul. However, if you combine the two—using autoregressive methods with contextual awareness for key musical moments and diffusion to fill in the rest—the result is dramatically different.

Just as we combine LPUs and GPUs for decoding in large language models, I believe the ultimately successful version of diffusion-based large language models will likely also integrate both autoregressive and diffusion approaches.

Groq and NVIDIA Vera Rubin Collaboration

Host: At NVIDIA’s GTC conference in March this year, the company unveiled the Vera Rubin supercomputer, specifically designed for inference—particularly in agent-based scenarios. How do GPUs and Groq work together during inference?

Jonathan: Let me give you an analogy. Suppose you’re building a logistics network for the entire United States from scratch. You can choose between 18-wheelers or delivery vans. Delivery vans can access any road but carry less cargo and cost more per unit. The optimal solution is to use both.

In this analogy, GPUs are the 18-wheelers—they can process large batches of tokens at once, but loading and transporting takes some time. LPUs are more like delivery vans—they’re less efficient overall but far more effective for the 'last mile' than that massive vehicle. Just as discussed earlier regarding mixture-of-experts models, LPUs hold advantages in certain components. Combining both is like integrating 18-wheelers and delivery vans—you build a better network.

Large language model inference consists of two distinct components: the weight projection layers and the attention layers. Our approach places the projection layers on the LPU and the attention layers on the GPU, leveraging the strengths of each.

Host: Following the NVIDIA partnership agreement, should we expect Groq chips to continue being sold independently, or will we see more hybrid LPU-plus-GPU configurations?

Jonathan: I think you’ll see more hybrid configurations. For the prefill phase—that is, the stage where input text is read—we still recommend running it entirely on the GPU, as GPUs excel at this task. Moreover, this phase is less sensitive to per-token latency and highly parallelizable, making it ideal for the GPU—the 18-wheeler in our analogy.

The decoding phase depends on the scenario: for cost-sensitive applications, such as free-tier users, decoding may be performed entirely on GPUs; for paying professional users who demand higher speed, a combination of GPU and LPU is most likely used; and for tasks requiring extreme performance, decoding might even be done exclusively on LPUs. Overall, any data center configuration follows this pattern: prefilling is entirely handled by GPUs, while decoding is partially executed on LPUs and partially on GPUs.
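The tiering described in this answer amounts to a small routing table. The sketch below is only an illustration of that idea: the tier names, the backend labels, and `plan_request` are hypothetical and do not correspond to any real scheduler or API.

```python
# Hypothetical routing table; tier names and backend labels are illustrative.
DECODE_BACKEND_BY_TIER = {
    "free": "gpu",        # cost-sensitive: decode entirely on GPUs
    "pro": "gpu+lpu",     # paying users who want more speed: hybrid decode
    "extreme": "lpu",     # extreme-performance tasks: LPU-only decode
}

def plan_request(tier):
    """Prefill always goes to the GPU; the decode backend depends on the tier."""
    if tier not in DECODE_BACKEND_BY_TIER:
        raise ValueError(f"unknown tier: {tier!r}")
    return {"prefill": "gpu", "decode": DECODE_BACKEND_BY_TIER[tier]}

for tier in ("free", "pro", "extreme"):
    print(tier, plan_request(tier))
```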

Agent-Based Inference and Economies of Scale

Host: The Vera Rubin supercomputer is primarily designed for agent-based inference scenarios. Over the past year, agent applications have risen rapidly—how has this changed the unit economics and costs of scaled inference?

Jonathan: First, I think most people don’t truly understand what an “agent” is—they’re just using the term as a buzzword. Let me clarify this properly, because it’s critically important.

An agent is somewhat like NVIDIA in the AI world—its core capability lies in decomposing tasks into parallel subtasks. CPUs are sequential, whereas GPUs are parallel. If you tackle a task alone, you can only do one thing at a time and often get stuck waiting, resulting in low efficiency. But if you can break the task apart, multiple people can work on it simultaneously. AI faces a similar bottleneck—as we discussed earlier, you can’t generate the 100th token before generating the 99th. However, if you can decompose the problem into subtasks without such dependencies, multiple agents and multiple context windows can operate concurrently. For most problems, this is feasible.

There’s another dimension to this: AI using AI. Just as you might use AI to help prepare questions for this interview, AI can also pose queries to another AI, have it process them in the background, and then integrate the results into its own response. When tasks are delegated to AI, which in turn distributes them to other AIs—AI using AI, which uses yet more AI—it leads to exponential growth in usage. Moreover, answer quality often improves as the number of parallel subtasks increases, much like how larger teams enable more cross-verification, resulting in more substantiated final answers.
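The "AI using AI" growth can be made concrete with a simple count. The sketch assumes, purely for illustration, that every model call delegates to the same number of sub-calls and that delegation stops at a fixed depth; real agent systems are not this regular, and `total_model_calls` is a hypothetical helper.

```python
def total_model_calls(fan_out, depth):
    """One root call; each call at one level spawns fan_out calls at the next.
    Total = 1 + fan_out + fan_out**2 + ... + fan_out**depth."""
    return sum(fan_out ** level for level in range(depth + 1))

for depth in range(5):
    print(f"depth={depth} calls={total_model_calls(3, depth)}")
```

With a fan-out of 3, the total goes 1, 4, 13, 40, 121 as depth increases from 0 to 4, so each extra layer of delegation roughly triples the number of calls.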

Can AI Replace CUDA Kernel Engineers?

Host: Writing CUDA kernels by hand is extremely difficult. Do you think AI is already capable of writing them autonomously?

Jonathan: I think it may already be ‘good enough,’ but this isn’t a black-and-white issue. What does ‘good enough’ actually mean? It’s not a binary choice between ‘writing a kernel’ or ‘not writing a kernel.’ The real questions are: how good is the kernel? How efficient is it? What’s its performance? How easily does it integrate with other kernels? How general-purpose is it? How reusable is it? As AI capabilities continue to improve, kernel quality will keep rising—and the more time you invest in refining a specific kernel, the better it becomes.

Interestingly, the Groq architecture—the LPU—is actually a kernel-free architecture. When we originally designed it, large language models weren’t available to help us write software, so we had to rely entirely on ourselves, and our team was small. Consequently, we built a chip with very low compiler complexity. Just as AI generates increasingly better kernels over time, the easier the hardware is for AI to understand, the better the kernels it can produce. We’re already using AI to program the LPU, and it works very well—because this problem is easy for large language models to ‘mentally simulate.’

Host: AI has lowered the barrier to writing software, and from what you’ve said, we’re starting to see a similar trend in hardware. Will we see more people entering hardware development as a result of this lower barrier?

Jonathan: Absolutely. You’ll see more people attempting to design hardware. But I think there’s an issue—hardware is physical and requires experimentation. Software development offers immediate feedback and enables rapid iteration; hardware involves supply chains and high-stakes bets. You’ll see many people designing chips because it’s becoming easy to do so, but taking a chip to mass production is extremely difficult. This will become a 'baby turtle problem'—global manufacturing capacity is limited, and customers placing bets will choose only those they know are reliable.

Large language models are making it easier to write software and RTL (register-transfer level code, the hardware description from which chip logic is built), so more people will try their hand at it. However, fewer may actually reach mass production because the trade-offs are so hard; customers only want to bet on companies they can rely on.

Host: That’s actually very similar to the software world—it’s easy to build a prototype in your bedroom, but bringing it to market and ensuring reliability is far more challenging.

Jonathan: There’s a key difference. With software, if you find a bug, you can issue a patch. If a chip has an error, you first need four to six months to respin the chip. Chips are physical objects: they undergo 60 to 70 layers of chemical deposition during manufacturing, with each layer potentially taking a day or longer. From the moment you complete tape-out (i.e., release the final design for mask making) to receiving a physical unit for testing, there’s a fixed physical timeline. The mask itself costs tens of millions of dollars; if you get it wrong, you lose tens of millions. But even worse than that is having to tell a customer, 'Sorry, you’ll have to wait another six months for the product while I make corrections.' On top of that, the way supply chains operate requires you to commit to wafer purchases upfront, and if you can’t deliver chips by then, the consequences are extremely severe.

So I don’t think we’ll see a scenario where 'everyone is just throwing chips out there.' Instead, we’ll see many smaller players designing chips, but only a few will succeed—because the stakes are so high, and customers will only choose those they can trust, especially as costs continue to rise.

Host: Is AI making hardware design easier in ways that non-experts might not expect?

Jonathan: There’s a very interesting phenomenon. We’ve noticed that in the past, hardware engineers never wrote software themselves—they’d always go find a software engineer whenever they needed code. But now they’re saying, 'I’ll just implement a small software test myself to see if this design makes sense.' Then they get immediate feedback and realize, 'Oh, this is harder to use than I thought.'

Hardware and software development used to be strictly separate domains. Although the two fields share many similarities, they speak different languages and have subtly different mindsets—when designing chips, you must account for physical constraints like wires and logic gates, which historically made hardware engineers wary of writing software, just as software engineers were hesitant about hardware. Now, however, a hardware engineer can simply ask a large language model to generate a piece of software to run on their hardware. If it doesn’t run smoothly, they immediately recognize what needs improvement. AI has made this kind of cross-disciplinary self-service a reality. Previously, these disciplines had clear boundaries; now those lines are blurring, and practitioners are reaching into adjacent fields to accomplish tasks themselves.

Host: This mirrors the shift we’ve seen between software engineers and designers—software engineers no longer need to wait for finalized design mockups to start building, and many designers are now using code-based tools to directly bring their ideas to life.

Jonathan: Yes, and if there’s a disagreement between software engineers and hardware engineers, they can now directly implement their ideas to prove each other wrong.

Jevons Paradox: Cheaper Compute Power Drives Greater Demand

Host: We started with Google's 'success disaster.' So, what kind of 'success disasters' would you hope to see from Groq and NVIDIA in the future?

Jonathan: This brings us to Jevons Paradox—the demand for computing power is infinite. As long as civilization has unresolved problems, we’ll need more computing power. Cancer hasn’t been cured yet, people still age, and we still don’t have enough computing power—that’s three ready-made problems right there. As long as these issues persist, we must keep moving forward.

This means we need smarter AI and more computing power to run more AIs in parallel, solving more problems simultaneously. As we continue advancing, the cost per unit of intelligence will decline, triggering Jevons Paradox—lower costs lead people to spend more.

The origin of Jevons Paradox lies in a 19th-century treatise on coal: the author observed that whenever steam engine efficiency improved, total coal consumption actually increased. The reason is that when the cost of an activity decreases, previously unprofitable activities become viable, prompting people to engage in them more—running more experiments and trying more things. Similarly, as AI becomes cheaper, demand for AI will keep rising until people are spending more and more on AI, requiring ever-greater computing power.

Here’s another analogy: pumping twice as much oil from underground doesn’t automatically mean twice as many people gain access to transportation, because you still need cars. But once you’ve trained a model, providing twice the computing power instantly enables twice as many people to use it and solve twice as many problems. Every time you build an AI factory, you can immediately do more, which motivates people to aim even higher and continuously drive down costs. Thus, Jevons Paradox keeps operating, and 'success disasters' are inevitable: the more computing power we provide to the world, the more people will want.
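The Jevons argument in this answer can be illustrated with a constant-elasticity demand curve. This model is an assumption added here, not something from the interview: quantity demanded is `k * price ** (-elasticity)`, and total spending rises as the price falls only when the elasticity is greater than 1. The function names and parameter values are hypothetical.

```python
def demand(price, k=1.0, elasticity=1.5):
    """Constant-elasticity demand curve: quantity = k * price ** (-elasticity)."""
    return k * price ** (-elasticity)

def total_spend(price, k=1.0, elasticity=1.5):
    """Total spending at a given unit price."""
    return price * demand(price, k, elasticity)

for price in (1.0, 0.5, 0.25):
    elastic = total_spend(price, elasticity=1.5)
    inelastic = total_spend(price, elasticity=0.5)
    print(f"price={price:<5} spend(e=1.5)={elastic:.3f} spend(e=0.5)={inelastic:.3f}")
```

In the elastic case, halving the price raises total spending by about 41 percent; in the inelastic case it lowers it. The claim that demand for compute behaves like the elastic case is the speaker's thesis, which this sketch does not test.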

What Skills Should Be Cultivated in the AI Era?

Host: Finally, do you have any message for this audience of technically savvy and curious listeners?

Jonathan: Many people ask me what their children should study. My answer is simple. Our current education system is rooted in the information age mindset—we teach kids to answer questions and provide answers. But with AI, this flips entirely: the key skill becomes asking the right questions. If you can formulate the right question, AI will find the answer for you.

So my strongest advice to all listeners is this: start learning how to ask better questions. Teach your children how to ask better questions. The education system needs to be restructured around the skill of questioning.

If your students can easily solve your questions by simply inputting them into an AI, then you are not teaching them how to succeed in the future. But if you give them a challenge that requires them to formulate their own questions, you are truly preparing them for what lies ahead.

Host: That makes a lot of sense. I took a break from my life in research and entrepreneurship because I found immense joy in directly conversing with AI—asking questions and learning new things. The way I produced this video was by using AI to learn hardware while building it—I could ask those 'why not do it this way?' questions that would never appear in academic papers. Thank you so much for joining us today; this has truly been a delightful conversation.

Jonathan: Thank you for having me.

Editor/Lee

The translation is provided by third-party software.
