The CEO of Google Cloud disclosed that the eighth-generation TPU will be split into two lines: training (v8T) and inference (v8i), leveraging over a decade of in-house chip development to build a structural competitive advantage. Amidst a global shortage of computing power, Google's self-developed chips have significantly reduced the unit costs of model training and inference, with external demand far exceeding supply. Kurian warned that players without self-developed chips and unable to offset training costs through inference face the risk of running out of funds.
As global AI labs remain mired in a 'computing power crunch,' Google-A (GOOGL.US) is building a structural moat, grounded in more than a decade of in-house chip expertise, that competitors will find hard to replicate.
Google Cloud CEO Thomas Kurian recently stated in an exclusive interview that the upcoming eighth-generation TPU will be split into two independent product lines: the v8T focused on large-scale training and the v8i optimized for inference, with each training system capable of accommodating two petabytes of memory. The external demand for TPUs from AI labs has already "far exceeded our capacity to meet," which he cited as the most direct evidence of cost competitiveness: "If our costs were much higher, they wouldn't come to us for TPUs."
The chip advantage is rapidly translating into business growth. Kurian revealed that token processing volume for the enterprise version of Gemini surged from 10 billion per minute in January this year to 16 billion per minute, with enterprise users growing 40% quarter-over-quarter. He also issued a warning to the industry: in a market where computing power remains constrained, players without self-developed chips will face increasingly expensive unit economics, and business models that rely on venture capital and cannot cover training costs with inference revenue will eventually run out of money — "This gap is widening, and the sources of funding you can tap into will become fewer and fewer."
Kurian characterized this advantage as a long-term barrier spanning the next decade. Responding to questions about Google simultaneously serving competitors such as Anthropic, he invoked the logic of a platform company: providing underlying computing power to rivals while competing at the model layer is not contradictory. Moreover, precisely because TPUs serve both internal and external demand, Google secures more favorable terms in supply chain negotiations, further deepening this structural moat.

The compounding effect of eleven years of accumulation: TPUs evolve from specialized AI chips to general-purpose computing power
Kurian traced Google's current computing power advantage back to the in-house TPU program initiated more than eleven years ago. He noted that Google foresaw the AI wave many years in advance and prepared across multiple dimensions — diversifying energy sources, reserving land, and transforming its data center construction model — to ensure it would not be constrained by physical resources.
At the data center construction level, Google has shifted from traditional construction methods to a factory prefabrication model, enabling larger-scale pre-assembly and pre-testing, thereby significantly shortening deployment cycles. Kurian stated that the cumulative effect of these decisions has created a compounding impact across all layers of the technology stack — from TensorFlow to JAX, to XLA and Pathways. The complete programming stack Google built around TPUs is one of the core sources of current system efficiency.
Notably, TPU application scenarios have begun to extend beyond AI. Kurian mentioned that Citadel, a hedge fund, has publicly discussed how TPUs are being used for algorithmic trading in capital markets, and clients from the U.S. Department of Energy and high-performance computing fields are adopting this solution. The rationale lies in the fact that algorithmic trading previously relied on numerical computation, constrained by the slowing of Moore’s Law, whereas shifting to inference-based computation delivers significant performance improvements. Some top financial institutions have requested TPUs to be deployed in client-owned data centers near exchanges, and Google is exploring this new business model.
The eighth-generation TPU splits into two product lines for inference and training to meet the demands of the intelligent agent era
Kurian revealed that the upcoming eighth-generation lineup includes three products: the v8T for large-scale training, the v8i optimized primarily for inference, and Ironwood for mixed use. The v8i can operate without liquid cooling, making it easier to deploy in more locations to manage inference latency.
In terms of technical specifications, Kurian said a single v8T training system can accommodate two petabytes of memory, equivalent to roughly 100 times the entire digitized content of the Library of Congress. The v8 features 9,600 interconnected chips, while the v8i has 1,152, all operating on a unified optical torus network with extremely low, predictable latency and exceptionally high memory-to-chip data throughput.
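As a rough sanity check on these figures, the implied per-chip memory can be computed directly. The sketch below is illustrative arithmetic only; actual per-chip HBM capacity is not stated in the article:

```python
# Back-of-envelope arithmetic from the quoted figures (illustrative only).
total_memory_bytes = 2 * 10**15   # "two petabytes" of memory per v8T training system
v8t_chips = 9_600                 # chips per v8 training pod, as quoted
v8i_chips = 1_152                 # chips per v8i pod, as quoted

per_chip_gb = total_memory_bytes / v8t_chips / 10**9
print(f"Implied memory per v8T chip: ~{per_chip_gb:.0f} GB")              # ~208 GB
print(f"v8T pod is {v8t_chips / v8i_chips:.1f}x larger than a v8i pod")   # ~8.3x
```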
Google uses 'goodput' (effective throughput) as its core metric. Kurian stated that Google predicted several years ago that energy supply would become strained, thus focusing on optimizing the number of tokens produced per watt of computational power. This decision has now become a key reason why many customers choose TPUs. He explicitly stated that Google is fully confident in serving the world's largest models with TPUs, and its disaggregated serving technology stack achieves the highest utilization efficiency among all model providers.
Regarding industry discussions about the slowdown in pre-training expansion, Kurian provided a clear response: 'We do not see this slowdown at the chip design, system design, or production capacity levels.'
The Era of Agents Reshapes Computational Architecture: Storage Bottlenecks Become the Next Key Constraint
Within Kurian’s framework, AI applications are undergoing three evolutionary stages: the first stage centered on search and question-answering, the second characterized by multimodal content generation, and the third focused on agents autonomously completing complex tasks. He pointed out that the rise of agents is fundamentally altering the optimization direction of chip and system designs.
Agent tasks may run continuously for 6 to 12 hours, posing new requirements for KV cache design. Controlling memory residency costs will directly determine the economics of inference services. Meanwhile, inference scenarios require distributed deployment across numerous locations, which contrasts sharply with training that can be concentrated in a few hyperscale sites. The v8i's support for air-cooled operation is a direct response to this need.
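To see why memory residency dominates the economics of long-running agents, here is a minimal KV-cache sizing sketch; all model dimensions and token rates below are assumptions for illustration, not figures from the article:

```python
# Rough KV-cache sizing for a long-running agent session (assumed model dimensions).
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for the separate key and value tensors stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

hours = 12
context_tokens = hours * 3600 * 10   # assume the agent accumulates ~10 tokens/s of context
gb = kv_cache_bytes(context_tokens) / 1e9
print(f"{context_tokens:,} tokens -> ~{gb:.0f} GB of KV cache kept resident")  # ~142 GB
```

Keeping that much state resident for hours, rather than repeatedly evicting and recomputing it, is what the article means by memory residency cost.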
At the storage level, Google is set to launch two new solutions: a managed Lustre solution for large-scale training with a throughput of 10 terabytes per second, and an ultra-low-latency 'Rapid Storage' for inference scenarios with a throughput of 15 terabytes per second, mountable near the inference chips. Additionally, Google will introduce a new network architecture, Virgo, providing ultra-low-latency high-speed interconnects within hyperscale clusters.
Kurian noted that the next major bottleneck in agent adoption will emerge on the consumer side—enabling virtual machines to activate and deactivate on demand while efficiently handling local storage read/write operations will be the core engineering challenge in reducing agent usage costs and achieving mass adoption.
Business Model Under Platform Logic: Supplying Computing Power to Competitors Does Not Undermine Own Competitiveness
In response to questions about the apparent contradiction of providing TPU computing power to Anthropic while competing directly at the model level, Kurian attributed it to the inherent logic of a platform company. He stated that Google's different business units simultaneously compete and cooperate with market participants, and that Apple's model contract with Google is another embodiment of the same logic.
Regarding how to balance internal computing power demand with external supply, Kurian noted that allocation decisions are jointly discussed and made by the management team led by Sundar Pichai. He emphasized, "Having our own chips and needs is far better than not having our own chips." Since Google does not rely on external chip procurement, regardless of allocation, it can still generate profits based on its own intellectual property, which fundamentally differs from a business model that purely resells others' IP.
When asked about NVIDIA's total-cost-of-ownership claims, Kurian responded with customer feedback: "Many of our customers say that our total cost of ownership is the lowest." He reiterated that demand for TPUs from external AI labs has already exceeded Google's capacity to supply them, which he considers the most direct proof of cost competitiveness.
Cybersecurity emerges as a new battleground in the AI arms race, with Google introducing a three-layered response system
Kurian expressed serious concern about the cybersecurity risks posed by AI models. He pointed out that no matter how tightly the spread of closed-source models is restricted, open-source models will inevitably fall into adversaries' hands and continue to improve over time. The core question is therefore: what proportion of the vulnerability-detection capabilities that led Anthropic to deem Mythos too dangerous to release for now can also be replicated with open-source models?
Google's response strategy has three layers: first, using Gemini to speed up vulnerability detection and launching new models that assist with code repair, since vulnerabilities are now discovered faster than humans can fix them; second, introducing "continuous red teaming" agents — one agent continuously launches attack tests, another prioritizes vulnerabilities, and a third assists in completing repairs; third, following the integration of Wiz, incorporating continuous monitoring into the cloud security system, forming a closed loop from discovery to repair and deployment.
Kurian also refuted the assertion that "AI will replace software engineers." He argued that, given the surge in security vulnerabilities brought about by improvements in model capabilities, this is precisely when large numbers of software engineers are needed to collaborate with models. The industry tends to overcorrect with predictions like "no one will be needed anymore," but reality often proves otherwise. Google adheres to the peer code review system and is exploring the introduction of 'supervisory models' to examine AI-generated code in different ways, addressing the cognitive blind spots arising from AI generating and reviewing code simultaneously.

Below is the full transcript of the interview:
Host: Alright, Thomas, thank you for coming today to be interviewed. We are at the Google Cloud campus, and I greatly appreciate you taking the time.
Thomas Kurian: Thank you for the invitation.
Host: I’m really looking forward to this conversation—I have many questions to ask you.
Thomas Kurian: Sure, go ahead.
Host: The first question I've been pondering recently is about TPU capacity. When you look at leading labs like Anthropic and OpenAI, they constantly mention being constrained by computing power. But for Google, you have the full technology stack, your own chips, and you're not only serving your own inference needs but also providing training services, selling inference services, allowing some competitors to build products on your chips, and even directly selling the chips themselves. How do you manage to maintain such abundant capacity while other cutting-edge labs always seem to fall short?
Thomas Kurian: Consider how much of our global resources we’ve monetized — in some cases, we charge for both computing power and inference requests; in others, we provide computing power to run third-party models, but the underlying chips are ours. Part of this goes back to long-term planning that we did years ago. When we foresaw this wave of AI coming, we took steps across multiple dimensions to ensure we wouldn’t be constrained by physical resources.
We diversified our energy sources, secured land in advance to build data centers, and changed the way we construct data centers — moving from traditional construction methods to more factory prefabrication because factory production is always faster than on-site construction. We also shortened the machine deployment cycle. All of these efforts have helped us significantly in terms of capacity.
At the chip level, we have maintained a partnership with NVIDIA, but at the same time, we’ve been committed to developing our own chips for what I believe is either the eleventh or twelfth year now. Our eighth-generation TPU will be officially announced at our upcoming launch event.
Host: Yes, we'll get to that in a moment.
Thomas Kurian: We've accumulated significant experience in this area over time, delivering this advantage generation after generation. What's particularly interesting now is that we’re seeing demand not just from AI labs but also from other industries. For example, Citadel has publicly discussed how they use our TPUs in capital markets; clients from the U.S. Department of Energy and the high-performance computing field are also talking about it. So we’re seeing TPUs becoming more versatile, moving beyond AI algorithms to become part of a broader infrastructure.
Host: When you face such a large opportunity with TPUs and need to allocate computing resources among different uses, how do you compare and balance them? If you’re willing to share specific figures, that would be great, but even a rough comparison would help — like selling TPUs directly, allowing companies like Anthropic or OpenAI to run inference on your infrastructure, or serving your own Gemini model. How do these models compare?
Thomas Kurian: We strike a balance in our investments across these areas, and regardless of the approach, we achieve good profitability because we own our intellectual property. We’re not merely redistributing someone else’s IP. I think this has helped us, as evidenced by our continued growth in revenue and operating margins.
We’ve also expanded TPUs into new scenarios, such as capital markets. We’ve observed an intriguing phenomenon: algorithmic trading used to rely heavily on numerical computation, which was primarily executed on traditional computing power, constrained by Moore’s Law, where performance gains between generations were slowing down. Many top institutions have realized that shifting to inference-based computation can deliver substantial performance improvements — rather than using numerical methods, transitioning to inference allows them to benefit from the rapid advancements in inference performance. As these institutions come on board, they want our machines deployed closer to exchanges, such as within their own data centers. So we’ve started placing TPUs in some key clients' facilities, which represents a slightly different business model.
From a macro perspective, I believe diversification itself drives product improvement because you receive demand feedback from various sources. The diversification of commercial channels also helps us achieve growth. For instance, when we negotiate with supply chain vendors, the fact that we use these chips not only to meet our internal needs but also to serve the market leads them to recognize that Google’s demand represents a much larger overall pool, enabling us to secure more favorable contract terms.
Host: I’d like to dwell on this point a bit longer. If the demand for computing power is infinite, even from just an R&D perspective, why not keep all the computing power for yourselves? To be more direct, if AGI truly is the ultimate goal pursued by all AI labs, and whoever reaches it first and deploys it at scale wins, wouldn’t it make the most sense to allocate all production capacity to your own models? Where am I going wrong in this logic?
Thomas Kurian: You need to generate revenue to sustain all of this. While Google does make a lot of money, you must continuously generate cash flow, and this is another lever for us to produce sufficient cash flow. The allocation of computing power to external parties always involves balancing our own needs against capital requirements. And you know, regardless of which lab you are, venture capital cannot indefinitely support you. As computing costs continue to rise, if you’re operating at a loss—losing money and earning insufficient revenue from services like inference to cover training costs—the gap will widen, and your available funding sources will become increasingly scarce.
Host: I’ve been emphasizing how uniquely positioned Google is—with its cash-generating businesses, chips, and models. Has your Gemini team ever come to you and said, “We don’t have enough”? I know I’m fixated on this point, but the fact that other companies can’t keep up just astounds me.
Thomas Kurian: There will always be demand for such resources, and I believe that over the next decade, demand will consistently exceed supply. If you have your own chips, you’re in a great position. If you don’t, you’re merely reselling someone else’s products. In an environment where capacity is constrained, your unit economics will become increasingly expensive. In our case, since we control the chips, our unit economics remain attractive. Therefore, having self-developed chips will be one of our core competitive advantages.
Host: What if you consider your entire TPU computing pool and overall computational infrastructure as one big pie? Can you discuss roughly what percentage is allocated to training, inference, selling TPUs, and providing inference services to other labs?
Thomas Kurian: Broadly speaking, we don’t disclose detailed figures, so I won’t break it down item by item. But generally, from a macro perspective, Google Cloud accounts for about half of Alphabet’s total capital expenditure, and it continues to grow because its growth rate far exceeds that of other businesses, as you are aware. So that’s a rough breakdown. On our side, a significant proportion of our growth comes from Gemini and our own models, which you can take as a rough reference.
Host: Alright. You mentioned data centers and data center construction earlier. Could you explain the difference between what you referred to as 'construction' versus 'factory manufacturing' at the data center level?
Thomas Kurian: Simply put, it’s about the fundamental unit you use when deploying capacity. For example, you can assemble racks one by one within a data center, or you can deploy entire rows at once. The larger the granularity with which you deploy, the more pre-assembly and pre-testing you can complete at a centralized location, leading to faster deployment speeds.
Host: When planning the deployment of new data centers—I assume you’re more aware of this than anyone else—the American public currently holds a rather negative view of data centers, with approval ratings around 20%, I believe. How do you perceive this issue? And how should the entire AI industry work to change public perception of artificial intelligence and the deployment of data centers—especially since data center deployment actually gives the U.S. a strategic advantage? Personally, I’m quite optimistic about AI. What’s your take on this?
Thomas Kurian: People's concerns about data centers mainly focus on several aspects. First, will data centers drive up energy prices in my state or county? Second, can the communities where data centers are located obtain sufficient employment opportunities?
In response to these issues, we are doing a few things. First, we are investing in 'behind the meter' technology, meaning we do not draw power from the grid, but we interconnect with the grid where the state government is willing, so that when the grid is short, our energy can be fed back into it. Second, we are investing in alternative energy, because we believe the traditional 'generation plus distribution' model is not the only way for energy supply to come to market. So one of the questions we are studying is: can the energy demand created by AI foster new ways of distributing energy, reducing the cost per unit of energy and serving a broader market?
Third, we place great emphasis on the PUE (Power Usage Effectiveness) metric, which measures how efficiently every unit of electricity we draw is turned into computing. Simply put, if you need 100 megawatts of computing power, the fewer additional megawatts the facility consumes on top of that, the less energy you waste. We are the most efficient in the world on this front, and it involves thousands of optimization details around thermodynamic exchange and heat dissipation.
Finally, we are deeply committed to the communities where we operate. To prevent local communities from feeling that Google concentrates all resources in one massive location, we distribute our data centers across many places so that no state feels we have become a heavy burden on their resources. We have an excellent track record in this regard. I have visited many of our data centers, and when you delve into the local economic environment, see the children in local schools, and meet the employees who operate our data centers—who are extremely important to us—and witness the economic development we bring to those remote communities, you realize this is part of our responsibility.
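As an aside on the PUE metric mentioned above: it is a simple ratio of total facility power to the power that reaches the computing equipment, as in this sketch with assumed numbers:

```python
# PUE (Power Usage Effectiveness) = total facility power / power delivered to IT equipment.
# 1.0 would mean zero overhead; the numbers below are assumptions for illustration.
it_power_mw = 100.0   # "if you need 100 megawatts of computing power"
overhead_mw = 10.0    # cooling, power conversion, lighting, etc. (assumed)

pue = (it_power_mw + overhead_mw) / it_power_mw
print(f"PUE = {pue:.2f}")  # 1.10 -> 10% of the energy never reaches the chips
```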
Host: That's good. But what about broader societal perception, beyond the local communities you enter — where you create jobs, invest money, and don't directly drive up electricity prices, all of which is excellent — how do you genuinely change the broader American public's perception of artificial intelligence?
Thomas Kurian: This will be a process. I believe the key lies in identifying application scenarios where technology can truly benefit society, rather than triggering fears about job displacement. Let me give a few examples.
At our launch event, you will see a company called Signal, which doesn’t often publicize itself—they are Germany’s largest health insurance company. They have deployed AI agents built on Gemini Enterprise at scale to assist their teams. Interestingly, when we first started collaborating, there was significant internal anxiety about potential layoffs. However, they didn’t lay off a single person, and they found that accuracy and speed in answering customer questions such as “Am I eligible for reimbursement for this treatment?” improved dramatically—issues that previously took 23 minutes to resolve now take only seconds. Thus, this has enhanced both efficiency and customer service quality without affecting any jobs.
We also collaborate with the American Society of Clinical Oncology—the industry organization representing 51,000 oncologists nationwide. They wanted an AI application to help doctors consult standard treatment guidelines during consultations. For instance, when a patient visits with breast cancer, what is the standard treatment plan? But she also has diabetes, and if it is this type of diabetes, chemotherapy cannot be administered—these rules are extremely complex and often overlap. They hoped AI could provide accurate answers, which must be 100% reliable without generating hallucinations. We helped them achieve this, enabling doctors to better care for patients, and the feedback from their members has been very encouraging.
There are many such examples. We often say that one of the most important applications is creating a 'wealth advisor.' Consider the situation of ordinary citizens: If you are a high-net-worth individual, you can go to a private bank and receive professional wealth management advice; but if you are an ordinary person without those financial resources, you may not get high-quality financial advice at all. Citigroup is developing a wealth advisor app, which they will showcase at the event. This app will leverage Gemini’s reasoning and task management capabilities to provide users with financial advice and assist them in executing investment operations when needed.
These are examples whose value society will recognize. It will take time to rebalance the narrative from "AI will cause mass unemployment" toward hearing this side of the story, and that is part of the journey we, as a society, are collectively undertaking.
Host: I do agree. I believe that the issue of job displacement is a major concern, particularly for the average American. Let me directly ask you — regarding your organization, Google Cloud, now that artificial intelligence is significantly enhancing the productivity of your engineers and employees in other departments while increasing automation, are you hiring, laying off, or maintaining stability? What stage are you currently in?
Thomas Kurian: We are increasing headcount in both product and sales. We have recruited a large number of people for our market expansion teams and are also hiring extensively for deployment engineers. In areas where we are developing new products, we are expanding our capabilities as well.
Let me give an example that people usually don't see — a long time ago, we anticipated two things: first, that models would become increasingly proficient at understanding code; and second, that as models learn to use computers to perform tasks, they would excel at many things. One consequence of understanding code, however, is that models can also identify vulnerabilities in it, which triggers significant anxiety about cybersecurity. We'll talk more about this topic later.
A long time ago, we decided to do three things: first, enhance vulnerability detection capabilities through Gemini, which is already being used by a large number of clients; second, create a model capable of fixing code — because if you can quickly identify vulnerabilities, human intervention often cannot keep up with the repair speed, so can the model assist you in making repairs? We are about to launch a new feature addressing this point. Additionally, following our acquisition of Wiz, you will see new capabilities combining Wiz's offerings, with continuous detection at its core.
Some call it 'continuous red teaming.' We will showcase three different types of agents: the first agent continuously attacks you, ensuring that vulnerabilities are fixed promptly and preventing surprises — something previously unachievable; the second agent prioritizes identified issues, helping you determine which vulnerabilities need immediate attention; the third agent assists you in completing the repair work.
Host: I am glad to hear that you are still hiring — efficiency is improving, and you are expanding recruitment. However, there are indeed companies out there taking a different approach. Block is a typical case; Jack Dorsey wrote a blog post stating that Block cut nearly half of its workforce, citing AI as one of the reasons. What do you think is the difference between Google's approach of 'increasing efficiency while continuously expanding recruitment' and Block's method of 'restructuring the company to achieve better results with half the workforce?'
Thomas Kurian: Each company faces different demands for its products and services, and every CEO makes their own judgment. What we see is strong market demand, so we choose to continue investing.
Host: Let's talk about NVIDIA. Jensen Huang recently appeared on Taresh’s podcast, where he discussed how NVIDIA and its architecture offer the lowest total cost of ownership per token, thanks to CUDA, NVLink networking, and various toolchains that provide better token economics. Do you agree with this assessment? Do you think Google is the most competitive in terms of total cost of ownership? If not, how does Google plan to catch up?
Thomas Kurian: Many of our customers say we have the lowest total cost of ownership.
Host: Well, I guess that’s the answer, right?
Thomas Kurian: Yes. The reality is, if you’re an AI lab, you’ll choose the best platform. It’s not just Google's internal teams using it—other AI labs' demand for our TPUs far exceeds what we can currently supply. I would simply say: if our costs were significantly higher, they wouldn’t be coming to us for TPUs.
Host: Is speed one of the core advantages of TPUs? I’ve noticed the Gemini series models are extremely fast, and as a speed enthusiast, I really appreciate that. Generally speaking, dedicated ASIC chips tend to be much faster than general-purpose GPUs. Is this a major selling point for AI labs or your customers, or do they always prioritize quality above all else?
Thomas Kurian: Quality. Quality comes first. But I think it's a combination of three core elements — because it's not just about the chip itself, but the entire system. Take TPU v8 as an example, which has 9,600 chips; v8i has 1,152 chips, all connected via a single optical torus network. This provides extremely high bandwidth and highly predictable low latency between all chips within the entire Pod. That allows us to retrieve data from memory for processing and write it back to memory with extremely high efficiency. For instance, a single v8T training system can house two petabytes of memory — approximately 100 times the digitized content of the Library of Congress.
With extremely low network latency, data throughput from memory to chip is also exceptionally fast. Third, above the hardware layer, on the programming-stack side, Google has developed and contributed a wealth of tools to the industry — for instance JAX, XLA for compiler optimization, significant work on PyTorch, and Pathways, all technologies built by Google. When you put it all together, even when looking at inference and vision-language models, we have deeply optimized many of these layers. It's this entire technology stack that makes the TPU system so efficient and powerful.
We measure this through a metric called 'goodput'—which measures the actual effective throughput you receive. A few years ago, we also made a decision: foreseeing that energy supply would become constrained, we focused on optimizing the cost-performance ratio per watt, i.e., how many tokens can be generated per watt. This is one of the key reasons many people choose us today.
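Both metrics Kurian cites here reduce to simple ratios — tokens per watt over time is tokens per joule of energy, and goodput is the fraction of produced work that is actually useful. A minimal sketch with assumed numbers (none of the figures below come from the interview):

```python
# Illustrative calculation of tokens-per-watt and "goodput"; all numbers are assumed.
interval_s = 60
avg_power_watts = 2_000           # average system power over the interval (assumed)
tokens_generated = 600_000        # total tokens produced in the interval (assumed)
useful_tokens = 570_000           # tokens that reached users, excluding failed/retried work (assumed)

tokens_per_joule = tokens_generated / (avg_power_watts * interval_s)
goodput = useful_tokens / tokens_generated
print(f"tokens per joule: {tokens_per_joule:.1f}, goodput: {goodput:.0%}")  # 5.0, 95%
```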
Host: You mentioned earlier that the development of TPUs has been ongoing for 11 years. In the tech industry, 11 years is quite a long time. It’s truly remarkable to see such a long-term decision bear such fruitful results in recent years. So, how much adjustment does your planning undergo as the market changes? Are decisions made years ago still being firmly implemented, or do you need to constantly adjust your direction?
Thomas Kurian: The historical experience we’ve accumulated across all layers of the technology stack compounds over time, creating a compounding effect. When we worked on TensorFlow, we realized that training required a large-scale distributed programming model, so we developed JAX. JAX was iterated based on our experience with TensorFlow and the growing demand for new distributed training models. So, many things accumulate over time—we learn from past practices and continuously improve.
Meanwhile, we are also extremely attentive to the market and listening to our customers' feedback. For instance, someone asked us: why did we specifically develop the v8i inference chip? The reason lies in a pattern we observed — no matter how financially robust a company is, if it cannot generate revenue through inference, it will be unable to sustain the costs of training. You must at least ensure that inference revenue can offset the cost of training, rather than perpetually relying on venture capital for funding. Therefore, we anticipated an explosive growth in the demand for inference, clarified the optimization direction required for inference, and in fact, the market demand for the v8i inference chip far exceeded our initial expectations.
Host: Let’s talk about the eighth-generation chip. This is the first time you have split the chips into two distinct series — one focused on inference and the other on pre-training. First, could you confirm whether Ironwood is primarily designed for inference?
Thomas Kurian: Ironwood serves multiple purposes and is used for both training and inference. I believe there is a strong temporal pattern in how people use inference — during the day users are awake and ask numerous questions, but at night people go to sleep, so many inference tasks run on Spot instances during that period; similarly, post-training fine-tuning is also often done overnight on Spot instances. So Ironwood is a general-purpose chip. v8T is aimed mainly at training, though some are considering it for inference. v8i is intended primarily for inference, but for smaller models some also use it for training.
Host: Based on your decision to split these two chips, where do you think computational workloads are heading? What trends are you observing now? Where will the primary workloads be concentrated over the next five years?
Thomas Kurian: This trend is reflected not only in chip design itself but also in the work we have done with Gemini. If you look at Gemini, we roughly observe three stages of model evolution:
In the first stage, users ask questions to the model, which then provides answers. There may be multiple rounds of iterative dialogue, but overall, it resembles a search-based chatbot experience. Our enterprise version of Gemini provides search and Q&A capabilities, augmented by a 'deep research' feature for in-depth analysis.
In the second stage, people previously relied heavily on diffusion models to generate content such as images, audio, and video. Multimodal input had always been present, but starting with Gemini 2.5 Nano, multimodal output became a native capability of the main model. We see creative firms like WPP, as well as various consumer goods companies, using our enterprise AI platform to create content, leading to diverse content creation scenarios.
Then, models become increasingly powerful in handling various levels of abstraction in real-world applications. By 'abstraction,' we mean that in enterprise scenarios, models need to integrate with a variety of systems — for example, connecting to CRM systems to answer customer-related queries or accessing supply chain and planning systems. The ultimate abstraction is conceptualizing the entire world as a computer — because if you can converse with a computer, the computer can interact with everything, as all software essentially represents an abstract form of communication between computers and the external environment.
Host: Do you consider the idea that 'models can control computers and use browsers' as the ultimate form of abstraction? And it's not just about 'I can talk to a computer,' but also about understanding the information returned by the computer and responding appropriately — do you understand what I mean?
Thomas Kurian: Yes, this is precisely the origin of the concept of an 'agent.' An agent is a module to which you can delegate tasks. The agent describes its skill set, knows how to operate a range of tools, including computers, and can perform tasks on your behalf. This enables Xfinity to use our technology to coordinate and manage their entire customer service system, Walmart to apply us across various scenarios from supply chain planning to scheduling, Bosch to utilize us in manufacturing, and Merck to explore how we can assist in research — from drug discovery all the way to delivering medications to patients, automating the entire process. This represents the next evolutionary stage.
To some extent, we are engaged in 'co-design' – as model capabilities advance, we are able to continuously expand the boundaries of tasks that can be automated.
Host: Bringing this back to the decision to split chips – separating inference and training, what is the intrinsic relationship between the two?
Thomas Kurian: Returning to the first phase, which is the question-and-answer search phase: the number of input tokens far exceeds output tokens because you give the model a long and complex question, and it returns a relatively simple answer.
In the content generation phase, you only need to provide a simple prompt, such as 'generate a video showing my dog wearing a Superman cape while driving,' and then the model needs a significant amount of time to generate a large volume of output tokens. This results in a markedly different token composition ratio – multimodality is a major variable, and the volume of output tokens also increases significantly.
By the agent stage, the impact on chip design manifests across three or four different dimensions. First is the issue of memory residency: the tasks you assign to an agent may need to run for six, seven, or even twelve hours, and you do not want to frequently swap content in and out of memory because that would incur high token computation costs. Therefore, the design of the KV cache needs to be reconsidered – this is a typical example.
So when people ask how these experiences have influenced the direction of our chip development — we not only partner with Intel, we also have our own ARM-based chips, which we developed because we recognized the demand for general-purpose computing that these tools create. When an inference agent has to execute many, many different steps, there are questions about how you keep and pin objects in memory so the model can run extremely efficiently, because that greatly reduces inference cost. We have done extensive internal work on how the chips keep data resident in memory. Additionally — to give a more intuitive example — people want to deploy inference in many locations to manage latency, unlike training, which can be concentrated in a few large-scale sites.
So a practical example: the v8i can operate without liquid cooling, which lets you deploy it in more locations, since air cooling remains the primary heat-dissipation method in most data centers. There is considerable thought behind these decisions. I've just given three simple examples to illustrate.
Host: Yes, I think the agent aspect is indeed fascinating because it truly changes how these tokens are used in practice. NVIDIA talks extensively about extreme co-design, and Google appears to be pursuing extreme co-design at every level.
Thomas Kurian: Yes.
Host: Let’s first discuss the use cases for agents, especially when you need to perform substantial disk read/write operations, where many aspects require optimization. On the TPU technology stack, what have you recently optimized? Based on the growth in agent usage, where do you see the next major bottleneck?
Thomas Kurian: We have been reviewing the entire system. Here are a few examples: next week, we will launch two entirely new storage solutions.
The first is our managed Lustre solution, where we have increased its throughput to 10 terabytes per second. It is specifically designed for large-scale training. You can interconnect it with a hyperscale cluster because you possess large datasets, and now you can read data from a large Lustre cluster into a large training cluster, achieving highly efficient scaling.
The second is a completely new ultra-low latency inference storage system that we are introducing, called 'Rapid Storage.' The concept is this: you can centrally store information needed for inference in cloud storage but mount it close to where the inference chips operate — think of it as a forward-proxy mechanism. Data retrieval from your inference processors to the Rapid Storage system is extremely fast, reaching speeds of 15 terabytes per second, enabling ultra-low latency.
At the same time, all of these need to be optimized over a unified network backbone. Therefore, we are launching a new type of network architecture called Virgo, which provides ultra-low latency high-speed interconnection within hyperscale clusters. Moreover, there are many other layers of work we are undertaking through co-design, all aimed at preparing for the arrival of intelligent agents. The core objective is to provide people with the best-performing, highest-quality intelligent agent runtime environment at the most efficient cost structure.
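At the quoted throughputs, bulk data movement becomes very fast; a quick illustrative calculation (the dataset and checkpoint sizes below are assumed, not from the interview):

```python
# Time to stream data at the quoted storage throughputs; sizes are assumed for illustration.
def read_time_s(size_tb: float, throughput_tb_per_s: float) -> float:
    return size_tb / throughput_tb_per_s

dataset_tb = 500        # assumed training-data shard
checkpoint_tb = 10      # assumed model checkpoint
print(f"Managed Lustre @ 10 TB/s: {read_time_s(dataset_tb, 10):.0f} s for {dataset_tb} TB")   # 50 s
print(f"Rapid Storage  @ 15 TB/s: {read_time_s(checkpoint_tb, 15):.2f} s for {checkpoint_tb} TB")  # 0.67 s
```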
Host: Where will the next major bottleneck emerge?
Thomas Kurian: The next major bottleneck will largely appear in the consumer usage of virtual machines. For instance, I am a home user who has built an intelligent agent to assist me with travel planning — let’s say you’re going on vacation and assign it a series of tasks, such as querying eight travel websites. These websites are exposed as tools, commonly referred to today as MCPs or APIs, allowing the agent to search across all travel sites. Suppose you want to book a trip to Europe or Southeast Asia; it calculates the total cost and informs you about your budget.
Consumers cannot afford the cost of running virtual machines permanently, as it is very expensive, as you know. Hence, people prefer to activate and deactivate virtual machines on demand once tasks are completed. Additionally, since these tools require local storage, while virtual machines can be over-provisioned, you can configure local disks to achieve ultra-efficient read and write operations. This will become a bottleneck because it directly affects the breadth of adoption for this technology. Enterprises can certainly pay for this, and the cheaper and more efficient it is, the more they can utilize it. However, if you want to extend this technology to consumers, the costs can quickly become prohibitive for them. If you aim to reach everyone, you must address these cost structure issues at the engineering level. It is precisely the capability to integrate across layers — from the intelligent agent layer to the Gemini layer, down to the storage and compute systems — that enables us to perform co-design.
Host: Thank you for sharing. I’d like to discuss Anthropic. Anthropic is a customer of Google and unique in many ways. Claude is one of Google's strongest competitors, yet at the same time, we provide the infrastructure for much of their training and inference work. How do you view this decision? I know we touched on this earlier, but I’d like to delve deeper: how do you feel about providing computational power to Anthropic’s models when they are also competing with Google? Is this similar to AWS’s approach — serving everyone without favoring any party — or is it different?
Thomas Kurian: Google is a platform company. When you are a platform company, different parts of your business compete with various players in the market, while some parts supply them and others compete against them. We are committed to being the best in the industry at the model level, and we take immense pride in our work, not just the Gemini model itself but the complete toolchain we’ve built around Gemini and our enterprise tool suite. At the same time, some customers wish to use our TPUs, and Anthropic is one example. This is simply part of being a platform company. Just as people ask us how our collaboration with Apple on model optimization is going. Apple has signed a model contract with us, as you know. So people ask: isn’t this competing with your Android platform and ecosystem? Yes, but that’s part of being a platform company.
Host: I’m still somewhat fixated on the Anthropic question because they are direct competitors at the enterprise level, unlike Apple. I’m wondering, you provide them with computational power, and at some point — although you mentioned TPU capacity is currently abundant — there may come a time when tough decisions need to be made: should the capacity go to Anthropic, or should it be reserved for Gemini? Or kept for our own research? How would you make that decision?
Thomas Kurian: We have a management team led by Sundar, and we discuss these decisions together, just like any mature company would. There are tough judgments to make every day. For instance, the demand we receive doesn’t only come from Anthropic. So even if you say there’s X amount of capacity reserved for Gemini and Y amount for everyone else, how do you allocate within that Y to Anthropic and hundreds of other labs and customers? These are complex decisions anyone must face. But one thing I can tell you: having your own chips and demands is far better than not having them.
Host: Well said. Mythos is rumored to be the first model at the trillion-parameter scale. Is Google now working on or approaching models at the ten-trillion-parameter scale? Where are you in this development cycle?
Thomas Kurian: Regarding Gemini, you will see our new developments at the Next conference and shortly thereafter. In terms of model capabilities, we are very proud of where Gemini stands. It has long been at the industry-leading level. We have a new version of Gemini coming soon, and based on all the benchmarks we’ve seen, we are equally confident about it.
Host: Hypothetically speaking, if we consider a ten-trillion-parameter model, based on your orchestration at the TPU level, is this a feasible service scale given the current state of technology?
Thomas Kurian: We have long had the capability for disaggregated serving, which allows us to scale very large dense models exceptionally well. This capability has existed for quite some time. Therefore, we don’t design models that we ourselves cannot deploy. We are fully confident that TPUs can serve the largest models in the world. Most importantly, our serving stack for disaggregated deployment achieves the highest efficiency in TPU utilization among all model providers. Hence, we are completely confident in serving the largest models, especially the largest Gemini models.
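For context, "disaggregated serving" generally refers to running the prompt-processing (prefill) and token-generation (decode) phases on separate device pools and handing the KV cache between them. The sketch below is purely conceptual and is not a description of Google's serving stack:

```python
# Conceptual sketch of disaggregated serving: prefill and decode run on separate
# device pools and exchange the KV cache. Purely illustrative, not Google's stack.
from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: int                       # length of context captured by the cache

def prefill(prompt_tokens: int) -> KVCache:
    # Compute-bound: the whole prompt is processed in parallel on the prefill pool.
    return KVCache(tokens=prompt_tokens)

def decode(cache: KVCache, max_new_tokens: int) -> list[int]:
    # Memory-bandwidth-bound: one token at a time on the decode pool,
    # extending the KV cache transferred from the prefill pool.
    generated = []
    for _ in range(max_new_tokens):
        generated.append(0)           # placeholder for a sampled token id
        cache.tokens += 1
    return generated

cache = prefill(prompt_tokens=4_096)           # runs on prefill devices
output = decode(cache, max_new_tokens=256)     # runs on decode devices
```

Separating the two phases lets each pool be sized and utilized for its own bottleneck, which is why it is often cited as a route to higher overall chip utilization.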
Host: Does this mean we are not seeing any slowdown on the scaling side of pre-training? Are you not feeling it at all? Because there was a period when the industry discussed a slowdown in pre-training, suggesting a shift in focus toward reinforcement learning and thinking time. Are you not experiencing this at all?
Thomas Kurian: We are not seeing this slowdown at the levels of chip design, system design, or production capacity.
Host: What about the underlying data? Are you seeing more efficient applications of synthetic data?
Thomas Kurian: Let me give you two or three examples of what we’ve actually observed. Historically, most of the data fed into models has been unstructured data such as text, audio, video, and documents, and the volume of this data continues to grow. However, in enterprise scenarios, many elements are relatively straightforward to handle. For example, if you ask an agent a question and request citation sources, and the answer comes from a document, it’s simple—just display a link.
But imagine you ask the model a question: 'Tell me how much inventory I need to stock to meet the demand for this product.' This requires converting the query into one for systems like SAP or a supply chain system, which dynamically queries a set of data tables. First, accurately decomposing this query into corresponding data tables, then presenting the response—where is the citation source? How do you know the answer you gave me is correct? This is a far more complex problem.
Precisely because of our work in the enterprise domain, we are able to feed more structured-data loops into Gemini's trajectory-optimization training framework, including structured data and intricate fields. For instance, have you ever seen — when discussing computer use within browsers — an enterprise application with a thousand fields, dropdown lists, and so on? No consumer-grade application comes close to that level of complexity. Our deep engagement here allows us to teach these elements to Gemini and integrate them into the training framework.
Host: Let’s continue our discussion about the training framework and agent programming. I’ve been doing a lot of programming myself recently. There was a viral post online claiming that someone had a friend at Google who basically said Google isn’t at the forefront of agent programming internally. What’s your take on this? How is agent programming adopted within Google? And again, I must mention Anthropic — their release speed is breathtaking. How is Google embracing the cutting edge of agent programming?
Thomas Kurian: Currently, a large number of our engineers are using Jet Ski, our internal programming framework, and its feedback is being continuously relayed in real time to DeepMind, forming a closed-loop for reinforcement learning that improves Gemini’s programming quality every day. Many people in my organization are already using it.
Host: One thing I’ve noticed is that I’ve become much more productive than before. I release products very quickly, and the process is enjoyable. I don’t review code line by line. In fact, I review very few lines of code. But Google can’t do that. My projects are small toy projects, while Google deals with high-risk projects, services, and products. How do you stay at the forefront of agent programming, generate massive amounts of code, and at the same time ensure quality and make sure every line of deployed code has been reviewed?
Thomas Kurian: When we talk about software engineering productivity, our perspective differs slightly from external reports. If you work at a company like Google that develops products, the reality is that there are two or three key factors. For example, code written by a senior engineer tends to be far more concise than that written by a junior engineer. Therefore, we don’t measure productivity by lines of code because typically, less skilled engineers need to write more code to accomplish the same task, whereas senior engineers write more succinctly.
Host: That’s almost a cliché statement that has persisted for years, but I think it’s more important now than ever to focus on overall delivery speed.
Thomas Kurian: Yes, what matters is how much functionality we add.
Second, Google has always had a tradition: peer reviews are required when submitting code, usually conducted by senior engineers, who often become bottlenecks. So we introduced Gemini, which people are actively using — for example, we recently integrated it into Cloud to scan for security vulnerabilities in code. The tool is not just for generating code; we also use it to inspect code, which completes a significant amount of preliminary work before senior engineers conduct the final review.
Third, in the long term, in any real software company, engineers often find themselves spending most of their inefficient time debugging issues. Thus, we built a specific version of Gemini, and one of the things we will demonstrate next week is this: the world’s most complex computer is the cloud. Compared to it, personal computers are mere toys. We have opened up the entire cloud's capabilities and tools to the model. Now we are using Gemini to troubleshoot ongoing fault incidents, which has also helped improve our efficiency and, in turn, enhanced the quality of the model itself. We approach this issue from multiple dimensions. But as productivity continues to rise and feature iterations accelerate — although lines of code are certainly not the metric, they do reflect this increased speed — we will inevitably reach a tipping point where it becomes impossible to review every single line of code.
Host: Taking this further, over time, humans will understand less and less of the actual code. Especially as you mentioned earlier, if AI is used to review and debug code — if AI is responsible for both generating and reviewing the code, are we losing core understanding of the code itself and the functions being deployed?
Thomas Kurian: This is a risk that the entire industry must manage. People often say: I’ll give you a prompt, and it can generate a piece of code; you don’t need to understand the code because understanding the prompt is sufficient. But the reality is, for a complex system, the prompt cannot explain all potential behaviors of the code. For example, how do you handle exceptions?
Every time this argument arises, it feels familiar to me. A few years ago, some claimed that we wouldn’t need as many software engineers, only for models to later reveal numerous security vulnerabilities—right at this critical juncture, we ended up needing a large number of software engineers to collaborate with the models. For instance, we are launching a new version of a model that can fix vulnerabilities, specifically security vulnerabilities, but you still need someone to use this tool and diligently oversee it. The industry sometimes overcorrects, saying 'we don’t need anyone at all,' precisely when people are most needed.
Therefore, we always maintain a long-term perspective. We continuously reflect on whether we need a 'supervisory model' to review code in different ways—this is why I emphasize that we still adhere to peer reviews of code and assist our senior engineers in using tools to complete reviews. The next question is: does this tool have sufficient self-awareness—if it generates its own code, can it identify issues within it? Because it may lack self-recognition for certain coding patterns. This is the direction in which we are exploring solutions.
Our goal has always been to build the best models and apply them at scale. In my team alone, thousands of people use it every day. If you visit the opposite campus, you’ll see individuals running six windows simultaneously—one writing code, one compiling, one deploying and testing, and another running code review tasks in the background. A large number of people are utilizing this entire toolchain, which represents part of the evolution in working methods.
Host: You mentioned cybersecurity, so let’s wrap up with that topic. Anthropic believes that its Mythos model is too advanced in terms of cybersecurity capabilities and should not be publicly released for now. From Google’s perspective, how do you view this matter? What was your initial reaction? Additionally, is there a certain red line or benchmark that, if crossed, would make you believe Gemini is no longer suitable for public release?
Thomas Kurian: We are studying where this line should be drawn. However, the core issue we face is: what proportion of the vulnerabilities discovered by Mythos could also be identified using open-source models? I mention open-source models because, regardless of precautions, even if you ensure proprietary models don’t fall into adversaries’ hands, open-source models will inevitably reach them, and they continue to evolve, becoming increasingly powerful. Sooner or later, some of these capabilities—though perhaps not all—will be detectable and exploitable.
So how should we respond? We have unique advantages because we are both a hyperscale cloud service provider and a model provider, while also having a cybersecurity team—including our Mandiant team and Wiz. We’ve implemented three concrete measures:
First, if people use models to discover vulnerabilities, then you need models to help fix those vulnerabilities—because vulnerabilities are discovered far faster than humans can repair them, so models must assist in remediation.
Second, if adversaries use models to find vulnerabilities, they will also leverage models and computers to launch large-scale attacks. Facing this threat, conducting red team tests just once a month is insufficient. Therefore, we need to introduce agents capable of continuous red team testing, as well as agents that assist in remediation—fixing code is one thing, but identifying all instances of outdated code, removing them, and redeploying newly patched code is another challenge altogether.
Third, given the vast amount of existing code, where do we begin? That’s another issue. To address this, we’ve built tools to help people identify and prioritize areas for action.
Host: Does this mean that open-source software (note, not open-source models, but open-source software) does more harm than good? If your code is open-sourced, it is fully exposed, allowing models to scan, identify vulnerabilities, and exploit them. Closed-source software doesn’t have this issue. On the other hand, open-source code tends to be fortified more quickly. What’s your take on this? Is this an argument for or against open-source?
Thomas Kurian: At Google, we make extensive use of open source and contribute significantly to it. We assist the open-source community in addressing these issues with our tools. I’m merely stating a reality: adversaries will use models, and the first thing they’ll scan are popular open-source libraries because these provide the largest attack surface. This is a problem we believe must be taken seriously and proactively fixed, and we’re working with other industry partners to advance this effort.
Host: Thomas, one last question: What keeps you up at night?
Thomas Kurian: We need to balance many things. First, do we have long-term plans for the future — data centers, network infrastructure, and a sufficient number of TPUs? Second, are we always focusing on the most critical and essential issues? Three years ago, we anticipated that as AI capabilities grew, the cybersecurity field would be profoundly impacted. When we proposed acquiring Wiz, many people asked: Why are you doing this?
Let me give another example by looking at our Gemini Enterprise platform: From January this year until now, our token processing volume has grown from 10 billion per minute to 16 billion per minute. The number of enterprise users of Gemini Enterprise Edition has increased by 40% quarter-over-quarter.
Therefore, we constantly ask ourselves: Are we solving the right problems for our customers and users? This remains our core focus. As long as we continue to actively address issues and maintain our leadership in the market — in an era where technology evolves so rapidly, when something happens, you must already have a solution ready. Our team has delivered remarkable results, and we are immensely proud of them. We look forward to the upcoming events.
Host: Thomas, thank you very much, truly appreciate it!