Introduction to Cisco AI Cluster Networking Design with Paresh Gupta
Networking Field Day 39
Paresh Gupta, a principal engineer at Cisco focusing on AI infrastructure, began by outlining the diverse landscape of AI adoption, which ranges from hyperscalers with hundreds of thousands of GPUs to enterprises just starting with a few hundred. He categorized these environments by scale (scale-up within a server, scale-out across servers, and scale-across between data centers) and by use case, such as foundational model training versus fine-tuning or inferencing. Gupta emphasized that the solutions for these segments must differ: the massive R&D budgets and custom software of a hyperscaler are not available to an enterprise, which needs a simpler, more turnkey solution.
Gupta then deconstructed the modern AI cluster, starting with the immense computational power of GPU servers, which can now generate 6.4 terabits of line-rate traffic per server. He detailed the multiple, distinct networks required, highlighting a recent shift in best practices: the front-end network and the storage network are now often converged. This change is driven by cost savings and the realization that front-end traffic is typically low, making it practical to share the high-bandwidth 400-gig fabric. This converged network is distinct from the inter-GPU backend network, which is dedicated solely to GPU-to-GPU communication for distributed jobs, as well as a separate management network and potentially a backend storage network for specific high-performance storage platforms.
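To make that layout concrete, here is a minimal sketch in Python, with all hostnames, fabric names, and link speeds hypothetical rather than taken from Cisco's design, modeling the distinct networks described above: a converged front-end/storage fabric, a dedicated inter-GPU backend, a management network, and an optional backend storage fabric.

```python
from dataclasses import dataclass, field

@dataclass
class Fabric:
    name: str
    purpose: str
    link_speed_gbps: int

@dataclass
class GpuNode:
    hostname: str
    fabrics: list[Fabric] = field(default_factory=list)

# Hypothetical node layout reflecting the design described above:
# front-end and storage traffic converged onto one 400G fabric,
# a dedicated inter-GPU backend, and a separate management network.
node = GpuNode(
    hostname="gpu-node-01",
    fabrics=[
        Fabric("converged-frontend", "user/API traffic plus general storage", 400),
        Fabric("inter-gpu-backend", "GPU-to-GPU communication for distributed jobs", 400),
        Fabric("management", "out-of-band provisioning and telemetry", 1),
        # Present only for storage platforms that require their own fabric.
        Fabric("backend-storage", "dedicated high-performance storage", 400),
    ],
)

for fabric in node.fabrics:
    print(f"{fabric.name}: {fabric.purpose} ({fabric.link_speed_gbps}G)")
```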
Finally, Gupta presented a simplified, end-to-end traffic flow to illustrate the complete operational picture. A user request does not just hit a GPU; it first traverses a standard data center fabric, interacts with applications and centralized services like identity and billing, and only then reaches the AI cluster's front-end network. From there, the GPU node may access high-performance storage, standard storage for logs, or organization-wide data. If the job is distributed, traffic also flows over the inter-GPU backend network. This complete flow, he explained, is crucial for understanding that solving AI networking challenges requires innovations at every point of entry and exit, not just in the inter-GPU backend.
Presented by Paresh Gupta, Principal Technical Marketing Engineer. Recorded live at Networking Field Day 39 in Silicon Valley on November 6, 2025. Watch the entire presentation at https://techfieldday.com/appearance/cisco-presents-at-networking-field-day-39/ or visit https://techfieldday.com/event/nfd39/ or https://Cisco.com for more information.