Cisco AI Cluster Design, Automation, and Visibility
AI Infrastructure Field Day 4
Cisco's presentation on AI Cluster Design, Automation, and Visibility, led by Meghan Kachhi and Richard Licon, aims to simplify AI infrastructure and shorten the lengthy design and troubleshooting cycles that GPU clusters involve. It focuses on three areas: improving cluster designs, automating deployments, and providing end-to-end visibility, so that organizations can protect their competitive edge. The session outlines Cisco's reference architectures, the key components for building AI clusters, and upcoming updates to its Nexus Dashboard platform, which is expected to streamline design, automation, and monitoring at scale. The underlying argument is that AI success is decided at the infrastructure layer, because network inefficiencies leave GPUs underutilized.
Cisco builds its AI networking strategy on three pillars. The first is systems: custom Silicon One platforms with programmable pipelines that can adapt quickly to evolving AI infrastructure demands, plus a partnership with NVIDIA that provides NX-OS on NVIDIA Spectrum-X silicon for full-stack reference architecture compliance. Rigorously tested transceivers and mature NX-OS software, now optimized for AI workloads, complete the system offerings. The second is the operating model: Nexus Dashboard for on-premises management and Nexus Hyperfabric for a full-stack, cloud-managed solution, complemented by an API-first approach that integrates with existing customer automation frameworks. The third is an extensive set of AI reference architectures that serve as validated blueprints, ranging from enterprise-scale deployments (under 1024 GPUs) to hyperscale cloud environments (1K-16K+ GPUs). These blueprints provide detailed component lists and a consistent networking experience across GPU vendors such as NVIDIA and AMD, as well as across storage solutions. An AI cluster is broadly defined to encompass front-end, storage, and backend GPU-to-GPU networks, with a growing trend toward converging them on high-speed Ethernet to unify operating models.
Designing an efficient AI backend network requires a non-blocking architecture that maintains a 1:1 subscription ratio, meaning uplink bandwidth matches GPU-facing bandwidth, and that keeps every GPU within one hop of the others for optimal communication. Cisco uses a "scalable unit" concept: a cluster grows incrementally by repeating validated blocks, with spine-layer connectivity adjusted to maintain high performance. For smaller deployments, such as a 32-GPU university cluster, Cisco demonstrates how front-end, storage, and backend networks can be converged onto fewer, high-density switches, simplifying the infrastructure. A critical consideration in such converged environments is Cisco's policy-based load balancing, which builds on Silicon One ASICs. It gives critical traffic, such as GPU-to-GPU training traffic, preferential treatment over storage or front-end traffic, so that AI jobs run with minimal latency and maximum GPU utilization even when they share network resources.
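To make the 1:1 subscription ratio and the scalable-unit idea concrete, here is a minimal Python sketch. It is not Cisco's sizing method: the `LeafSwitch` class, the port counts, the 400 Gbps port speed, and the 256-GPU unit size are all hypothetical values chosen for illustration, not figures from the presentation.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class LeafSwitch:
    """Hypothetical leaf switch; port counts and speeds are illustrative only."""
    gpu_facing_ports: int
    uplink_ports: int
    port_speed_gbps: int

    def subscription_ratio(self) -> float:
        """GPU-facing bandwidth divided by uplink bandwidth (1.0 means 1:1)."""
        downlink = self.gpu_facing_ports * self.port_speed_gbps
        uplink = self.uplink_ports * self.port_speed_gbps
        return downlink / uplink

    def is_non_blocking(self) -> bool:
        """A leaf is non-blocking when uplinks can carry all GPU-facing traffic."""
        return self.subscription_ratio() <= 1.0


def scalable_units_needed(total_gpus: int, gpus_per_unit: int) -> int:
    """Number of repeated blocks needed to host total_gpus (assumed unit size)."""
    return ceil(total_gpus / gpus_per_unit)


# Assumed example values, not taken from any Cisco reference architecture.
balanced = LeafSwitch(gpu_facing_ports=32, uplink_ports=32, port_speed_gbps=400)
oversubscribed = LeafSwitch(gpu_facing_ports=48, uplink_ports=16, port_speed_gbps=400)

print(balanced.subscription_ratio(), balanced.is_non_blocking())
print(oversubscribed.subscription_ratio(), oversubscribed.is_non_blocking())
print(scalable_units_needed(total_gpus=1000, gpus_per_unit=256))
```

With these assumed numbers, the first leaf splits its ports evenly and meets the 1:1 target, while the second dedicates three times as much bandwidth to GPUs as to uplinks and would block under full load.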
Presented by Meghan Kachhi, Technical Marketing Engineering Technical Leader, Cisco, and Richard Licon, Principal Technical Marketing Engineer, Cisco. Recorded live at AI Infrastructure Field Day in Santa Clara on January 28th, 2026. Watch the entire presentation at https://techfieldday.com/appearance/cisco-data-center-networking-presents-at-ai-infrastructure-field-day/ or visit https://techfieldday.com/event/aiifd4/ or https://www.cisco.com/ for more information.