AI Infrastructure Field Day 2
Maximize AI Cluster Performance using Juniper Self-Optimizing Ethernet with Juniper Networks
36m
Vikram Singh, Sr. Product Manager, AI Data Center Solutions at Juniper Networks, discussed maximizing AI cluster performance using Juniper's self-optimizing Ethernet fabric. As AI workloads scale, high GPU utilization and minimized congestion are critical to maximizing performance and ROI. Juniper’s advanced load balancing innovations deliver a self-optimizing Ethernet fabric that dynamically adapts to congestion and keeps AI clusters running at peak efficiency.
The presentation addressed the unique challenges of AI/ML traffic. It is primarily UDP-based, with low entropy and bursty flows, and data-parallel training is synchronous: GPUs must synchronize gradients after each iteration. This synchronization makes job completion time a key metric, because a delay in a single flow can idle many GPUs. Traditional Ethernet, designed around TCP's in-order delivery requirements, doesn't handle this type of traffic efficiently, leading to congestion and performance degradation. Alternatives such as packet spraying with specialized NICs or distributed scheduled fabrics are expensive and proprietary.
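To make the low-entropy problem concrete, here is a minimal Python sketch, not taken from the presentation. It assumes a switch picks an uplink with a static hash over the packet 5-tuple (the ECMP approach), that flows sharing a link split its bandwidth equally, and that all flows are the same size. The function names, the CRC32 hash, and the addresses are illustrative assumptions; real switch hash functions differ. UDP destination port 4791 is the RoCEv2 port.

```python
import zlib
from collections import Counter

def ecmp_link(five_tuple, n_links):
    """Static ECMP: pick an uplink from a hash of the flow's 5-tuple.
    CRC32 is a stand-in here, not a real switch hash."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % n_links

def link_loads(flows, n_links):
    """Count how many flows the hash places on each uplink."""
    counts = Counter(ecmp_link(f, n_links) for f in flows)
    return [counts.get(i, 0) for i in range(n_links)]

def iteration_time(flows, n_links, flow_bytes, link_bytes_per_s):
    """A synchronous step ends when the slowest flow ends.
    Assumes equal-size flows that share a link's bandwidth equally,
    so the busiest link determines the finish time."""
    worst = max(link_loads(flows, n_links))
    return worst * flow_bytes / link_bytes_per_s

# Eight long-lived RDMA flows (hypothetical addresses) over four uplinks.
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", 17, 49152, 4791) for i in range(8)]
print(link_loads(flows, 4))
print(iteration_time(flows, 4, flow_bytes=1e9, link_bytes_per_s=50e9))
```

With only a handful of large flows, nothing forces the hash to place two flows on each link. Any link that receives more than its share stretches the whole iteration, which is why a single slow flow can idle many GPUs.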
Juniper offers an open, standards-based approach on Ethernet called AI load balancing. It includes dynamic load balancing (DLB), which enhances static ECMP by tracking link utilization and buffer pressure at microsecond granularity to make informed forwarding decisions. DLB operates in flowlet mode (breaking flows into subflows based on configurable pauses) or packet mode (packet spraying). Global load balancing (GLB) builds on DLB by exchanging link quality data between leaves and spines, so leaves can make more informed decisions and avoid congested paths. Juniper's RDMA-aware load balancing (RLB) uses deterministic routing by assigning IP addresses to subflows, eliminating randomness and ensuring consistent high performance, in-order delivery, and non-rail performance without expensive hardware.
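The flowlet idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Juniper's implementation: the class name, the `gap_us` parameter, and the use of accumulated bytes as the link quality metric are all hypothetical, and the load counters never drain as real queues would. A flow stays on its current link while its packets arrive close together, and is re-assigned to the least-loaded link only after a pause at least as long as the configured gap, which is what keeps packets in order within each flowlet.

```python
class FlowletBalancer:
    """Toy flowlet-mode load balancer (illustrative only)."""

    def __init__(self, n_links, gap_us):
        self.load = [0] * n_links  # bytes sent per link: a stand-in quality metric
        self.gap_us = gap_us       # pause that starts a new flowlet
        self.table = {}            # flow_id -> (link, last_seen_us)

    def _best_link(self):
        # Least-loaded link; ties resolve to the lowest index.
        return min(range(len(self.load)), key=lambda i: self.load[i])

    def forward(self, flow_id, now_us, size):
        entry = self.table.get(flow_id)
        if entry is not None and now_us - entry[1] < self.gap_us:
            link = entry[0]            # same flowlet: keep the link, keep order
        else:
            link = self._best_link()   # new flowlet: safe to move the flow
        self.table[flow_id] = (link, now_us)
        self.load[link] += size
        return link

lb = FlowletBalancer(n_links=2, gap_us=50)
first = lb.forward("flow-a", now_us=0, size=1000)
second = lb.forward("flow-a", now_us=10, size=1000)   # 10 us gap: same flowlet
third = lb.forward("flow-a", now_us=100, size=1000)   # 90 us gap: new flowlet
print(first, second, third)
```

In this model, packet mode corresponds to setting the gap to zero, so every packet is treated as a new flowlet and sent to whichever link currently looks best.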
Presented by Vikram Singh, Sr. Product Manager, AI Data Center Solutions, Juniper Networks. Recorded live in Santa Clara, California, on April 23, 2025, as part of AI Infrastructure Field Day. Watch the entire presentation at https://techfieldday.com/appearance/juniper-networks-presents-at-ai-infrastructure-field-day-2/ or visit https://techfieldday.com/event/aiifd2/ for more information.