This session provides an overview of the Keysight AI fabric test methodology and demonstrates key findings and improvements achieved through automated testing and a search for optimal configuration parameters. Alex Bortok, Lead Product Manager at Keysight Technologies, introduces the methodology using the KAI Data Center Builder product. The methodology guides users through the phases of designing and building an AI fabric, emphasizing the importance of topology selection, collective operation algorithms, performance isolation, load balancing, and congestion control. The methodology and related white papers are available for download via a QR code or the link below.
The presentation covers key terminology, including collective operations (broadcast, all-reduce, all-to-all), ranks, collective size, and data size. Metrics such as collective completion time, algorithm bandwidth, and bus bandwidth are defined and used to measure performance. Alex explains that bus bandwidth is a useful metric because it factors out the number of GPUs and reflects the limiting factor that determines how long the collective operation will take. The testbed consists of four switches with 800 Gbps ports, emulating 16 GPUs/network cards running at 400 Gbps to assess fabric performance.
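As a rough illustration of how these metrics relate, the sketch below computes algorithm bandwidth and bus bandwidth from a collective completion time. It assumes the correction factors popularized by the NCCL performance tests (2(n-1)/n for all-reduce, (n-1)/n for all-to-all, 1 for broadcast); the presentation summary does not state which exact formulas Keysight applies, and the function names and sample numbers are hypothetical.

```python
# Sketch only: factors follow the convention used by the NCCL performance
# tests, which is an assumption here, not a statement of Keysight's method.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_to_all": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algorithm_bandwidth(data_size_bytes, completion_time_s):
    """Bytes per second: data size divided by collective completion time."""
    return data_size_bytes / completion_time_s


def bus_bandwidth(collective, data_size_bytes, completion_time_s, ranks):
    """Algorithm bandwidth scaled by a per-collective factor so the result
    can be compared against link speed regardless of the number of ranks."""
    factor = BUS_FACTORS[collective](ranks)
    return algorithm_bandwidth(data_size_bytes, completion_time_s) * factor


def to_gbps(bytes_per_second):
    """Convert bytes per second to gigabits per second."""
    return bytes_per_second * 8 / 1e9


# Illustrative numbers only (not results from the presentation):
# a 1 GB all-reduce across 16 ranks that completes in 50 ms.
busbw = bus_bandwidth("all_reduce", 1e9, 0.05, 16)
print(f"bus bandwidth: {to_gbps(busbw):.1f} Gbps")
```

Under these assumed inputs, the bus bandwidth can be read directly against the 400 Gbps speed of each emulated network card, which is what makes the metric convenient for comparing runs with different rank counts.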
A demonstration highlights the impact of congestion control on network performance. By comparing scenarios with and without congestion control enabled, the presentation illustrates how fine-tuning DCQCN parameters can optimize bandwidth utilization and reduce congestion. The speaker uses the tool to test different settings on the fabric and arrive at the optimal configuration. The presentation concludes by mentioning Ultra Ethernet Consortium membership and upcoming webinars detailing Keysight's innovations in AI.
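The search for an optimal configuration can be pictured as a simple parameter sweep. The sketch below is a generic exhaustive grid search, not the Keysight tool's API: the parameter names (`ecn_kmin_kb`, `ecn_kmax_kb`), the candidate values, and the `toy_measure` function are all hypothetical stand-ins. In a real setup, the measurement function would apply the settings to the fabric, run a collective, and return its completion time.

```python
from itertools import product


def sweep(param_grid, measure):
    """Try every combination in param_grid and return (best_params, best_time).

    param_grid maps a parameter name to a list of candidate values.
    measure is a caller-supplied function that takes a dict of settings and
    returns a collective completion time (lower is better).
    """
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        completion_time = measure(params)
        if best is None or completion_time < best[1]:
            best = (params, completion_time)
    return best


def toy_measure(params):
    """Placeholder for a real test run; returns a made-up score, not data."""
    return abs(params["ecn_kmin_kb"] - 100) + abs(params["ecn_kmax_kb"] - 400)


# Hypothetical ECN marking thresholds to try (values chosen for illustration).
grid = {"ecn_kmin_kb": [50, 100, 200], "ecn_kmax_kb": [200, 400, 800]}
best_params, best_time = sweep(grid, toy_measure)
```

An exhaustive grid is the simplest choice and is practical only for a handful of parameters; each added parameter multiplies the number of test runs.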
Presented by Alex Bortok, Lead Product Manager, AI Data Center Solutions, Keysight Technologies. Recorded live in Santa Clara, California on April 25, 2025 as part of AI Infrastructure Field Day. Watch the entire presentation at https://techfieldday.com/appearance/keysight-presents-at-ai-infrastructure-field-day-2/. For more information, see https://techfieldday.com/event/aiifd2/ or https://www.keysight.com/us/en/assets/3124-1729/application-notes/AI-Fabric-Test-Methodology.pdf.