AI Infrastructure Field Day 2
Building Trust at Scale: How Crusoe Validates Network Infrastructure for AI Workloads with Keysight
38m
In this session, Crusoe shares how they are actively testing the frontend networks and inter-VM/host data transfers that feed their GPU clusters. By validating the performance, reliability, and scalability of their infrastructure early, Crusoe aims to identify and resolve issues internally, minimizing the chance that end customers will discover them first. This proactive validation is a differentiator for them and enables a more robust, production-ready AI platform. Crusoe is a vertically integrated AI infrastructure company powered by sustainable energy sources, including wind, solar, and geothermal. They build AI data centers, with a large project underway in Abilene, Texas.
Crusoe's AI cloud platform offers infrastructure as a service, where customers consume GPU supercomputing via virtual machines. They also provide managed AI solutions such as AI as a service, inference, and workloads. Their mission is to build the world's favorite AI cloud, purpose-built for AI, with enterprise-scale infrastructure. The company focuses on the design and engineering of data center networks, software-defined networking, and GPU-to-GPU fabrics, all optimized using NVIDIA reference architectures. They emphasize customer support, offering 24/7 assistance with the complexity of GPU systems and any issues that arise.
Crusoe partners with Keysight to conduct rigorous testing for performance and stability, focusing particularly on stateful traffic and high connection rates. They simulate various workloads to stress the system and identify its breaking points, with the goals of delivering deterministic performance and preventing noisy-neighbor issues in their multi-tenant environment. This proactive approach allows Crusoe to understand the system's limits and provide transparent performance data to customers, ensuring a world-class service and preventing users from becoming beta testers. They use Keysight CyPerf as a traffic generator to understand the behavior of open-source Open vSwitch (OVS) and NVIDIA's stack, and to optimize testing. Future plans include incorporating Blackwell platforms, advancing telemetry and monitoring, and focusing on storage optimization, scale, and security.
Presented by Gavin McKee, Cloud Network Infrastructure Architect AI/ML/HPC, Crusoe. Recorded live in Santa Clara, California, on April 25, 2025, as part of AI Infrastructure Field Day. Watch the entire presentation at https://techfieldday.com/appearance/keysight-presents-at-ai-infrastructure-field-day-2/, or see https://techfieldday.com/event/aiifd2/ or https://www.keysight.com/us/en/assets/3125-1157/application-notes/High-Performance-Networking-Offloads-for-AI-ML-Focused-Cloud-Platforms.pdf for more information.