Dan Eawaz, Senior Product Manager at Google Cloud, introduced Google Cloud Managed Lustre, a fully managed parallel file system built on DDN EXAScaler. It targets the demanding storage requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides high throughput to keep GPUs and TPUs fully utilized and supports fast checkpoint writes and reads.
Many customers currently run parallel file systems (PFSs) such as Lustre on-premises. Google Cloud Managed Lustre makes it easier for them to bring those workloads to the cloud without re-architecting, and it optimizes TCO by maximizing the utilization of expensive GPUs and TPUs. The offering is a persistent service co-located with compute for optimal latency. It scales from 18 terabytes to petabyte scale, with sub-millisecond latency and an initial throughput of one terabyte per second.
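To make the throughput figure concrete, the following is a rough back-of-the-envelope sketch in Python of how long a checkpoint write would take at a given aggregate throughput. The function name, the 2 TB checkpoint size, and the assumption that the full one terabyte per second is available to a single checkpoint are illustrative and do not come from the presentation.

```python
def checkpoint_write_seconds(checkpoint_bytes: float,
                             throughput_bytes_per_s: float) -> float:
    """Idealized transfer time: size divided by sustained throughput.

    Ignores metadata overhead, contention, and client-side limits,
    so real-world times will be longer.
    """
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return checkpoint_bytes / throughput_bytes_per_s


TB = 10**12  # decimal terabyte, in bytes

# Hypothetical 2 TB checkpoint at the quoted 1 TB/s initial throughput.
estimate = checkpoint_write_seconds(2 * TB, 1 * TB)
print(f"Estimated checkpoint write time: {estimate:.1f} s")
```

The shorter this window is, the less time accelerators sit idle while a checkpoint is saved or restored.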
The service is fully managed: customers specify their region, capacity, and throughput needs, and Google deploys the capacity in the background and provides a mount point for easy integration with GCE or GKE. Google Cloud Managed Lustre has a 99.9% availability SLA in a single zone and is fully POSIX compliant. The service integrates with GKE via a CSI driver and supports Slurm through the Cluster Toolkit. It also has a built-in integration for batch data transfer to and from Google Cloud Storage.
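Because the file system is POSIX compliant and exposed as a mount point, ordinary file I/O should work against it unchanged. As a minimal sketch, assuming the file system is mounted at a hypothetical path such as /mnt/lustre, a training job could write checkpoints with the usual write-then-rename pattern. The function name and paths here are illustrative and not part of any Google Cloud API.

```python
import os
import tempfile


def write_checkpoint_atomically(directory: str, name: str, payload: bytes) -> str:
    """Write payload to directory/name so readers never see a partial file.

    The data goes to a temporary file in the same directory, is flushed
    to storage, and is then renamed over the final path.
    """
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name)
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Rename within the same directory replaces the target in one step.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


if __name__ == "__main__":
    # A temporary directory stands in for a hypothetical mount such as
    # /mnt/lustre/checkpoints so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = write_checkpoint_atomically(ckpt_dir, "step_100.ckpt", b"model-state")
        print("wrote", os.path.basename(path))
```

Writing to a temporary file first means a job that is interrupted mid-write leaves the previous checkpoint intact.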
Presented by Dan Eawaz, Senior Product Manager, Google Cloud Managed Lustre, Google Cloud. Recorded live in Santa Clara, California, on April 22, 2025, as part of AI Infrastructure Field Day. Watch the entire presentation at https://techfieldday.com/appearance/google-cloud-presents-at-ai-infrastructure-field-day-2/ or https://techfieldday.com/event/aiifd2/ for more information.