AI Infrastructure Field Day 2
Overview of Cloud Storage for AI, Lustre, GCSFuse, and Anywhere Cache with Google Cloud
34m
Marco Abela, Product Manager at Google Cloud Storage, presented an overview of Google Cloud's storage solutions optimized for AI/ML workloads. The presentation addressed the critical role of storage in AI pipelines, emphasizing that an inadequate storage solution can bottleneck GPU utilization, leaving expensive accelerators idle at every stage from initial data preparation to model serving. He highlighted two AI-optimized storage types, each catering to specific workload profiles: object storage (Cloud Storage) for persistent, high-throughput storage with virtually unlimited capacity, and parallel file systems (Managed Lustre) for ultra-low-latency access. Typical storage requirements for AI/ML include vast capacity, high aggregate throughput, millions of requests per second (QPS/IOPS), and low-latency reads, with the relative weight of each varying across training profiles.
The presentation further detailed Cloud Storage FUSE, a client that mounts a bucket as a local file system. Abela noted that Google has invested heavily in the client and that the investment has paid off: it provides file system semantics without requiring applications to be rewritten for object storage. Cloud Storage FUSE now serves as a high-performance client with features including a file cache, parallel download, streaming writes, and hierarchical namespace bucket integration. The file cache improves training times, while parallel download drastically speeds up model loading, achieving up to 9x faster load times than fsspec. Hierarchical namespace buckets provide atomic folder renames, making checkpointing up to 30x faster.
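As a concrete illustration, a Cloud Storage FUSE mount with the file cache and parallel downloads enabled might look like the following sketch. The bucket name, mount point, and cache directory are hypothetical, and the option names follow the gcsfuse configuration-file format as documented at the time of writing, so they should be verified against current documentation.

    # config.yaml: enable the file cache and parallel downloads
    cache-dir: /mnt/localssd/gcsfuse-cache    # fast local SSD backing the cache
    file-cache:
      max-size-mb: -1                         # -1 = limited only by cache-dir capacity
      enable-parallel-downloads: true         # accelerates large reads such as model loads

    # Mount the (hypothetical) bucket as a local file system using that config
    gcsfuse --config-file=config.yaml my-training-bucket /mnt/gcs

    # Hierarchical namespace is chosen at bucket creation time; it enables the
    # atomic folder renames used for fast checkpointing
    gcloud storage buckets create gs://my-training-bucket \
        --location=us-central1 --enable-hierarchical-namespace

Applications then read and write under /mnt/gcs with ordinary file system calls, while the client translates them into Cloud Storage operations.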
Abela then introduced Anywhere Cache, a newly generally available (GA) feature that improves performance by co-locating data on SSD in the same zone as compute. This "turbo button" for Cloud Storage requires no code refactoring and reduces time-to-first-byte latency by up to 70% for regional buckets and 96% for multi-regional buckets. A GenAI customer case study demonstrated its effectiveness for model loading: a 99% cache hit rate, elimination of tail latencies, and reduced network egress costs with multi-regional buckets. The presentation also covered a recommender tool that analyzes a workload's cacheability and suggests an optimal configuration, expected throughput, and potential cost savings.
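Because Anywhere Cache attaches to an existing bucket as a control-plane operation, enabling it involves no application changes. A minimal sketch using the gcloud CLI follows; the bucket name and zone are hypothetical, and the exact command group and flags should be checked against current gcloud documentation.

    # Create an SSD-backed cache in the zone where the GPUs run
    gcloud storage buckets anywhere-caches create gs://my-training-bucket \
        us-central1-a --ttl=24h    # --ttl controls how long unread data stays cached

    # Inspect the caches attached to the bucket
    gcloud storage buckets anywhere-caches list gs://my-training-bucket

Once the cache is created, repeated reads are served from zonal SSD rather than the regional or multi-regional bucket, which is what drives the latency and egress reductions described above.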
Presented by Marco Abela, Product Manager, Google Cloud Storage, Google Cloud. Recorded live in Santa Clara, California, on April 22, 2025, as part of AI Infrastructure Field Day. Watch the entire presentation at https://techfieldday.com/appearance/google-cloud-presents-at-ai-infrastructure-field-day-2/ or https://techfieldday.com/event/aiifd2/ for more information.
Up Next in AI Infrastructure Field Day 2
- Intro to Managed Lustre with Google C...
  Dan Eawaz, Senior Product Manager at Google Cloud, introduced Managed Lustre with Google Cloud, a fully managed parallel file system built on DDN Exascaler. The aim is to solve the demanding requirements of data preparation, model training, and inference in AI workloads. Managed Lustre provides h...
- The latest in high-performance storag...
  Michal Szymaniak, Principal Engineer at Google Cloud, presented on Rapid Storage, a new zonal storage product within the Cloud Storage portfolio, powered by Google's foundational distributed file system, Colossus. The goal in designing Rapid Storage was to create a storage system that offers the ...
- Analytics Storage and AI, Data Prep a...
  Vivek Sarswat, Group Product Manager at Google Cloud Storage, presented on analytics storage and AI, focusing on data preparation and data lakes. He emphasized the close ties between analytics and AI workloads, highlighting key innovations built to address related challenges. The presentation dem...