
We are currently using GCS FUSE with Google Cloud Storage during our training and are seeing very slow performance. The issue appears to be on Google's side, and they are actively working on the FUSE bug.

Has anyone tried setting up an NFS share for custom training on Vertex AI? What kind of performance benefit would that provide?

1 Answer


NFS will indeed improve performance in terms of file-access latency compared to Cloud Storage. However, access to NFS (Filestore) is more difficult than access to Cloud Storage: Filestore is only reachable over its private IP, so your workload needs a VM on the same network.


However, you could continue to use Cloud Storage, it's a common and recommended pattern. But you have to follow some best practices:

  • The bucket must be in the same region as your training job, to minimize network latency and avoid egress fees.
  • Read the bucket objects only at the beginning of each epoch: download them, store them locally (on a local disk, in memory, ...), and then run your training loops. Do not access objects through GCS FUSE as a local file system inside the training loop; that is too slow.
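The download-then-train pattern above can be sketched as follows. This is a minimal illustration, not code from the answer: `shard_for_epoch` and `download_shard` are hypothetical helper names, the bucket and paths are placeholders, and `download_shard` assumes the `google-cloud-storage` Python client library is installed and authenticated.

```python
import os
import random


def shard_for_epoch(blob_names, shard_size, seed):
    """Pick a deterministic random subset of object names for one epoch.

    Hypothetical helper: shuffles a copy of the name list with a seeded
    RNG and keeps the first `shard_size` entries.
    """
    rng = random.Random(seed)
    names = list(blob_names)
    rng.shuffle(names)
    return names[:shard_size]


def download_shard(bucket_name, blob_names, local_dir):
    """Copy the chosen objects to local disk once, before the training loop.

    Assumes the google-cloud-storage client library; the training loop
    then reads only the returned local paths, never GCS FUSE.
    """
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    os.makedirs(local_dir, exist_ok=True)
    local_paths = []
    for name in blob_names:
        dest = os.path.join(local_dir, os.path.basename(name))
        bucket.blob(name).download_to_filename(dest)
        local_paths.append(dest)
    return local_paths
```

A training script would call `download_shard` once per epoch and then iterate over the local files, so object-store latency is paid only at the epoch boundary.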
guillaume blaquiere
  • My issue is that I need access to the Data during the training, and it is too large to fit in memory. Does this mean I should use a larger machine and download all the data? I need access to all the data during my ML training. – gggggggggggggggg Aug 21 '23 at 18:56
  • Do you need all the data for each epoch? I have a project where I train a model on videos, and I have 'only' a 1.5 TB disk for the training. I load the data required for my training loops, then shuffle them and download another part (randomly, of course). That way, I have low-latency access to my data during the epoch, and it's only when I need to re-shuffle the data that I load new ones. – guillaume blaquiere Aug 21 '23 at 19:11
  • @guillaumeblaquiere Is it much faster than regular data loaders? Is the time you need to wait at the beginning of each epoch lower than cumulative time it would take to load the data directly from GCS? I guess that sequential access is probably faster than fetching individual items and you can do it in parallel, but separate worker processes should have similar effect. Not the first time I'd be surprised with data access though. – pkubik Aug 29 '23 at 14:37
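The rolling-shard scheme described in the comments (load a random slice per epoch, train on it locally, then reload a fresh random slice) can be sketched with a pure-Python generator. All names here are illustrative, not from the answer, and the actual download step is left to whatever transfer code you use:

```python
import random


def epoch_shards(all_items, shard_size, num_epochs, seed=0):
    """Yield a fresh random shard of the dataset for each epoch.

    Mimics the comment's reload-and-reshuffle scheme: each epoch gets a
    random subset of size `shard_size` that fits on local disk; items
    outside the shard are simply fetched in a later epoch.
    """
    rng = random.Random(seed)
    for _ in range(num_epochs):
        items = list(all_items)
        rng.shuffle(items)
        yield items[:shard_size]


# Usage sketch: download each shard, then run the training loop on it.
# for shard in epoch_shards(all_object_names, shard_size=1000, num_epochs=10):
#     local_files = download(shard)   # your transfer code here
#     train_one_epoch(local_files)
```

Whether this beats streaming directly from GCS (the question raised in the last comment) depends on how sequential the per-epoch download is versus how random the in-loop reads would be; the scheme trades a burst of bulk transfer at each epoch boundary for local-disk latency inside the loop.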