
Google Compute Engine supports RAM disks - see here.
I am developing a project that will reuse existing code that manipulates local files.
For scalability, I am going to use Dataflow.
The files are in GCS, and I will send them to the Dataflow workers for manipulation.
I was thinking of improving performance by using RAM disks on the workers: copy the files from GCS directly to the RAM disk and manipulate them there.
I have failed to find any example of such a capability.

Is this a valid solution, or should I avoid this kind of "trick"?

Shushu

2 Answers


It is not possible to use a RAM disk as the disk type for the workers, since a RAM disk is set up at the OS level. The only disk types available for the workers are standard persistent disks (pd-standard) and SSD persistent disks (pd-ssd). Of these, SSD is definitely faster. You can also try adding more workers or using a faster CPU to process your data more quickly.
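
For reference, here is a minimal sketch of selecting the worker disk type from the Beam Python SDK. The project, region, zone and bucket names are placeholders, and the `--worker_disk_type` flag and its resource-URL value format are based on my own setup, so double-check them against the pipeline options docs for your SDK version:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/zone/bucket values - replace with your own.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    # Request SSD persistent disks (pd-ssd) for the workers.
    '--worker_disk_type=compute.googleapis.com/projects/my-project/'
    'zones/us-central1-a/diskTypes/pd-ssd',
    '--disk_size_gb=50',
])

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/result'))
```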

For comparison, I tried running a job with each disk type, and it turned out to be about 13% faster with SSD than with the standard disk. Note, however, that I only tested the quickstart from the Dataflow docs.

Using SSD: 3m 54s elapsed time

Using Standard Disk: 4m 29s elapsed time

Ricco D

While what you want to do might be technically possible by creating a setup.py with custom commands, it will not help you increase performance. Beam already uses as much of the workers' RAM as it can in order to perform effectively. If you are reading a file from GCS and operating on it, that file is already going to be loaded into RAM. By earmarking a big chunk of RAM for a ramdisk, you will probably make Beam run slower, not faster.
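
To illustrate that point (this is just a sketch with placeholder bucket paths, not your actual pipeline), reading the GCS files through Beam's built-in IO keeps everything in worker memory without any manual copy to local disk:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     # ReadFromText reads the GCS objects directly; Beam buffers the data
     # in worker RAM, so there is no need to stage files on a local disk
     # (or a ramdisk) first.
     | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/data/*.txt')
     | 'CountChars' >> beam.Map(len)
     | 'Sum' >> beam.CombineGlobally(sum)
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/out/char_count'))
```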

If you just want things to run faster, try using SSD disks, increasing the number of workers, or using the c2 machine family.
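
Those knobs are all exposed as pipeline options; a brief sketch with assumed placeholder values (c2 availability depends on your region):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values - tune for your workload and region.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--worker_machine_type=c2-standard-4',  # c2 machine family
    '--num_workers=5',
    '--max_num_workers=20',
])
```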

Travis Webb
  • 14,688
  • 7
  • 55
  • 109
  • What do you mean by "that file is already going to be loaded into RAM"? The code I am trying to implement via Dataflow handles local files, so I will copy the file from GCS to local storage. Does Dataflow "automagically" use a kind of RAM drive? – Shushu Jan 17 '21 at 08:11
  • You shouldn't be manually copying files to disk, use the built-in `beam.io.Read()` (or `GcsIO()` directly if you need to read binary files) to read files from GCS. It has all kinds of optimizations to parallelize file reads and buffer the file in RAM that you don't need to implement yourself. – Travis Webb Jan 17 '21 at 19:29
  • like, if you are manually installing and using the `google-cloud-storage` python module, don't do that. Everything you need to read files from GCS is built into Beam. – Travis Webb Jan 17 '21 at 19:31
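
A minimal sketch of what the last two comments describe, with placeholder GCS paths and `GcsIO` used for binary reads:

```python
import apache_beam as beam
from apache_beam.io.gcp.gcsio import GcsIO

class ProcessBinaryFile(beam.DoFn):
    """Reads a binary object straight from GCS inside the worker."""
    def process(self, gcs_path):
        # GcsIO handles authentication and buffering; no local copy needed.
        with GcsIO().open(gcs_path, 'rb') as f:
            data = f.read()
        yield gcs_path, len(data)

with beam.Pipeline() as p:
    (p
     | 'Files' >> beam.Create(['gs://my-bucket/blobs/a.bin'])  # placeholder path
     | 'Process' >> beam.ParDo(ProcessBinaryFile())
     | 'Print' >> beam.Map(print))
```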