
I'm using a Python package called Ray to run the example shown below in parallel. The code is run on a machine with 80 CPU cores and 4 GPUs.

import ray
import time

ray.init()

@ray.remote
def squared(x):
    time.sleep(1)
    y = x**2
    return y

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(1000)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(f'{values[:5]} ... {values[-5:]}')

ray.shutdown()

Output from the above example is shown below. The elapsed time is consistent with 1,000 one-second tasks spread across 80 cores (about 13 waves of 80 tasks at 1 s each):

Elapsed time 13.09 s
[0, 1, 4, 9, 16] ... [990025, 992016, 994009, 996004, 998001]

Below is the same example, but now I would like to run it on the GPU using the `num_gpus` parameter. The GPUs available on the machine are Nvidia Tesla V100s.

import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def squared(x):
    time.sleep(1)
    y = x**2
    return y

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(1000)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(f'{values[:5]} ... {values[-5:]}')

ray.shutdown()

The GPU example never completed, and I terminated it after several minutes. I checked the resources available to Ray with `import ray; ray.init(); ray.available_resources()`, which reports 80 CPUs and 4 GPUs. So Ray does know about the available GPUs.
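For reference, the resource check can be written out like this (a minimal snippet; `ray.available_resources()` returns a dict mapping resource names to available amounts):

import ray

ray.init()

# On this machine the dict should contain entries like
# {'CPU': 80.0, 'GPU': 4.0, ...}
print(ray.available_resources())

ray.shutdown()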

I modified the GPU example to run fewer executions by changing `range(1000)` to `range(10)`. See the revised example below.

import ray
import time

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
def squared(x):
    time.sleep(1)
    y = x**2
    return y

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(10)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')
print(f'{values[:5]} ... {values[-5:]}')

ray.shutdown()

The output from the revised GPU example is:

Elapsed time 10.06 s
[0, 1, 4, 9, 16] ... [25, 36, 49, 64, 81]

The revised GPU example completed, but 10 one-second tasks took about 10 seconds, which suggests they ran one at a time. It looks like Ray is not using the GPU in parallel. Is there something else I should do to get Ray to run on the GPU in parallel?

wigging
  • You can probably try this approach: https://stackoverflow.com/a/77025640/5762249 However, you should keep in mind that using Ray for trivial workloads might even hurt performance, as stated in the "Avoid tiny tasks" section here: https://rise.cs.berkeley.edu/blog/ray-tips-for-first-time-users/ – Kuzman Belev Sep 01 '23 at 19:53
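A minimal sketch of the batching idea from the comment above, assuming the work can be chunked (the chunk size of 100 and the per-chunk `time.sleep(1)` stand-in are illustrative):

import ray
import time

ray.init()

@ray.remote
def squared_batch(xs):
    # One remote call processes a whole chunk, so the per-task
    # scheduling overhead is paid once per chunk instead of once per value.
    time.sleep(1)  # stand-in for the real per-chunk work
    return [x**2 for x in xs]

# Split the 1,000 inputs into 10 chunks of 100.
chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
refs = [squared_batch.remote(chunk) for chunk in chunks]
values = [v for batch in ray.get(refs) for v in batch]

ray.shutdown()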

1 Answer

@ray.remote(num_gpus=1)

That tells Ray that each call of your function will consume an entire GPU. Since `ray.init(num_gpus=1)` gives Ray only one GPU, only one task can be scheduled at a time, so the tasks run serially (which is why 10 one-second tasks took about 10 seconds). The documentation says you should specify a fractional value here so that multiple tasks can share a GPU:

@ray.remote(num_gpus=0.1)

https://docs.ray.io/en/latest/using-ray-with-gpus.html
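Applied to your 1,000-task example, the change could look like the sketch below (untested on your machine; with `num_gpus=0.1`, up to 10 tasks share the one GPU given to `ray.init`, so 1,000 one-second tasks should take roughly 100 seconds):

import ray
import time

ray.init(num_gpus=1)

# Each task claims one tenth of a GPU, so up to 10 tasks
# can run concurrently on the single GPU Ray was given.
@ray.remote(num_gpus=0.1)
def squared(x):
    time.sleep(1)
    return x**2

tic = time.perf_counter()

lazy_values = [squared.remote(x) for x in range(1000)]
values = ray.get(lazy_values)

toc = time.perf_counter()

print(f'Elapsed time {toc - tic:.2f} s')

ray.shutdown()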

Tim Roberts
  • Ah, I see. So in the example where I have 1,000 tasks defined as `range(1000)`, what would be an appropriate value for `num_gpus` using an Nvidia Tesla V100? The specs for the GPU are 640 tensor cores and 5,120 CUDA cores. – wigging Jan 25 '22 at 21:18
  • Would 1 / N be a good place to start for a `num_gpus` value? Where N would be the number of tasks that are run. For example if N = 1000 then 1 / N = 0.001 therefore `num_gpus=0.001`. – wigging Jan 26 '22 at 16:30
  • I think this is a matter for experimentation. You just want to avoid oversubscribing the memory. And if you have 4 GPUs, why not tell that to `ray.init`? – Tim Roberts Jan 26 '22 at 18:50
  • I just want to use one of the GPUs to conserve resources, which is why I used `ray.init(num_gpus=1)`. But yeah, I could tell Ray to use all the GPUs too. I just didn't understand how to properly utilize the function on the GPU. So thank you for the answer. – wigging Jan 26 '22 at 21:29
  • Sounds like you want to use placement_groups. They will reserve a certain amount of resources that your task cannot exceed. – Nima Mousavi Dec 17 '22 at 12:18
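A rough sketch of the placement_groups idea from the last comment, assuming a recent Ray version (the bundle shape and the fractional `num_gpus` amount are illustrative, not tuned):

import ray
import time
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_gpus=1)

# Reserve a fixed slice of the machine: 10 CPUs and 1 GPU.
# Tasks scheduled into this group cannot exceed the reservation.
pg = placement_group([{"CPU": 10, "GPU": 1}])
ray.get(pg.ready())  # wait until the reservation is granted

@ray.remote(num_cpus=1, num_gpus=0.1)
def squared(x):
    time.sleep(1)
    return x**2

strategy = PlacementGroupSchedulingStrategy(placement_group=pg)
refs = [squared.options(scheduling_strategy=strategy).remote(x) for x in range(100)]
values = ray.get(refs)

ray.shutdown()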