Scenario: I want to perform a simple computation on a file (~100 MB), and I have thousands of such files. For this example, consider the computation to be "count the number of lines".
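For concreteness, the per-file computation can be sketched in a few lines of plain Python (the function name count_lines is my own, not from any library):

def count_lines(path):
    # Read the file in binary mode and count newline-terminated lines.
    with open(path, "rb") as f:
        return sum(1 for _ in f)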
Dask: I am using Dask to parallelize the reads and the computation.
What works: If each file is very large (45 GB), it makes sense to dedicate 1 computer (70 GB memory) in my cluster (12 computers, managed with PBS) to read and compute on that file.
from dask_jobqueue import PBSCluster
from dask.distributed import Client, progress

cluster = PBSCluster(
    cores=1,
    processes=1,
    memory='70GB',
    queue='extra',
    project='abcd',
    walltime='12:00:00',
    interface='ib0',
    local_directory='/home/abcd/dask_workers/',
)
cluster.scale(jobs=12)
print(cluster.job_script())

client = Client(cluster)
Problem: When the files are small (~100 MB each), I want 1 computer in the cluster to read and process multiple files in parallel.
Question: Which parameter will allow 1 computer in the cluster to run multiple reads and computations in parallel?
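To make the desired behavior concrete, here is a sketch of "multiple reads and computes in parallel on one machine" using only the standard library rather than Dask; the helper names (count_lines, count_many) are my own, not from any library:

from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    # The per-file computation: count lines in one file.
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_many(paths, max_workers=8):
    # Run several read+count tasks concurrently on a single machine.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(count_lines, paths)))

I want the equivalent of this, but expressed through the cluster configuration, so that each PBS job handles many small files concurrently instead of one big file.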