
Scenario: I want to perform simple computation on a file (~100 MB). I have thousands of such files. For this example, consider the computation to be "count the number of lines".

Dask: I am using Dask to parallelize read and computation.

What works: If each file is very large (~45 GB), it makes sense to dedicate one computer (70 GB memory) in my cluster (12 computers, managed with PBS) to reading and computing on that file.

from dask_jobqueue import PBSCluster
from dask.distributed import Client, progress

cluster = PBSCluster(
    cores=1,
    processes=1,
    memory='70GB',
    queue='extra',
    project='abcd',
    walltime='12:00:00',
    interface='ib0',
    local_directory='/home/abcd/dask_workers/',
)
cluster.scale(jobs=12)
print(cluster.job_script())
client = Client(cluster)

Problem: When the file is small (~100 MB), I want 1 computer in the cluster to read and do the computation for multiple files in parallel.

Question: Which parameter will allow one computer in the cluster to undertake multiple reads and computations in parallel?
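For reference, here is a Dask-free sketch of the target behaviour on a single machine: counting lines of several small files concurrently. The stand-in files are generated on the fly; only the "count the number of lines" computation comes from the question.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    # The "simple computation" from the question: number of lines in a file.
    with open(path) as f:
        return sum(1 for _ in f)

# Create a few small stand-in files (file i has i + 1 lines).
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(4):
    p = os.path.join(tmpdir, f'file_{i}.txt')
    with open(p, 'w') as f:
        f.write('line\n' * (i + 1))
    paths.append(p)

# Read and compute for multiple files in parallel on one machine.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(count_lines, paths))
print(counts)  # [1, 2, 3, 4]
```

The question is how to get this per-machine concurrency out of a Dask cluster rather than a thread pool.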

M_x
researcher
1 Answer


How about specifying some arbitrary resource? E.g. specify resources={'compute_load': 12} when the cluster is created, to give each worker 12 units of the resource compute_load.

When processing a heavy load, call .compute(resources={'compute_load': 12}), so each worker takes on only one task at a time. When the load is light, use .compute(resources={'compute_load': 1}), and each worker will take on up to 12 tasks at a time. See also this answer.
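The arithmetic behind this can be sketched in plain Python (the function name here is invented for illustration; Dask's scheduler applies the same constraint when deciding how many tasks fit on a worker):

```python
def max_concurrent_tasks(worker_resources, task_resources):
    """How many copies of a task fit on one worker at once, given that
    the scheduler only co-schedules tasks whose declared resource
    requirements fit within what the worker advertises."""
    return min(worker_resources[k] // task_resources[k] for k in task_resources)

# Worker advertises 12 units of the user-defined resource 'compute_load'.
worker = {'compute_load': 12}
print(max_concurrent_tasks(worker, {'compute_load': 12}))  # heavy files: 1 at a time
print(max_concurrent_tasks(worker, {'compute_load': 1}))   # light files: 12 at a time
```

Note that resource names like compute_load are arbitrary labels; Dask does not measure actual CPU or memory use against them.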

SultanOrazbayev
  • I do not think this worked. Here is how I set it up: `cluster = PBSCluster(resources={'compute_load':12},cores=1,memory='10 GB',queue='extra',project='abcd',walltime='10:00:00',interface='ib0',local_directory='/home/abcd/dask_workers/')` `cluster.scale(12)` `client = Client(cluster)` `futures = [client.submit(functionToProcess,eachFile,resources={'compute_load':1}) for eachFile in intake]` – researcher Mar 22 '21 at 02:13
  • That looks good, except I think you want to use `futures = [client.submit(functionToProcess,eachFile,resources={'compute_load':12}) for eachFile in intake]` with the heavy files. – SultanOrazbayev Mar 22 '21 at 03:20
  • also note that with a single core per worker, they won't be able to do anything in parallel... – SultanOrazbayev Mar 22 '21 at 03:21
  • This setup is for light load. Each worker has two processors with 6 cores each. – researcher Mar 22 '21 at 03:32
  • hmm, `cores=1` means that a single core should be assigned per worker... – SultanOrazbayev Mar 22 '21 at 03:36
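Pulling the comments together, a hedged configuration sketch (untested against a real PBS queue; `functionToProcess` and `intake` are the names from the comments above, and the keyword for extra worker arguments is `worker_extra_args` in newer dask_jobqueue releases, `extra` in older ones, so check your installed version):

```python
from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Give each worker multiple cores so it can actually run tasks in
# parallel, and advertise an abstract resource via the dask-worker
# --resources flag.
cluster = PBSCluster(
    cores=12,                      # one worker process with 12 threads
    processes=1,
    memory='70GB',
    queue='extra',
    project='abcd',
    walltime='12:00:00',
    interface='ib0',
    local_directory='/home/abcd/dask_workers/',
    worker_extra_args=['--resources', 'compute_load=12'],
)
cluster.scale(jobs=12)
client = Client(cluster)

# Light files: each task claims 1 unit, so up to 12 run per worker.
futures = [client.submit(functionToProcess, eachFile,
                         resources={'compute_load': 1})
           for eachFile in intake]
```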