Scenario: I want to perform a simple computation on a file (~100 MB), and I have thousands of such files. For this example, consider the computation to be "count the number of lines".
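For concreteness, the per-file computation can be sketched in a few lines of plain Python (the function name count_lines is my own, not from any library):

def count_lines(path):
    # Read the file in binary mode and count newline-terminated lines.
    with open(path, "rb") as f:
        return sum(1 for _ in f)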
Dask: I am using Dask to parallelize the reads and the computation.
What works: If each file is very large (45 GB), it makes sense to dedicate 1 computer (70 GB memory) in my cluster (12 computers, managed with PBS) to read and compute on that file.
from dask_jobqueue import PBSCluster
from dask.distributed import Client, progress

cluster = PBSCluster(
    cores=1,
    processes=1,
    memory='70GB',
    queue='extra',
    project='abcd',
    walltime='12:00:00',
    interface='ib0',
    local_directory='/home/abcd/dask_workers/',
)
cluster.scale(jobs=12)
print(cluster.job_script())

client = Client(cluster)
Problem: When the files are small (~100 MB each), I want 1 computer in the cluster to read and process multiple files in parallel.
Question: Which parameter will allow 1 computer in the cluster to run multiple reads and computations in parallel?
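To make the desired behavior concrete, here is a sketch of "multiple reads and computes in parallel on one machine" using only the standard library rather than Dask; the helper names (count_lines, count_many) are my own, not from any library:

from concurrent.futures import ThreadPoolExecutor

def count_lines(path):
    # The per-file computation: count lines in one file.
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def count_many(paths, max_workers=8):
    # Run several read+count tasks concurrently on a single machine.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(count_lines, paths)))

I want the equivalent of this, but expressed through the cluster configuration, so that each PBS job handles many small files concurrently instead of one big file.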