I've noticed a significant performance degradation with the following script when increasing my cluster size from a single node to a 3-node cluster: running times are 2 min and 6 min respectively. I've also noticed that CPU activity is very low in both cases.
The task here is a word count over a single text file (10GB, 100M lines, ~2B words). I've made the file available to all nodes prior to launching the script.
What could possibly impede Dask from scaling this out?
from dask.distributed import Client
import dask.dataframe as dd

# connect to the running scheduler (scheduler_address is a placeholder, like file_url)
client = Client(scheduler_address)

# lazily read the file; header=None yields a single integer-named column, 0
df = dd.read_csv(file_url, header=None)

# count words: split each line into words, explode to one word per row, then count occurrences
new_df = (
    df[0]
    .str.split()
    .explode()
    .value_counts()
)

print(new_df.compute().head())
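My (possibly incorrect) understanding is that `dd.read_csv` splits a file into `blocksize`-sized partitions and the scheduler spreads those tasks across the workers, so a 10 GB file should produce plenty of parallel work. Below is a minimal sketch of how I check that, reusing the `scheduler_address` and `file_url` placeholders from above (the explicit `blocksize` is illustrative, not something I have tuned):

from dask.distributed import Client
import dask.dataframe as dd

client = Client(scheduler_address)  # placeholder address, same as in the script above

# read with an explicit (illustrative) block size and inspect the resulting layout
df = dd.read_csv(file_url, header=None, blocksize="64MB")

print(df.npartitions)                           # ~160 partitions expected for 10 GB at 64 MB blocks
print(len(client.scheduler_info()["workers"]))  # number of workers actually connected
print(client.dashboard_link)                    # dashboard URL for watching task placement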