
I have a case where I need to read multiple CSVs from S3 and store each one separately as a dataframe in a list of dataframes. When I read each CSV one-by-one, it works. I'm trying to read them in parallel to speed things up, and tried to recreate the parallel process in this answer. However, when I do this, the process just hangs. What might be wrong? Is there something in dask that doesn't allow this to work?

# Load libraries
import pandas as pd
import dask.dataframe as dd
from multiprocessing import Pool

# Define function    
def read_csv(table):
    path = 's3://my-bucket/{}/*.csv'.format(table)
    df = dd.read_csv(path, assume_missing=True).compute()
    return df

# Define tables
tables = ['sales', 'customers', 'inventory']

# Run function to read one-by-one (this works)
df_list = []
for t in tables:
    df_list.append(read_csv(t))

# Try to run function in parallel (this hangs, never completes)
with Pool(processes=3) as pool:
    df_list = pool.map(read_csv, tables)
Gaurav Bansal
    I am not sure if I have seen that before... If you comment out `df_list = []` and the for loop, the code executes without a problem. I am as lost as you are... – Patol75 Dec 06 '19 at 09:59
  • How many CPUs do you have? – em_bis_me Dec 06 '19 at 12:09
  • Thanks, commenting out `df_list = []` and the for loop works. Still baffles me as to why. I have 96 CPUs, so that's not the problem. – Gaurav Bansal Dec 06 '19 at 14:41
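
One possible explanation (an assumption on my part, not confirmed in this thread): on Linux, `multiprocessing.Pool` forks the parent process, and forking after Dask and s3fs have already started background threads can leave the workers deadlocked, which would also explain why skipping the serial loop avoids the hang. A minimal sketch to test that theory by forcing the "spawn" start method instead:

from multiprocessing import get_context

# "spawn" starts fresh interpreter processes rather than forking, so the
# workers do not inherit any thread state from the parent process
if __name__ == '__main__':
    with get_context('spawn').Pool(processes=3) as pool:
        df_list = pool.map(read_csv, tables)

Note that with "spawn" the function passed to pool.map must be importable from the main module, hence the __main__ guard.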

1 Answer


It is odd to nest Dask inside another parallel solution; this is likely to result in suboptimal performance. Instead, if you're looking to use processes, I recommend changing Dask's default scheduler to multiprocessing and then using dd.read_csv as normal.

import dask  # dask.compute lives in the top-level namespace

dfs = [dd.read_csv('s3://my-bucket/{}/*.csv'.format(table), assume_missing=True) for table in tables]
(dfs,) = dask.compute(dfs, scheduler="processes")  # dask.compute returns a tuple
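
To make "processes" the default scheduler, as recommended above, rather than passing it to every compute call, a small sketch using dask.config (reusing the bucket paths from the question):

import dask
import dask.dataframe as dd

# Make the multiprocessing scheduler the default for all compute calls
dask.config.set(scheduler='processes')

lazy = [dd.read_csv('s3://my-bucket/{}/*.csv'.format(t), assume_missing=True) for t in tables]
(dfs,) = dask.compute(lazy)  # runs with the configured default scheduler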

For more on Dask's schedulers, see https://docs.dask.org/en/latest/scheduling.html
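
That page also covers the distributed scheduler; a sketch of the same reads on a local cluster, assuming dask.distributed is installed and reusing `lazy` from the sketch above:

from dask.distributed import Client

client = Client()  # starts a local cluster and registers itself as the default scheduler
(dfs,) = dask.compute(lazy)
client.close()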

MRocklin
  • That works, though for my purposes at least, using "processes" takes more than three times as long as reading each table one by one, while "threads" takes about the same time as the one-by-one approach. – Gaurav Bansal Dec 08 '19 at 15:47
  • The relevant costs are explained in the doc that I linked to above. – MRocklin Dec 09 '19 at 16:02
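
One plausible reading of those costs (my interpretation of the linked docs, not stated in this thread): with the processes scheduler, every computed pandas dataframe must be pickled and shipped back to the parent process, which for large CSV reads can dominate the runtime. The threaded scheduler shares one address space and avoids that copy, again reusing `lazy` from the sketch above:

# Threads share memory, so the results need no inter-process transfer
(dfs,) = dask.compute(lazy, scheduler='threads')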