I need to read multiple CSVs from S3 and store each one as a separate dataframe in a list of dataframes. Reading the CSVs one by one works. To speed things up, I tried to read them in parallel by recreating the parallel process in this answer. However, when I do this, the process just hangs. What might be wrong? Is there something in dask that prevents this from working?
# Load libraries
import pandas as pd
import dask.dataframe as dd
from multiprocessing import Pool
# Define function
def read_csv(table):
    path = 's3://my-bucket/{}/*.csv'.format(table)
    df = dd.read_csv(path, assume_missing=True).compute()
    return df
# Define tables
tables = ['sales', 'customers', 'inventory']
# Run function to read one-by-one (this works)
df_list = []
for t in tables:
    df_list.append(read_csv(t))
# Try to run function in parallel (this hangs, never completes)
with Pool(processes=3) as pool:
    df_list = pool.map(read_csv, tables)
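
For comparison, here is a minimal sketch of a thread-based variant that might be worth trying, since the reads are I/O-bound and threads avoid starting separate worker processes. This is an assumption on my part, not a confirmed fix: I have not verified that it avoids the hang. The helper name `read_all` and its `reader` and `max_workers` parameters are hypothetical and only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(tables, reader, max_workers=3):
    """Call reader(table) for every table using a thread pool.

    `reader` is any callable that takes a table name and returns a
    dataframe (for example the read_csv function defined above).
    Results come back in the same order as `tables`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, like Pool.map
        return list(pool.map(reader, tables))
```

With the function from the question, the call would be `df_list = read_all(tables, read_csv)`.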