I'm new to Dask. My motivation was to read large CSV files faster by parallelizing the process. After reading a file, I call compute() to merge the partitions into a single pandas DataFrame. Then, when I write that DataFrame out with pandas' to_csv, the resulting CSV file isn't readable:
$ file -I *.csv
my_big_file.csv: ERROR: cannot read `my_big_file.csv' (Operation canceled)
$ head -n2 my_big_file.csv
head: Error reading my_big_file.csv
My original code looks like the following:
import pandas as pd
import dask.dataframe as daf

filepath = '/Users/coolboy/Customer Data/my_original_file.csv'

# Read the CSV in parallel with Dask, then compute() to collapse
# the partitions into a single in-memory pandas DataFrame.
df = daf.read_csv(filepath, dtype=str, low_memory=False,
                  encoding='utf-8-sig', error_bad_lines=False).compute()
print('done reading')

# Write the combined DataFrame back out with pandas.
df.to_csv('/Users/coolboy/Customer Data/my_big_file.csv', index=False)
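For reference, here is a minimal sketch of what I understand to be the all-Dask alternative: skipping the compute() step and letting Dask write the file itself. My assumption (from the Dask docs) is that single_file=True merges the partitions into one CSV instead of writing one file per partition. I'd also like to know whether this is the more appropriate pattern:

import dask.dataframe as daf

filepath = '/Users/coolboy/Customer Data/my_original_file.csv'

# Keep the data as a lazy Dask DataFrame instead of collapsing
# it into pandas with compute().
ddf = daf.read_csv(filepath, dtype=str, encoding='utf-8-sig')

# single_file=True should produce a single CSV rather than
# my_big_file.csv/0.part, 1.part, ... per partition (my assumption).
ddf.to_csv('/Users/coolboy/Customer Data/my_big_file.csv',
           index=False, single_file=True)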