
I'm new to Dask. My motivation was to read large CSV files faster by parallelizing the process. After reading a file, I use compute() to merge the parts into a single pandas DataFrame. Then, when I write it out with pandas to_csv, the output CSV file isn't readable:

$ file -I *.csv
my_big_file.csv:               ERROR: cannot read `my_big_file.csv' (Operation canceled)
$ head -n2 my_big_file.csv
head: Error reading my_big_file.csv

The original code looks like the following:

import pandas as pd
import dask.dataframe as daf

filepath = '/Users/coolboy/Customer Data/my_original_file.csv'
df = daf.read_csv(filepath, dtype=str, low_memory=False, encoding='utf-8-sig',
                  error_bad_lines=False).compute()
print('done reading')
df.to_csv('/Users/coolboy/Customer Data/my_big_file.csv', index=False)
goidelg
  • FWIW, if you call `.compute()` the result is a pandas.DataFrame, not a dask DataFrame, so dask isn't the one writing the CSV (see the sketch after these comments). Any chance you could create a [mre]? I have to say I'm a bit skeptical of this one... `pd.to_csv` is a pretty well-tested function. Maybe you have a permissions or storage error or something? – Michael Delgado May 30 '22 at 17:38
  • Since you will be concatenating the bits into a single in-memory df, `dd.read_csv(..).compute()` will not be any faster than pandas alone. – mdurant May 30 '22 at 17:51
  • @mdurant wouldn't the reading process be faster? Meaning, loading into memory – goidelg May 30 '22 at 18:44
  • No, you only have one disk, and its IO is the bottleneck. – mdurant May 30 '22 at 19:29
  • @mdurant really? I would have thought parsing CSVs adds significant overhead. `dd.read_csv(...).compute()` is definitely faster for me compared with `pd.read_csv(...)` for large CSV files. – Michael Delgado May 30 '22 at 20:58
  • Anyway, @goidelg it would be good to get more information about the problem. Are you doing additional processing beyond your code here? As it currently stands, you're not doing anything except inefficiently copying the file, and if you have any non-string values in the file, you're also encoding everything as a string (so e.g. the value 1.234 would be re-written in `my_big_file.csv` as "1.234"). And it's certainly not true that pd.DataFrame.to_csv produces unreadable files. – Michael Delgado May 30 '22 at 21:09
  • Update: it ended up being a poor-connection issue, where the files (on Box) became inaccessible. Thanks! – goidelg Jun 02 '22 at 17:37
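
Illustrating the point raised in the comments above, here is a minimal, self-contained sketch (using a made-up two-row CSV, not the asker's data) showing that the object returned by `.compute()` is a plain pandas.DataFrame, and that `dtype=str` turns every value into a string before it is written back out:

import pandas as pd
import dask.dataframe as dd

# Build a tiny example CSV purely for illustration.
pd.DataFrame({"id": [1, 2], "amount": [1.234, 5.678]}).to_csv("example.csv", index=False)

ddf = dd.read_csv("example.csv", dtype=str)  # lazy, partitioned dask DataFrame
df = ddf.compute()                           # concatenates the partitions into pandas

print(type(ddf))                    # a dask DataFrame (exact class depends on the dask version)
print(type(df))                     # <class 'pandas.core.frame.DataFrame'>
print(repr(df["amount"].iloc[0]))   # '1.234' -- a string, because of dtype=str
df.to_csv("copy.csv", index=False)  # plain pandas writes the output file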

1 Answer


The original motivation is to read the data into memory faster. Dask is a plausible solution, but if the intention is just to bring the data into memory, there are other alternatives as well. For example, modin follows the pandas API and can reduce read time roughly in proportion to the number of cores (see the docs). The code would look roughly like this:

import modin.pandas as pd

df_in = pd.read_csv(path_in, **options)
# ... potentially some additional logic that produces df_out from df_in
df_out.to_csv(path_out, **other_options)

If speed/memory efficiency is the primary concern and no data transformation is happening, then the best alternative is to copy the file directly: with shell commands, or from Python with pathlib or, if remote data is involved, fsspec. A rough sketch is below.
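
A rough sketch of those options (the local paths are the ones from the question; the s3:// URL is purely a hypothetical placeholder for wherever the remote data actually lives):

from pathlib import Path
import shutil
import fsspec

src = Path('/Users/coolboy/Customer Data/my_original_file.csv')
dst = Path('/Users/coolboy/Customer Data/my_big_file.csv')

# Local copy without ever parsing the CSV.
dst.write_bytes(src.read_bytes())  # pathlib one-liner (loads the whole file into memory)
# shutil.copyfile(src, dst)        # equivalent, but streams the bytes instead

# Remote data: fsspec provides file-like objects for many protocols (s3, gcs, http, ...),
# so the same byte-for-byte copy works without pandas or dask.
with fsspec.open('s3://some-bucket/my_original_file.csv', 'rb') as fin, open(dst, 'wb') as fout:
    shutil.copyfileobj(fin, fout)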

SultanOrazbayev