Writing Dask partitions into a single file

Question

I'm new to Dask. I have a 1 GB CSV file; when I read it into a Dask dataframe, it creates around 50 partitions. After I make my changes and write the result, Dask creates as many files as there are partitions.
Is there a way to write all partitions to a single CSV file, and is there a way to access the individual partitions?
Thank you.

score 42 · Accepted Answer · edited Oct 23 '19 at 14:25

Short answer

No, by default dask.dataframe.to_csv writes one CSV file per partition. However, there are ways around this.

Concatenate Afterwards

Perhaps just concatenate the files after dask.dataframe writes them? This is likely to be near-optimal in terms of performance.

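from glob import glob

# Write one CSV per partition; the * is replaced by the partition number.
df.to_csv('/path/to/myfiles.*.csv')

# Sort so the parts are concatenated in a stable order.
filenames = sorted(glob('/path/to/myfiles.*.csv'))
with open('outfile.csv', 'w') as out:
    for i, fn in enumerate(filenames):
        with open(fn) as f:
            if i > 0:
                next(f, None)  # skip the header row repeated in every part
            out.write(f.read())  # each part already ends with a newline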

Or use Dask.delayed

Alternatively, you can do this yourself by using dask.delayed alongside dataframes.

This gives you a list of delayed values that you can use however you like:

list_of_delayed_values = df.to_delayed()

It's then up to you to structure a computation to write these partitions sequentially to a single file. This isn't hard to do, but can cause a bit of backup on the scheduler.
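
One way to structure that computation is sketched below. It is a minimal sketch, not an official Dask recipe: the helper name write_partitions_to_single_csv and the paths input.csv and outfile.csv are hypothetical, and passing index=False is an assumption (drop it if you want the index written). Each delayed partition is computed in turn and appended to the same open file, so only one partition is held in memory at a time.

```python
def write_partitions_to_single_csv(delayed_partitions, out_path):
    """Append partitions to one CSV file, one at a time, in order.

    `delayed_partitions` is assumed to be the list returned by
    df.to_delayed(); each item yields a pandas DataFrame on .compute().
    """
    with open(out_path, "w", newline="") as out:
        for i, part in enumerate(delayed_partitions):
            # Materialize just this partition as a pandas DataFrame.
            frame = part.compute()
            # Write the header row for the first partition only.
            # index=False is an assumption made for this sketch.
            frame.to_csv(out, header=(i == 0), index=False)


def example():
    # Not called automatically; requires dask to be installed.
    import dask.dataframe as dd

    df = dd.read_csv("input.csv")  # hypothetical input path
    write_partitions_to_single_csv(df.to_delayed(), "outfile.csv")
```

Because every partition is computed by a separate call, any work shared between partitions may be repeated, and the write itself is sequential rather than parallel.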

Edit 1: (On October 23, 2019)

As of Dask 2.6.x, to_csv accepts a single_file parameter. It defaults to False; set it to True to get a single output file without calling df.compute().

For example:

df.to_csv('/path/to/myfiles.csv', single_file=True)

Reference: Documentation for to_csv

Thank you for your reply. Is there going to be an option in future releases to do this directly? — rey, Sep 20 '16 at 04:54
Another quick question: if I call compute after everything, it converts to a pandas dataframe, so does it load the data into memory? — rey, Sep 20 '16 at 07:07
If you call `.compute()` on the dask.dataframe then you'll get a single pandas dataframe. If you use dask.delayed then everything will be lazy. — MRocklin, Sep 20 '16 at 11:50

score 5 · Answer 2 · answered Sep 05 '19 at 20:24

You can convert your Dask dataframe to a pandas dataframe with compute() and then call to_csv on it. Note that this loads the whole dataset into memory. Something like this:

df_dask.compute().to_csv('csv_path_file.csv')

Fernando Siqueira

I like simple, intuitive, practical and clean code. :-) – MGB.py Dec 17 '19 at 12:11

But in this case you can just use pandas as df has to fit in memory. – rpanai Feb 04 '20 at 15:52
