When using `dask.to_parquet(df, filename)`, a subfolder `filename` is created and several files are written to that folder, whereas `pandas.to_parquet(df, filename)` writes exactly one file.

Can I use Dask's `to_parquet` (without using `compute()` to create a pandas DataFrame) to just write a single file?
7

Christian
- Have you considered `.tar`-ing them up afterwards? – afaulconbridge Nov 15 '20 at 18:44
2 Answers
4
There are reasons to have multiple files (in particular, a single big file may not fit in memory), but if you really need only one, you could try this:
import dask.dataframe as dd
import pandas as pd
import numpy as np

# toy data; string column names avoid trouble with some parquet engines
df = pd.DataFrame(np.random.randn(1_000, 5), columns=list("abcde"))
df = dd.from_pandas(df, npartitions=4)
# a single partition yields a single part file inside the "data" folder
df.repartition(npartitions=1).to_parquet("data")

rpanai
- This does somehow produce a file, but still it's creating a subfolder and not just a single file like `pandas.to_parquet` would do. – Christian Apr 08 '20 at 21:36
- If it helps a little, the `.to_parquet` method in Dask 2.16.0 supports the parameter `write_metadata_file`. With `write_metadata_file=False`, `.to_parquet(...)` only produces `data/*.parquet` (without the two metadata files). – edesz May 13 '20 at 15:30
- What is the consequence of not writing the metadata file, in terms of reading the files later down the line? – AmyChodorowski Jan 19 '21 at 17:23
- @AmyChodorowski I didn't find any particular difference, but maybe you should open a question so a developer could answer it. – rpanai Jan 19 '21 at 18:56
- @AmyChodorowski It still writes the parquet file in a sub-folder, but without the metadata files. The single parquet file that was written looks good, but the naming convention is "part.0.parquet". – Nguai al Jul 01 '22 at 14:17
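
Building on the comments above, here is a minimal sketch (not from the answer itself) that combines the repartition approach with `write_metadata_file=False`; the `name_function` argument, available in newer Dask releases, is assumed here to control the part file's name, and all paths are placeholders.

import dask.dataframe as dd
import pandas as pd
import numpy as np

df = dd.from_pandas(pd.DataFrame(np.random.randn(1_000, 5), columns=list("abcde")),
                    npartitions=4)

# one partition -> one part file, and no _metadata/_common_metadata files
df.repartition(npartitions=1).to_parquet(
    "data",                                 # still a directory, e.g. data/out.parquet
    write_metadata_file=False,
    name_function=lambda i: "out.parquet",  # replaces the default "part.0.parquet"
)

The result is still a directory containing one data file; if a bare file path is required, the single part file has to be moved or renamed afterwards.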
2
Writing to a single file is very hard within a parallel processing system. Sorry, such an option is not offered by Dask (nor, probably, by any other parallel processing library).
You could in theory perform the operation with a non-trivial amount of work on your part: you would need to iterate through the partitions of your dataframe, write to the target file (which you keep open), and accumulate the output row groups into the final metadata footer of the file. I would know how to go about this with fastparquet, but that library is no longer under much active development.
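
To illustrate the idea (a rough sequential sketch, not the fastparquet approach the answer refers to): pyarrow's `ParquetWriter` can be kept open while each Dask partition is computed one at a time and appended to the same file as row groups; closing the writer finalises the metadata footer. The output file name is a placeholder.

import dask.dataframe as dd
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

ddf = dd.from_pandas(pd.DataFrame(np.random.randn(1_000, 5), columns=list("abcde")),
                     npartitions=4)

writer = None
try:
    for delayed_part in ddf.to_delayed():      # one delayed pandas DataFrame per partition
        pdf = delayed_part.compute()           # materialise a single partition at a time
        table = pa.Table.from_pandas(pdf, preserve_index=False)
        if writer is None:
            # open the target file once, using the schema of the first partition
            writer = pq.ParquetWriter("single_file.parquet", table.schema)
        writer.write_table(table)              # each partition becomes its own row group(s)
finally:
    if writer is not None:
        writer.close()                         # writes the final metadata footer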

mdurant
- For parallelism, writing the columns in separate threads could, at least in theory, also be an option. But I do understand that Dask is just not made for this kind of parallelism. – Christian Apr 08 '20 at 21:35
- Even then, each portion wouldn't know where in the file to seek to before starting to write, until the earlier chunks were done. – mdurant Apr 09 '20 at 13:21
- @mdurant Can you please explain why `dd.to_parquet(filepath, partition_on="user")` produces parquet files with fairly similar sizes? In my dask dataframe, every partition is a parquet file and I am attempting to group by **user** and save each user's data separately using `partition_on`. Now some users may have only 2 rows in the entire dask dataframe while other users have thousands of rows. How does `dd.to_parquet` work in this case, since the part.parquet files for all users are relatively equal in size? I am having an issue where the saved data is incomplete. – Pleastry Aug 25 '20 at 12:46
- If you are losing data, you should post code showing the process as a bug issue. When you `partition_on`, you get (up to) one file for each input partition × each unique value; there is no data grouping or shuffling. – mdurant Aug 25 '20 at 13:11
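
To make that last point concrete, a small hypothetical example (directory name and data are made up): with two input partitions and a `user` column holding three distinct values, `partition_on` produces one `user=<value>/` directory per value, each containing up to one part file per input partition.

import dask.dataframe as dd
import pandas as pd

# hypothetical toy data: 2 input partitions, 3 distinct users
pdf = pd.DataFrame({"user": ["a", "a", "b", "c", "b", "a"], "value": range(6)})
ddf = dd.from_pandas(pdf, npartitions=2)

# writes by_user/user=a/, by_user/user=b/ and by_user/user=c/, each holding
# up to one part file per input partition (no grouping or shuffling happens)
ddf.to_parquet("by_user", partition_on="user", write_metadata_file=False)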