
How to split a csv file into multiple files using Dask?

The below code seems to write to one file only, which takes a long time to write the full thing. I believe writing to multiple files will be faster.

import dask
import dask.dataframe as ddf

file_path = "file_name.csv"
df = ddf.read_csv(file_path)
# write one CSV per partition, deferred until dask.compute
futs = df.to_csv("*.csv", compute=False)
_, l = dask.compute(futs, df.size)
    Have you considered splitting it using [bash](https://stackoverflow.com/a/48154590/4819376)? It might be faster. – rpanai Apr 30 '19 at 17:00

1 Answer


I suspect that when you read the file, df.npartitions is just 1, so Dask writes a single output file.

import dask.dataframe as dd

file_path = "file_name.csv"
df = dd.read_csv(file_path)
# set how many files you would like to have
# (one output file per partition), in this case 10
df = df.repartition(npartitions=10)
df.to_csv("file_*.csv")

But as far as I can see, it's not faster.
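
If you want multiple files anyway, a minimal sketch is to let read_csv create the partitions up front via its blocksize argument instead of repartitioning afterwards (the "25MB" value below is just an illustration, not a recommendation):

import dask.dataframe as dd

file_path = "file_name.csv"
# blocksize controls how the input is chunked; each chunk becomes
# one partition and therefore one output file
df = dd.read_csv(file_path, blocksize="25MB")
print(df.npartitions)   # number of output files to expect
df.to_csv("file_*.csv")

The total I/O is the same either way, so this mainly changes how the work is split, not how long it takes overall.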
