I had a similar problem and found that using dask to split the CSVs into smaller parquet files is very slow and eventually fails. If you have access to a Linux terminal you can use parallel or split instead. For an example of their usage, check the answers here
My workflow assumes your files are called file1.csv, ..., file7.csv and are stored in data/raw. I'm also assuming you run the terminal commands from your notebook, which is why I add the %%bash magic.
- create the folders data/raw_parts/part1/, ..., data/raw_parts/part7/
```
%%bash
for i in {1..7}
do
    mkdir -p data/raw_parts/part${i}
done
```
- for each file run the following (in case you want to use parallel; a pure-pandas alternative is sketched after this command)
```
%%bash
cat data/raw/file1.csv | parallel --header : --pipe -N1000000 'cat >data/raw_parts/part1/file_{#}.csv'
```
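If GNU parallel is not available, a rough pure-pandas alternative is to stream the file in chunks. This is my own sketch, not part of the original commands; the chunk size, paths and the split_csv name are just placeholders:
```
import os
import pandas as pd

def split_csv(fn_in, fldr_out, chunksize=1_000_000):
    # read the big CSV lazily and write each chunk as its own numbered file
    os.makedirs(fldr_out, exist_ok=True)
    for i, chunk in enumerate(pd.read_csv(fn_in, chunksize=chunksize), start=1):
        chunk.to_csv(os.path.join(fldr_out, f"file_{i}.csv"), index=False)

# split_csv("data/raw/file1.csv", "data/raw_parts/part1")
```
It is slower than parallel but avoids any extra dependencies.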
Convert files to parquet
- first create the output folders
```
%%bash
for i in {1..7}
do
    mkdir -p data/processed/part${i}
done
```
- define a function to convert CSV to parquet
```
import pandas as pd
import os
from dask import delayed, compute

# this can run in parallel thanks to dask.delayed
@delayed
def convert2parquet(fn, fldr_in, fldr_out):
    # mirror the input path under the output folder and change the extension
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".csv", ".parquet")
    df = pd.read_csv(fn)
    df.to_parquet(fn_out, index=False)
```
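For clarity, calling the decorated function does not execute anything yet; it only records a lazy task that compute will run later (the file path below is just an example):
```
# builds a Delayed object, nothing is read or written yet
task = convert2parquet("data/raw_parts/part1/file_1.csv",
                       "data/raw_parts/", "data/processed/")
# task.compute() would run this single conversion immediately
```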
- get all files you want to convert
```
jobs = []
fldr_in = "data/raw_parts/"
for (dirpath, dirnames, filenames) in os.walk(fldr_in):
    if len(filenames) > 0:
        jobs += [os.path.join(dirpath, fn)
                 for fn in filenames]
```
- process all in parallel
```
%%time
fldr_out = "data/processed/"
to_process = [convert2parquet(job, fldr_in, fldr_out) for job in jobs]
out = compute(to_process)
```
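dask.compute on delayed objects uses the threaded scheduler by default; if the conversion turns out to be CPU-bound you can try the process-based scheduler instead (the scheduler name and worker count below are just an illustration, not part of the original run):
```
# same jobs, but executed in separate worker processes
out = compute(to_process, scheduler="processes", num_workers=4)
```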