
(An adaptation of David Erickson's question here)

Given a CSV file with columns A, B, and C and some values:

echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv

We would like to sort by columns a and b:

sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
>a,b,c
3,50240,18792
7,54871,39438

And then for every unique pair (a, b) create a new CSV titled '{a}_Invoice_{b}.csv'.

The main challenge seems to be the I/O overhead of writing thousands of files - I started trying with awk but ran into `awk: 17 makes too many open files`.

Is there a quicker way to do this, in awk, Python, or some other scripting language?

Additional info:

  • I know I can do this in Pandas - I'm looking for a faster way using text processing
  • Though I used urandom to generate the sample data, the real data has runs of recurring values: for example, a few consecutive rows where a=3, b=7. In that case, those rows should be saved to a single file. (The idea is to replicate Pandas' groupby -> to_csv.)
  • You could try loading your original file with pandas, sorting it by (a, b), and storing it from there. See [python-splitting-dataframe-into-multiple-dataframes-based-on-column-values-and](https://stackoverflow.com/questions/40498463/python-splitting-dataframe-into-multiple-dataframes-based-on-column-values-and). As is, your question is off-topic and uses multiple languages. – Patrick Artner Apr 12 '20 at 12:05
  • If you only care about the first two columns, why not remove column 3 prior to sorting and `.csv` file creation? – rtx13 Apr 12 '20 at 12:06
  • and [how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns](https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns) – Patrick Artner Apr 12 '20 at 12:06
  • @rtx13 the data in this example for Column C would still be relevant. It just wouldn't be part of the dynamic filename based off Columns A and B. From the original linked question, I was finding that my method in python was generating .csv files somewhat slowly, and I was wondering if there was a faster way in python. JoshFriedlander thought this might be something that awk would be very fast at. – David Erickson Apr 12 '20 at 12:14

2 Answers


In Python:

import pandas as pd

df = pd.read_csv("file.csv")
for (a, b), gb in df.groupby(['a', 'b']):
    gb.to_csv(f"{a}_Invoice_{b}.csv", header=True, index=False)

In awk you can split like so; note that you will need to put the header back on each resulting file:

awk -F',' '{ out=$1"_Invoice_"$2".csv"; print >> out; close(out) }' file.csv

With the header line added back:

awk -F',' 'NR==1 { hdr=$0; next } { out=$1"_Invoice_"$2".csv"; if (!seen[out]++) {print hdr > out} print >> out; close(out); }' file.csv

The benefit of this last example is that the input file.csv doesn't need to be sorted and is processed in a single pass.
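For comparison, the same one-pass, no-sort approach can be sketched in plain Python with the `csv` module: append to each group's file and close it immediately, so the open-file limit is never hit. (A rough sketch, not tuned for speed; the helper name `split_by_invoice` and the `outdir` parameter are illustrative.)

```python
import csv
import os

def split_by_invoice(path, outdir="."):
    """Split a CSV into one file per (a, b) pair, named '{a}_Invoice_{b}.csv'.

    Mirrors the awk one-liner above: each row is appended to its group's
    file, which is closed immediately, so the input needs no sorting and
    only one output file is ever open at a time.
    """
    seen = set()  # output files that already have their header written
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            out = os.path.join(outdir, f"{row[0]}_Invoice_{row[1]}.csv")
            first = out not in seen
            with open(out, "w" if first else "a", newline="") as g:
                w = csv.writer(g)
                if first:
                    w.writerow(header)
                    seen.add(out)
                w.writerow(row)
```

The repeated open/close per row is the price of staying under the file-descriptor limit without sorting; it is likely slower than the sorted awk approach below.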

  • Yup, the Python one was the approach in the original question. With awk, it looks good - but will it work if there are multiple of the same value? For example, several consecutive rows where a and b are the same, desired output is a single file containing those lines. – Josh Friedlander Apr 12 '20 at 12:21
  • Yes, the awk will append to existing files. You can test this by duplicating the example input. – Alex Apr 12 '20 at 12:22
  • You don't need a separate shell loop just to print a header line to output files, a simple `!seen[out]++{print hdr > out}` inside your awk script will do it. You do still need to update your existing script to either skip or otherwise handle that first line though, e.g. `NR==1{hdr=$0;next}`. – Ed Morton Apr 12 '20 at 12:47

Since your input is to be sorted on the key fields, all you need is:

sort -t ',' -k1,1n -k2,2n file.csv |
awk -F ',' '
NR==1 { hdr=$0; next }
{ out = $1 "_Invoice_" $2 ".csv" }
out != prev {
    close(prev)
    print hdr > out
    prev = out
}
{ print > out }
'
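The same sorted-stream idea can be sketched in plain Python with `itertools.groupby`, which likewise opens each output file exactly once, at the cost of requiring pre-sorted input. (A sketch for comparison; the helper name `split_sorted` and the `outdir` parameter are illustrative.)

```python
import csv
import os
from itertools import groupby

def split_sorted(path, outdir="."):
    """Split a CSV already sorted on columns a and b into per-(a, b) files.

    Because consecutive rows with the same key are guaranteed to be
    adjacent, groupby yields each (a, b) run exactly once, so each
    output file is written in a single open/close - the same strategy
    as the awk script above.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for (a, b), rows in groupby(reader, key=lambda r: (r[0], r[1])):
            out = os.path.join(outdir, f"{a}_Invoice_{b}.csv")
            with open(out, "w", newline="") as g:
                w = csv.writer(g)
                w.writerow(header)
                w.writerows(rows)
```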
  • 1
    This works really well, thank you! Would you expect it to be faster than the equivalent in Python? – Josh Friedlander Apr 12 '20 at 13:59
  • 3
    You're welcome. Yes but not as robust because it's not doing as much work as Python to verify the CSV format so if your first field could be quoted and contain a comma, for example, then I expect a Python script using a CSV reader would handle that correctly while the awk script I posted wouldn't. See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) for what it takes to handle CSVs robustly with awk. – Ed Morton Apr 12 '20 at 14:06