(An adaptation of David Erickson's question here)
Given a CSV file with columns a, b, and c and some values:
echo 'a,b,c' > file.csv
head -c 10000000 /dev/urandom | od -d | awk 'BEGIN{OFS = ","}{print $2, $3, $4}' | head -n 10000 >> file.csv
We would like to sort by columns a and b:
sort -t ',' -k1,1n -k2,2n file.csv > file_.csv
head -n 3 file_.csv
a,b,c
3,50240,18792
7,54871,39438
And then, for every unique pair (a, b), create a new CSV titled '{a}_Invoice_{b}.csv'.
The main challenge seems to be the I/O overhead of writing thousands of files. I started with awk but ran into awk: 17 makes too many open files.
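A minimal sketch of that kind of awk approach, to show where the limit comes from (illustrative, not my exact command): every distinct (a, b) pair opens a new output file that is never closed, so awk eventually exhausts the per-process file-descriptor limit.
# naive split: one redirection per (a, b) pair, nothing ever closed
awk -F ',' 'NR > 1 { print > ($1 "_Invoice_" $2 ".csv") }' file_.csv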
Is there a quicker way to do this, in awk, Python, or some other scripting language?
Additional info:
- I know I can do this in Pandas - I'm looking for a faster way using text processing
- Though I used urandom to generate the sample data, the real data has runs of recurring values: for example, several consecutive rows where a=3, b=7. All rows sharing the same (a, b) pair should end up in one file. (The idea is to replicate Pandas' groupby -> to_csv, sketched below.)
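For reference, the Pandas baseline I want to beat looks roughly like this (a sketch; it assumes the column names and output pattern from the example above):
import pandas as pd

# baseline: group on (a, b) and write one CSV per group
df = pd.read_csv("file.csv")
for (a, b), group in df.groupby(["a", "b"]):
    group.to_csv(f"{a}_Invoice_{b}.csv", index=False)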