
I have been searching for a method to do this efficiently for a while now and can't come up with the best solution.

The requirement is simple. I have a file of the following format.

$ cat mymainfile
rec1,345,field3,....field20
rec1,645,field3,....field20
rec12,345,field3,....field20
frec23,45,field3,....field20
rec34,645,field3,....field20

At the end of the split operation, I want to have multiple separate files with these names

$ cat some_prefix_345_some_suffix_date
rec1,345,field3,....field20
rec12,345,field3,....field20

$ cat some_prefix_645_some_suffix_date
rec1,645,field3,....field20
rec34,645,field3,....field20

$ cat some_prefix_45_some_suffix_date
frec23,45,field3,....field20

I thought of using grep, but that would mean first finding the unique ids and then grepping for each one, since we don't know which ids (345, 645, etc.) are in the file before reading mymainfile.
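
For the record, that two-pass approach would look roughly like this (a sketch only, assuming the id is always the second comma-separated field):

for id in $(cut -d, -f2 mymainfile | sort -u); do
    grep "^[^,]*,${id}," mymainfile > "some_prefix_${id}_some_suffix_date"
done

but it rereads mymainfile once per id, which is exactly the kind of overhead I want to avoid.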

Then I thought of csplit, e.g. as in Split one file into multiple files based on delimiter, but it splits on a delimiter, not on the value of a specific column.

When it comes to bash scripting, I know I can read the file line by line with a while loop and split each line, but I don't know whether that would be efficient.
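
Something like this, for example (a minimal sketch of that loop, assuming the id is field 2; note it appends, so files left over from a previous run would have to be removed first):

while IFS=, read -r first id rest; do
    printf '%s,%s,%s\n' "$first" "$id" "$rest" >> "some_prefix_${id}_some_suffix_date"
done < mymainfile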

I also thought of awk solutions like awk '$2 == ? { etc., but I don't know how to generate the different filenames. I could do it programmatically in Python, but I'd prefer a single command line, and I know it's possible. I'm tired of searching and still can't figure out the best approach for this. Any suggestions / best approach would be much appreciated.


2 Answers


Within awk you can redirect the output of each line to a different file whose name you build dynamically (based on $2 in this case):

$ awk -F, '{print > ("some_prefix_" $2 "_some_suffix_date")}' file

$ ls *_date
some_prefix_345_some_suffix_date    some_prefix_45_some_suffix_date     some_prefix_645_some_suffix_date

$ cat some_prefix_345_some_suffix_date 
rec1,345,field3,....field20
rec12,345,field3,....field20

$ cat some_prefix_645_some_suffix_date 
rec1,645,field3,....field20
rec34,645,field3,....field20

$ cat some_prefix_45_some_suffix_date 
frec23,45,field3,....field20

As pointed out in the comments, if you have many different values of $2 and you hit a "too many open files" error, you can close each file as you go:

$ awk -F, '{fname = "some_prefix_" $2 "_some_suffix_date"
            if (a[fname]++) print >> fname; else print > fname
            close(fname)}' file
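
The a[fname]++ bookkeeping matters here: since the file is closed after every line, reopening it with > would truncate it, so > is used only on the first write to each file and >> afterwards.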
  • Thanks. I'll try and test this. Will accept it unless there are any other alternative answers/ approaches to match it. – itguy Nov 16 '18 at 19:43
  • Thanks again. My main file may contain not more than 20-30 such ids, but the rows may be in 100s of thousands, so closing won't be needed in my case. – itguy Nov 17 '18 at 04:37

It might be slower than awk, but I would start with

cat mymainfile |  cut -d, -f2 | sort -u

to get the distinct second-column values. Then loop over each value with egrep, and use GNU parallel to speed it up:

cat mymainfile | cut -d, -f2 | sort -u | parallel 'egrep "^[^,]+,{}," mymainfile > some_prefix_{}_some_suffix_date'

{} is expanded by parallel to each distinct value. The regex ^[^,]+,{}, is anchored at the start of the line, so it should only match lines whose second column is that value.

Because of these two passes, and the wish to work with a continuously growing file, an alternative is:

cat mymainfile | parallel 'echo {} >> some_prefix_$(echo {} | cut -d\, -f2)_some_suffix_date'

Unfortunately this invokes a subshell per line, which makes it slower. Just give it a try.
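
If the file really is continuously growing, another option is to combine tail -f with the awk redirection from the other answer, which avoids the per-line subshell (a hypothetical sketch, untested on a live file):

tail -f mymainfile | awk -F, '{fname = "some_prefix_" $2 "_some_suffix_date"
                               print >> fname; close(fname)}'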

  • Thanks for your answer. I thought of this option but cutting and sorting may add additional overhead as the size of the main file increases. – itguy Nov 17 '18 at 04:40