
I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers; the one by user @anubhava was the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.

I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.

My input file dataset.csv is something like this:

action,action_type, Result
up,1,stringA
down,1,stringB
left,2,stringC
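
So, for the sample above, the desired result is one file per action_type value, each containing only the matching rows, e.g. (file names follow the $2 "_dataset.csv" pattern from the command below):

1_dataset.csv:
up,1,stringA
down,1,stringB

2_dataset.csv:
left,2,stringC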

I am using the following to get the results I want (again, this is thanks to @anubhava).

awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file

This works as expected. But I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.

I am running this on Windows 10 with an AMD Ryzen PRO 3500 (200MHz, 4 cores, 8 logical processors), 16GB of memory, and an SSD drive. I am using GNU Awk 5.1.0, API 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 terminal.

I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.

As always, I appreciate any advice.

Inian
Wilmar
  • I don't think `gawk` has a limit on the number of open fd-s. Try removing the `close(fn)`. Or sort the file based on the 2nd field and `close(fn)` only when the value changes... – vgersh99 Oct 21 '20 at 13:16
  • Thanks @vgersh99, I am trying it. I am sorting the file first. You are right, that will probably make it faster. Also, if the file is being closed after each match, it probably has to go through the whole thing every time. Thanks, trying it now. – Wilmar Oct 21 '20 at 13:34
  • You should try Ed Morton's solution from your other question... – dawg Oct 21 '20 at 13:40
  • I can actually report results now. I did as @vgersh99 suggested, sorted the file with `sort -t, -nk2 dataset.csv > sorted_dataset.csv` and then `awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' sorted_dataset.csv`, and it was lightning fast (20 seconds to sort and around 10 seconds for awk). Then I tried without sorting and it was all around 25 seconds! What an improvement! Thank you. Please post it as an answer so I can mark it as correct. – Wilmar Oct 21 '20 at 13:42
  • @Wilmar, you're still closing the file for each line/record. `awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)` – vgersh99 Oct 21 '20 at 13:53

2 Answers


With sorting:

awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)

Or with gawk (effectively unlimited number of open fd-s):

gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
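
One caveat: both variants append with >>, so output files left over from an earlier or partial run will have the new rows appended to them. If that's a concern, something like the following cleanup first would work (assuming the per-category files are named with the numeric action_type codes as in the sample, so the glob doesn't touch dataset.csv or sorted_dataset.csv):

rm -f [0-9]*_dataset.csv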
vgersh99

This is the right way to do it using any awk:

$ tail -n +2 file | sort -t, -k2,2n |
    awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'

The reason I say this is the right approach is that it:

  • doesn't rely on the 2nd field of the header line sorting before the data values,
  • doesn't require awk to test NR > 1 for every line of input,
  • doesn't need an array to store $2s or any other values,
  • only keeps 1 output file open at a time (the more files open at once, the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks, as gawk then has to start opening/closing the files in the background as needed),
  • doesn't require you to empty existing output files before you run it (print > out truncates each output file the first time it is opened), and
  • only does the string concatenation to create the output file name once per output file, not once per line.

Just like the currently accepted answer, the sort above could reorder input lines that have the same $2 value. Add -s if that's undesirable and you have GNU sort; with other sorts you'd need to replace the tail with a different awk command and add another sort argument.
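
For illustration, a sketch of that idea (the exact commands here are illustrative, not taken from the answer above): prefix each data line with its input line number, use that as the tie-breaking sort key, and strip it again before the splitting awk:

$ awk 'NR > 1{print NR "," $0}' file | sort -t, -k3,3n -k1,1n | sed 's/^[^,]*,//' |
    awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'

After prefixing, the line number is field 1 and the original 2nd field becomes field 3, so -k3,3n -k1,1n sorts by category first and by original input order within each category, which is what -s would have preserved.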

Ed Morton
  • Completely agree with all the points! But... 1) the output fd is `out`, therefore we need to `close(out)`, not `close($2)`; 2) we need to check that `out` is not null (first record) before closing this fd. My $.02 – vgersh99 Oct 22 '20 at 13:44
  • @vgersh99 Absolutely right, and fixed so that it's close(out), but we don't need to check it, as close() on a null string simply does nothing. – Ed Morton Oct 22 '20 at 18:53