I have a list of ids in a file and a data file (~3.2 GB), and I want to extract the lines in the data file that contain an id, and also the line that follows each match. I did the following:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

This worked, but it also matched unwanted substrings: for example, if the id is EA4, it also pulled out lines containing EA40.

So I tried the same command but added the -w (--word-regexp) flag to the first grep so that only whole words are matched. However, the command now ran for over an hour (rather than ~26 seconds) and started using tens of gigabytes of memory, so I had to kill the job.

Why did adding -w make the command so slow and memory-hungry? How can I run this command efficiently to get my desired output? Thank you

file.ids looks like this:

>EA4
>EA9

file.data looks like this:

>EA4 text
data
>E40 blah
more_data
>EA9 text_again
data_here

output.data would look like this:

>EA4 text
data
>EA9 text_again
data_here
Chris_Rands
  • `awk` could be faster depending upon your input file, if you haven't tried it yet. If you can provide a sample of both your files, the logic in `awk` could be provided. – Inian Oct 06 '16 at 10:40
  • @Inian thanks, i've added some sample inputs and outputs – Chris_Rands Oct 06 '16 at 10:46
  • can you also add why `grep -v "^-"` is needed? – Sundeep Oct 06 '16 at 10:53
  • @Sundeep because grep -A1 outputs -- lines between matches – Chris_Rands Oct 06 '16 at 11:07
  • also, if you are using GNU grep, you can use `grep --no-group-separator -A1 -wFf file.ids file.data > output.data` – Sundeep Oct 06 '16 at 11:16
  • @Chris_Rands: You can provide a running time analysis for the `awk` solution in my answer, which I will be happy to remove if it does not improve performance. – Inian Oct 06 '16 at 11:26
  • @Sundeep thanks but your `awk` solution returned a blank file to me – Chris_Rands Oct 06 '16 at 11:33
  • @Chris_Rands not sure why, I tested it for your input samples and it worked.. anyway, try the GNU grep one I posted if you have that option – Sundeep Oct 06 '16 at 11:39
  • @Sundeep I apologise, the `file.ids` includes a `>` (see the edited input format), but it didn't work either way for me unfortunately – Chris_Rands Oct 06 '16 at 11:41
  • oh, this one `awk 'NR==FNR{a[$1]++; next} a[$1]{c=2} c&&c--' file.ids file.data` works for me with modified input files.... – Sundeep Oct 06 '16 at 11:42
  • @Sundeep Thank you, this does appear to be working as expected! If you can write it up as an answer (with a little explanation) I will gladly accept. Also, do you know what was happening when I added the `-w` flag to grep? – Chris_Rands Oct 06 '16 at 11:46
  • @Chris_Rands: Can you share the performance difference between mine and Sundeep's solution for an analysis. – Inian Oct 06 '16 at 11:48
  • I don't know why `-w` would make it so much slower (but that flag is really needed to get correct output as per your question details), so good idea to wait for someone to address that issue... and as Inian suggested, you need to give perf comparison between the two answers, my answer was simply built on this classic Q&A - http://stackoverflow.com/questions/17908555/printing-with-sed-or-awk-a-line-following-a-matching-pattern – Sundeep Oct 06 '16 at 11:51
  • what do you get for `grep --version` ? (and what OS?) . Good luck. – shellter Oct 06 '16 at 13:34
  • @shellter grep (GNU grep) 2.25 on Arch Linux – Chris_Rands Oct 06 '16 at 13:38
  • and what is the result of `awk '{l=length;if (l>max)max=l}END{print NR "\t" max}' file` (for each file). Good luck. – shellter Oct 06 '16 at 13:47
  • incidentally, I don't think older Unix based `grep` (well `fgrep` actually) would have accepted a `-w` option. If you really want to know, you may have to post it to GNU as a bug or spend time plowing thru the source. Good luck! – shellter Oct 06 '16 at 13:50
  • @shellter thanks for your interest. that is line count and max line length? `294670 54` for `file.ids` and `19757294 29409` for `file.data` – Chris_Rands Oct 06 '16 at 13:53
  • the `-w` will scan the whole line looking for a match. (Complete mystery as to why it is hogging memory). It's obvious that `break` in the deleted `awk` answer short-circuits checking the whole line. Less obvious to me how @Sundeep's version manages that ;-/ ? Sundeep, post that answer with a mini-explanation and I'll vote for it. Inian, undelete your msg and I'll vote for it because of its research value. Good luck to all. – shellter Oct 06 '16 at 13:59
  • how about if you change EA0 in your file.ids file for ^EA0\b (ditto every line)? This in particular allows it to exit each line as soon as it has passed the nth char. It also allows you to drop the | grep -v "^-" (a rough sketch of this idea is shown below). – tolanj Oct 06 '16 at 14:37
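
A rough sketch of tolanj's suggestion, assuming GNU grep: turn each id line into an anchored regex, then drop -F and -w. Here file.regex is just a placeholder name for the generated pattern file, and ( |$) is used in place of \b because in file.data each id is followed by a space or the end of the line; whether this actually beats -w -F is not guaranteed.

# turn each id line like ">EA4" into an anchored regex like "^>EA4( |$)"
awk '{print "^" $0 "( |$)"}' file.ids > file.regex
# match with extended regexes instead of fixed strings; GNU grep's
# --no-group-separator suppresses the "--" lines between matches
grep --no-group-separator -A1 -Ef file.regex file.data > output.data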

1 Answer

grep -F string file is simply looking for occurrences of string in the file, but grep -w -F string file also has to check the characters immediately before and after each occurrence of string to see whether they are word characters or not. That's a lot of extra work, and one possible implementation of it would be to first separate each line into every possible non-word-character-delimited string (with overlaps, of course), which could take up a lot of memory, but I don't know if that's what's causing your memory usage or not.
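
For a single fixed string such as EA4 (taken from the question), the extra test that -w imposes is roughly equivalent to spelling out the boundary checks yourself; this is only an illustrative sketch, not the regex grep actually builds internally:

# roughly what -w has to verify: a non-word character (or a line edge) on both sides of the match
grep -E '(^|[^[:alnum:]_])EA4([^[:alnum:]_]|$)' file.data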

In any case, grep is simply the wrong tool for this job: since you only want to match against a specific field in the input file, you should be using awk instead:

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
data
>EA9 text_again
data_here

The above assumes your "data" lines cannot start with >. If they can, then tell us how to identify data lines vs id lines.
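
If the one-liner is hard to read, this is the same logic spread over several lines with comments (purely a restatement of the command above, nothing new):

awk '
    NR==FNR { ids[$0]; next }   # first file (file.ids): store every id line as an array key
    /^>/    { f = ($1 in ids) } # id line in file.data: set flag if its first field is a wanted id
    f                           # while the flag is set, print the current line
' file.ids file.data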

Note that the above will work no matter how many data lines you have between id lines, even if there are 0 or 100:

$ cat file.data
>EA4 text
>E40 blah
more_data
>EA9 text_again
data 1
data 2
data 3

$ awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f' file.ids file.data
>EA4 text
>EA9 text_again
data 1
data 2
data 3

Also, you don't need to pipe the output to grep -v:

grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data

just do it all in the one script:

awk 'NR==FNR{ids[$0];next} /^>/{f=($1 in ids)} f && !/^-/' file.ids file.data
Ed Morton
  • Thanks, this is ~2 seconds faster than Sundeep's solution for my data set (with an identical output). And yes "data" lines cannot start with `>` – Chris_Rands Oct 06 '16 at 14:47
  • Also the solution working with multiple lines between id lines will be useful in some other cases I'm considering, so thanks for that too – Chris_Rands Oct 06 '16 at 14:50
  • Thanks, in fact the second `grep` is not needed now (this was just to process out the `--` lines created between matches by `grep -A1`. As Sundeep pointed out, I should have used the `--no-group-separator` flag instead.) – Chris_Rands Oct 06 '16 at 15:04
  • Sure, if you have the time to explain quickly, I'm curious, thanks – Chris_Rands Oct 06 '16 at 15:09
  • wrt the other solution posted that started with `NR==FNR { pats[$0]=1; next } { for(p in pats) if($1 ~ p) ...` - the reason it was so slow is that for every line of file.data it loops through every value read from file.ids and stored in the array until it finds a match, doing a regexp comparison at every step to determine whether there was a match or not. The other awk solutions simply do a hash lookup in the array on the first string from the relevant line (see the sketch below). – Ed Morton Oct 06 '16 at 16:14
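
A minimal sketch of the two lookup strategies described in the comment above (only the per-line matching cost is illustrated; neither command prints the following data line):

# slow: for every line of file.data, loop over all ~295k stored patterns and do a regexp test
awk 'NR==FNR{pats[$0]; next} {for (p in pats) if ($1 ~ p) {print; break}}' file.ids file.data
# fast: for every line of file.data, a single hash lookup on the first field
awk 'NR==FNR{ids[$0]; next} $1 in ids' file.ids file.data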