I have a list of ids in a file and a data file (~3.2 GB in size), and I want to extract each line in the data file that contains one of the ids, together with the line that follows it. I did the following:
grep -A1 -Ff file.ids file.data | grep -v "^-" > output.data
This worked, but it also extracted unwanted substring matches; for example, if the id is EA4, it also pulled out the lines with EA40.
So I tried the same command with the -w (--word-regexp) flag added to the first grep to match whole words. However, the command now ran for more than an hour (rather than ~26 seconds) and started using tens of gigabytes of memory, so I had to kill the job.
Why did adding -w make the command so slow and memory-hungry? How can I run this command efficiently to get my desired output? Thank you.
file.ids looks like this:
>EA4
>EA9
file.data looks like this:
>EA4 text
data
>EA40 blah
more_data
>EA9 text_again
data_here
output.data would look like this:
>EA4 text
data
>EA9 text_again
data_here
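
To make the desired behaviour concrete, here is a minimal sketch that uses awk instead of grep -w. It is only an illustration, not a benchmarked solution, and it rests on two assumptions that are not guaranteed by the above: the id is always the first whitespace-separated field of a header line, and every header line is followed by exactly one data line. The scratch directory and the sample files are created just for the demonstration.

```shell
# Work in a scratch directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample inputs modelled on the question, including an EA40 header
# that must NOT be matched by the id EA4.
printf '%s\n' '>EA4' '>EA9' > file.ids
printf '%s\n' '>EA4 text' 'data' '>EA40 blah' 'more_data' \
              '>EA9 text_again' 'data_here' > file.data

# First file (NR == FNR): remember every id as a key of the array "ids".
# Second file: if the first field is exactly one of the stored ids
# (assumption: the id is always field 1), print the line, then read
# and print the single line that follows it.
awk 'NR == FNR { ids[$1]; next }
     $1 in ids { print; if ((getline nextline) > 0) print nextline }' \
    file.ids file.data > output.data

cat output.data
```

Because the comparison is an exact lookup of the first field rather than a pattern search, EA4 cannot match EA40 here. The ids are held in memory as array keys, so this assumes the id list itself is of a manageable size.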