I am looking for a way to filter a (~12 Gb) largefile.txt
with long strings in each line for each of the words (one per line) in a queryfile.txt
. But afterwards, instead of outputting/saving the whole line that each query word is found in, I'd like to save only that query word and a second word which I only know the start of (e.g. "ABC") and that I know for certain is in the same line the first word was found in.
For example, if queryfile.txt
has the words:
this
next
And largefile.txt
has the lines:
this is the first line with an ABCword # contents of first line will be saved
and there is an ABCword2 in this one as well # contents of 2nd line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't
(Notice that the largefile.txt
always has a word starting with ABC
included in every line. It's also impossible for one of the query words to start with "ABC")
The save file should look similar to:
this ABCword1
this ABCword2
next ABCword2
So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:
LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt
The problem is that not only is the query word not being saved but the -F"," '$2~/ABC/' command doesn't seem to be the correct one for fetching words beginning with 'ABC' either.
I also found ways of only using awk, but still haven't managed to adapt the code to save the word #2 as well instead of the whole line:
awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt