1

I am looking for a way to filter a large file (largefile.txt, ~12 GB), which has a long string on each line, for each of the words (one per line) in queryfile.txt. Instead of saving the whole line in which a query word is found, I'd like to save only that query word and a second word of which I only know the start (e.g. "ABC") and which I know for certain is on the same line as the first word.

For example, if queryfile.txt has the words:

this
next

And largefile.txt has the lines:

this is the first line with an ABCword  # contents of first line will be saved
and there is an ABCword2 in this one as well  # contents of 2nd line will be saved
and the next line has an ABCword2 too  # contents of this line will be saved as well
third line has an ABCword3    # contents of this line won't

(Notice that the largefile.txt always has a word starting with ABC included in every line. It's also impossible for one of the query words to start with "ABC")

The save file should look similar to:

this ABCword
this ABCword2
next ABCword2

So far I've looked into other similar posts' suggestions, namely combining grep and awk, with commands similar to:

LC_ALL=C grep -f queryfile.txt largefile.txt | awk -F"," '$2~/ABC/' > results.txt

The problem is that not only is the query word not being saved but the -F"," '$2~/ABC/' command doesn't seem to be the correct one for fetching words beginning with 'ABC' either.
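As a minimal sketch of why the awk stage finds nothing, assuming the first sample line from above: with `-F","` and no commas in the input, awk treats the whole line as one field, so `$2` is empty and `$2~/ABC/` can never match. The variable names below are only for illustration.

```shell
line='this is the first line with an ABCword'

# With -F"," and no commas in the input, the whole line is a single field.
nfields=$(printf '%s\n' "$line" | awk -F"," '{ print NF }')

# $2 is therefore empty, so the test $2 ~ /ABC/ fails and nothing is printed.
comma_hit=$(printf '%s\n' "$line" | awk -F"," '$2 ~ /ABC/')

# With default whitespace splitting, each word is a field and can be tested.
abc_word=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^ABC/) print $i }')

echo "fields with comma separator: $nfields"
echo "output of comma-based filter: '$comma_hit'"
echo "ABC word via whitespace fields: $abc_word"
```

Here the comma-based filter reports one field and prints nothing, while the field loop prints `ABCword`.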

I also found ways of only using awk, but still haven't managed to adapt the code to save the word #2 as well instead of the whole line:

awk 'FNR==NR{A[$1]=$1;next} ($1 in A){print}' queryfile.txt largefile.txt > results.txt
Piers
  • 21
  • 4
  • Your question says you want to match a `pattern` which is ambiguous (is that a string or a regexp? is it partial or full word or line?) and then your text says you want to search `for the words` but your first script is doing 2 partial regexp matches while your second is doing a full-word string match. So it's hard to know for sure what you're trying to do. Please read https://stackoverflow.com/q/65621325/1745001 to understand the problem and then replace "pattern" with string-or-regexp + full-or-partial and word-or-line and clarify your needs. – Ed Morton Nov 21 '21 at 15:40
  • Please [edit] your example to include lines you do not want printed and lines that have words after the last word you want to match. – Ed Morton Nov 21 '21 at 15:49
  • Are the query words always the first words in the lines of the largefile as the samples suggest? – James Brown Nov 21 '21 at 16:04
  • @JamesBrown they are not, I'll edit my samples, thanks! – Piers Nov 21 '21 at 16:07
  • Please include in your example a case where `ABCword` appears **before** `this` so we can see how that should be treated. And please get rid of the text that appears in your input preceded by a `#`, which I assume isn't really present in your real input and which you don't want us to include in the search when testing. – Ed Morton Nov 21 '21 at 16:32
  • And please, lose that `this` from the comment of the third record... – James Brown Nov 21 '21 at 16:36
  • Why are you using `-F","` in your code when there are no commas in your sample input/output? – Ed Morton Nov 21 '21 at 16:40
  • Also add a case where a word that starts with `ABC` is present in `queryfile.txt`. Basically consider and include the non-trivial cases in your example. – Ed Morton Nov 21 '21 at 16:42
  • You're getting multiple answers that are doing, for example, partial regexp instead of full-word-string matches and so will appear to work given your current input/output but will fail given more interesting examples. – Ed Morton Nov 21 '21 at 18:36
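The distinction these comments draw between a partial regexp match and a full-word string match can be sketched as follows. The file contents here are hypothetical and chosen only to expose the difference: `this` occurs as a whole word on one line and inside `thistle` on the other.

```shell
# Hypothetical sample data, created in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'this' > "$tmp/queryfile.txt"
printf '%s\n' 'a thistle near ABCword1' 'and this has ABCword2' > "$tmp/largefile.txt"

# Partial regexp match: "this" also matches inside "thistle", so both lines count.
partial=$(grep -c -f "$tmp/queryfile.txt" "$tmp/largefile.txt")

# Full-word string match: compare whole whitespace-separated fields
# against the query words stored as array keys.
full=$(awk 'FNR==NR { words[$1]; next }
            { for (i = 1; i <= NF; i++) if ($i in words) { n++; break } }
            END { print n + 0 }' "$tmp/queryfile.txt" "$tmp/largefile.txt")

echo "lines matched as partial regexp: $partial"
echo "lines matched as full word:      $full"
rm -rf "$tmp"
```

With this input the regexp approach counts 2 lines and the full-word approach counts 1, which is why answers that look equivalent on the question's sample can diverge on real data.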

4 Answers

1

2nd attempt based on updated sample input/output in question:

$ cat tst.awk
FNR==NR { words[$1]; next }
{
    queryWord = otherWord = ""
    for (i=1; i<=NF; i++) {
        if ( $i in words ) {
            queryWord = $i
        }
        else if ( $i ~ /^ABC/ ) {
            otherWord = $i
        }
    }
    if ( (queryWord != "") && (otherWord != "") ) {
        print queryWord, otherWord
    }
}

$ awk -f tst.awk queryfile.txt largefile.txt
this ABCword
next ABCword2

Original answer:

This MAY be what you're trying to do (untested):

awk '
    FNR==NR { word2lgth[$1] = length($1); next }
    ($1 in word2lgth) && (match(substr($0,word2lgth[$1]+1),/ ABC[[:alnum:]_]+/) ) {
        print substr($0,1,word2lgth[$1]+RSTART+RLENGTH-1)
    }
' queryfile.txt largefile.txt > results.txt
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
1

Given:

cat large_file
this is the first line with an ABCword 
and the next line has an ABCword2 too CRABCAKE 
third line has an ABCword3 
ABCword4 and this is behind

cat query_file
this
next

(The comments you have on each line of large_file have been removed; otherwise the ABCword3 line would print too, since there is a 'this' in its comment.)

You can actually do this entirely with GNU sed and tr manipulation of the query file:

pat=$(gsed -E 's/^(.+)$/\\b\1\\b/' query_file | tr '\n' '|' | gsed 's/|$//')
gsed -nE "s/.*(${pat}).*(\<ABC[a-zA-Z0-9]*).*/\1 \2/p; s/.*(\<ABC[a-zA-Z0-9]*).*(${pat}).*/\1 \2/p" large_file

Prints:

this ABCword
next ABCword2
ABCword4 this
dawg
  • 98,345
  • 23
  • 131
  • 206
  • Oh, I see, I didn't notice that first sed was doing that, now I understand what you're doing. I'll remove that comment, sorry for the noise. You could do that whole first `gsed | tr | gsed` with one `gsed -z` then (something like `s/(\n)(.)/\1|\\b\2/` maybe to start with) but it might be more trouble than it's worth figuring out the right sequence of instructions. – Ed Morton Nov 22 '21 at 13:47
0

Using sed in a while loop

$ cat queryfile.txt
this
next


$ cat largefile.txt
this is the first line with an ABCword # contents of this line will be saved
and the next line has an ABCword2 too # contents of this line will be saved as well
third line has an ABCword3 # contents of this line won't

$ while read -r line; do sed -n "s/.*\($line\).*\(ABC[^ ]*\).*/\1 \2/p" largefile.txt; done < queryfile.txt
this ABCword
next ABCword2
HatLess
  • 10,622
  • 5
  • 14
  • 32
  • only the query words were outputted, is there a way to include the respective "ABC"-words in it? – Piers Nov 21 '21 at 16:17
  • @Piers I was not aware the data had changed. Please check edit – HatLess Nov 21 '21 at 17:17
  • That would be very slow and fragile (e.g. it'd match `this` with `thistle`, match `ABC[^ ]*` with `ABCAKES` if `CRABCAKES` existed in the input, etc.). See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). The guys who invented shell also invented awk for shell to call to do text manipulation tasks like this. – Ed Morton Nov 21 '21 at 18:32
0

This one assumes your queryfile has more entries than there are words on a line in the largefile. Also, it does not treat your comments as comments but processes them as regular data, so if the sample is cut'n'pasted as is, the third record is a match too.

$ awk '
NR==FNR {                              # process queryfile
    a[$0]                              # hash those query words
    next
}
{                                      # process largefile
    for(i=1;i<=NF && !(f1 && f2);i++)  # iterate until both words found
        if(!f1 && ($i in a))           # f1 holds the matching query word
            f1=$i
        else if(!f2 && ($i~/^ABC/))    # f2 holds the ABC starting word 
            f2=$i
    if(f1 && f2)                       # if both were found
        print f1,f2                    # output them 
    f1=f2=""
}' queryfile largefile
James Brown
  • 36,089
  • 7
  • 43
  • 59
  • `f1` and `f2` were flags originally, hence the stupid naming. Sorry about that. And once they are carved to stone, there's no changing them anymore... – James Brown Nov 21 '21 at 16:38
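One behavioural difference between the two awk answers is worth seeing on a line that holds more than one query word or more than one ABC word. The loop above stops at the first of each, whereas the `tst.awk` loop in the first answer keeps scanning, so later words overwrite earlier ones. A minimal sketch, using a hypothetical input line and simplified re-implementations of both loops rather than the answers' exact code:

```shell
# Hypothetical input: one line holding two query words and two ABC words.
tmp=$(mktemp -d)
printf '%s\n' 'this' 'next' > "$tmp/queryfile.txt"
printf '%s\n' 'ABCone this next ABCtwo' > "$tmp/largefile.txt"

# First-match variant: stop scanning once both words have been found.
first=$(awk 'NR==FNR { a[$0]; next }
{
    f1 = f2 = ""
    for (i = 1; i <= NF && !(f1 != "" && f2 != ""); i++)
        if (f1 == "" && ($i in a)) f1 = $i
        else if (f2 == "" && $i ~ /^ABC/) f2 = $i
    if (f1 != "" && f2 != "") print f1, f2
}' "$tmp/queryfile.txt" "$tmp/largefile.txt")

# Last-match variant: scan every field, letting later words overwrite earlier ones.
last=$(awk 'NR==FNR { a[$0]; next }
{
    q = o = ""
    for (i = 1; i <= NF; i++)
        if ($i in a) q = $i
        else if ($i ~ /^ABC/) o = $i
    if (q != "" && o != "") print q, o
}' "$tmp/queryfile.txt" "$tmp/largefile.txt")

echo "first-match: $first"
echo "last-match:  $last"
rm -rf "$tmp"
```

On this line the first-match loop reports `this ABCone` and the last-match loop reports `next ABCtwo`. Neither prints one pair per query word, so which behaviour is wanted depends on how such lines should be treated in the real data.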