7

I'd essentially like to combine the power of

grep -f 

with

awk '{ if ($2 == "this is where I want to plug in a file of fixed-string patterns") print $0 }'

Which is to say, I'd like to search a specific column of a file (File 1) against an input file of fixed-string patterns (File 2). If a match is found, simply redirect the matching lines:

> outputfile.txt

From a previous post, this awk line is really close:

awk 'NR==FNR{a[$0]=1;next} {n=0;for(i in a){if($0~i){n=1}}} n' file1 file2

Taken from Obtain patterns in one file from another using ack or awk or better way than grep?

But it doesn't search a specific column of file 1. I'm open to other tools as well.

Chris J. Vargo

3 Answers

4

The example you found is indeed very close to what you want; the only difference is that you don't want to match against the whole line ($0).

Modify it to something like this:

awk 'NR==FNR { pats[$0]=1; next } { for(p in pats) if($2 ~ p) { print $0; break } }' patterns file

If you only need a fixed string match, use the index() function instead, i.e. replace $2 ~ p with index($2, p).
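To illustrate the difference (with invented file names and sample data), a pattern containing a regex metacharacter matches more lines under ~ than under index():

```shell
# Hypothetical sample files for illustration only.
printf 'f.o\n' > patterns.txt
printf 'r1 foo x\nr2 f.o y\nr3 bar z\n' > data.txt

# Regex match on column 2: "." matches any character, so both foo and f.o match.
awk 'NR==FNR { pats[$0]=1; next } { for (p in pats) if ($2 ~ p) { print; break } }' patterns.txt data.txt
# -> r1 foo x
# -> r2 f.o y

# Fixed-string match with index(): only the literal "f.o" matches.
awk 'NR==FNR { pats[$0]=1; next } { for (p in pats) if (index($2, p)) { print; break } }' patterns.txt data.txt
# -> r2 f.o y
```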

You could also provide the column number as an argument to awk, e.g.:

awk -v col="$col" 'NR==FNR { pats[$0]=1; next } { for(p in pats) if($col ~ p) { print $0; break } }' patterns file

Edit - whole field matching

You can accomplish this with the == operator:

awk -v col="$col" 'NR==FNR { pats[$0]=1; next } { for(p in pats) if($col == p) { print $0; break } }' patterns file
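For example (sample data invented here), with == the pattern foo no longer matches the longer field food, which index() would still accept as a substring:

```shell
# Hypothetical sample files for illustration only.
printf 'foo\n' > patterns.txt
printf 'r1 foo x\nr2 food y\n' > data.txt

col=2
awk -v col="$col" 'NR==FNR { pats[$0]=1; next } { for (p in pats) if ($col == p) { print; break } }' patterns.txt data.txt
# -> r1 foo x   (food is only a substring match, not an exact field match)
```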
Thor
  • Thanks! This works well on smaller numbers of patterns. My patterns are fixed strings, and I only want exact matches. I know that for instance when using Grep, adding a fixed string option really cuts down on processing time. Is there an awk equivalent? – Chris J. Vargo Jan 24 '13 at 23:08
  • @ChrisJ.Vargo: yes, the `index` function does fixed string matching (as mentioned in the answer). – Thor Jan 24 '13 at 23:25
  • Thanks, but: awk 'NR==FNR { pats[$0]=1; next } { for(p in pats) if(index($5, p)) { print $0; break } }' 1.txt PrimaryTweets.tsv > 1Method2index.tsv Doesn't return exact matches. Is there a way to force an exact match? – Chris J. Vargo Jan 24 '13 at 23:33
  • @ChrisJ.Vargo: so what you really want is whole word matching or is it whole field matching? See the edit for whole field matching. If you mean whole word, you either need to use regex or do some further field splitting. – Thor Jan 25 '13 at 00:20
3

This is using awk:

awk 'BEGIN { while ((getline l < "patterns.txt") > 0) PATS[l] } $2 in PATS' file2

Where file2 is the file you are searching, and patterns.txt is a file with one exact pattern per line. The implicit { print } has been omitted, but you can add it and do anything you like there.

The condition $2 in PATS will be true if the second column is exactly one of the patterns.
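A worked example with invented data. Note that $2 in PATS is a single hash lookup per line rather than a loop over every pattern, which is why this approach scales well to large pattern lists:

```shell
# Hypothetical sample files for illustration only.
printf 'foo\nbar\n' > patterns.txt
printf 'r1 foo x\nr2 food y\nr3 bar z\n' > data.txt

# The > 0 check stops the getline loop cleanly if the file is missing or unreadable.
awk 'BEGIN { while ((getline l < "patterns.txt") > 0) PATS[l] } $2 in PATS' data.txt
# -> r1 foo x
# -> r3 bar z
```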

If the lines in patterns.txt are to be treated as regexps, modify the condition to

ok=0;{for (p in PATS) if ($2 ~ p) ok=1}; ok

So, for example, to test $2 against all the regexps in patterns.txt, and print the third column if the 2nd column matched:

awk 'BEGIN { while ((getline l < "patterns.txt") > 0) PATS[l] }
     { ok=0; for (p in PATS) if ($2 ~ p) ok=1 }
     ok { print $3 }' < file2

And here's a version in perl. It is similar to the awk version, except that it uses a regexp (rather than field splitting) to extract the fields.

perl -ne 'BEGIN{open $pf, "<patterns.txt"; %P=map{chomp;$_=>1}<$pf>} 
   /^\s*([^\s]+)\s+([^\s]+).*$/ and exists $P{$2} and print' < file2

Taking that apart:

BEGIN{
  open $pf, "<patterns.txt"; 
  %P = map {chomp;$_=>1} <$pf>;
}

Reads your patterns file into a hash %P for fast lookup.

/^\s*([^\s]+)\s+([^\s]+).*$/ and  # extract your fields into $1, $2, etc
exists $P{$2} and                 # See if your field is in the patterns hash
print;                            # just print the line (you could also 
                                  # print anything else; print "$1\n"; etc)

It gets slightly shorter if your input file is tab-separated (and when you know that there's exactly one tab between fields). Here's an example that matches the patterns against the 5th column:

 perl -l -F'\t' -ane '
    BEGIN{open $pf, "<patterns.txt"; %P=map{chomp;$_=>1}<$pf>} 
    exists $P{$F[4]} and print' file2

This is thanks to perl's -F switch, which tells perl to auto-split each line into columns on the given separator (\t in this case). The -l switch strips the trailing newline on input, so an exact match still works when the column you test is the last one on the line. Note that since arrays in perl start from 0, $F[4] is the 5th field.

Faiz
  • Impressive. Seems like I can no longer avoid dealing with `awk`. Thank you for the detailed explanations. – J. Katzwinkel Jan 23 '13 at 02:10
  • Thanks! This works well on smaller numbers of patterns. On my large list of 1,000, it seems to run out of memory. My patterns are fixed strings, and I only want exact matches. I know that for instance when using Grep, adding a fixed string option (-F) really cuts down on processing time. Is there an awk equivalent? – Chris J. Vargo Jan 24 '13 at 23:07
  • What if you try the `perl` version I just posted? – Faiz Jan 25 '13 at 01:16
  • What version of `awk` and what kind of `OS` are you running on? – Faiz Jan 25 '13 at 02:22
  • When fed a large list, the perl version only seems to retrieve one pattern, and then exits. Running Mountain Lion. – Chris J. Vargo Jan 25 '13 at 15:39
  • You mean the last `perl` example? Also, I noticed that you refer to your input file as `file2`.. Curious to know What's the output of `xxd file1 | head`? – Faiz Jan 25 '13 at 17:00
0

I am not quite sure what role the distinction of columns plays in this scenario. Are you processing some kind of CSV file? Do you take care of column delimiters in the regex list file? If there are no distinct columns separated by certain delimiters in your file, you could just use grep:

grep -o -f file2 file1

If columns are an issue, maybe something like this:

grep -o "[^,]*" file1 | grep -f file2

where , is the delimiter.
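For the tab-separated, exact-match case discussed in the comments below, one possible sketch (file and pattern names invented here) is to let cut isolate the column and grep -nxF find exact fixed-string matches, then pull the original lines back out by line number:

```shell
# Hypothetical sample files for illustration only.
printf 'foo\n' > patterns.txt
printf 'a\tb\tc\td\tfoo\ne\tf\tg\th\tfood\n' > file1

# cut -f5   : extract column 5 (tab is cut's default delimiter)
# grep -nxF : line-numbered, whole-line, fixed-string matches against the pattern file
cut -f5 file1 | grep -nxFf patterns.txt | cut -d: -f1 |
while read -r n; do sed -n "${n}p" file1; done
# -> a	b	c	d	foo
```

This keeps grep's fast -F matching, but it runs one sed per matching line, so it is only a sketch — for many matches the awk hash-lookup answers above will be faster.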

J. Katzwinkel
  • It's tab separated, fifth column. Can grep just skip the first 55 characters, and then only return a match if it's found before the first tab? This would force it to start in the 5th column, and stop before the next one. I like grep, because with the fixed strings option, it is much quicker than awk. – Chris J. Vargo Jan 24 '13 at 23:11
  • There is a bug in `grep` regarding tabs, but by using the Perl switch `-P`, you can pass them like you would expect: `\t`. However, it seems more suitable to use `cut` here, whose delimiter is tab by default and that could prepare your file1 (`-f 5`). – J. Katzwinkel Jan 27 '13 at 11:03