
I need to know whether I can match an awk value against another file while inside a piped command, like below:

  somebinaryGivingOutputToSTDOUT |  grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'

From here I need to check whether the computed value $4*10^10+$6 is present in (matches) any column value of another file. If it is present, print the row; otherwise just move on.

The file where the value needs to be matched is as below:

a,b,c,d,e
1,2,30000000000,3,4

I need to match against the 3rd column of the above file.

I would ideally like this to be part of the same command, because without this check it prints more than 100 million rows (a very large file).

I have already read this question.

Adding more info, breaking my command into parts.

part1-command:

 somebinaryGivingOutputToSTDOUT |  grep -A3 "sometext" | grep "Something:"

part1-output (just showing one iteration's output):

Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463

part2-command: now I use awk

awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'

part2-command output: currently the values below are printed (note how I computed 1*10^10+10588429 and got 10010588429):

1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413

Now here I need to put a check (within the command, near awk) to print a row only if 10010588429 is present in another file (say another_file.csv, as below).

another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50

The output should only be:

1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413

So for every row of the awk output we check for an entry in column C of file2.

  • would be better if you gave sample input file(s) and expected output for that... – Sundeep Apr 07 '17 at 13:20
  • do you know you can use `awk` for `grep` functionality as well? If the second lookup file is small (compared to your memory), you can read into an array and have fast lookups. – karakfa Apr 07 '17 at 13:24
  • @karakfa yes file to be looked up is small, max 1000 lines – pythonRcpp Apr 07 '17 at 13:28
  • @Sundeep done ! – pythonRcpp Apr 07 '17 at 13:39
  • that helps, but it would be more complete to include input sample before the grep+awk combo... everything might be easier to do with single awk command considering entire problem instead of middle approach – Sundeep Apr 07 '17 at 13:48
  • added Ed Morton ! – pythonRcpp Apr 07 '17 at 16:19
  • Maybe it's just me but could you please edit your question to add the statement "This is my sample input file:" above your sample input file and "This is my desired output file:" above your desired output? Right now I'm JUST not seeing it. You have have a 1-line file which you say is "part 1 output" (but where is the input that it came from?) then you have part2-something with 3 lines (are you trying to generate 3 lines from that 1 line and if so what's the logic?). I'm still completely lost. Instead of adding to the question just clean it up to say THIS is the input, THIS is the output, etc. – Ed Morton Apr 07 '17 at 17:19

3 Answers


I'll post a template which you can utilize for your computation:

awk 'BEGIN   {FS=OFS=","}
     NR==FNR {lookup[$3]; next}
  /sometext/ {c=4}
 c&&c--&&/somemoretext/ {value=$4*10^10+$6   # implement your computation here
                         if(value in lookup)
                             print "what you want"}' lookup.file FS=':' grep.files...

Here awk loads the values in the third column of the first file (which is comma-delimited) into the lookup array (a hashmap in disguise). For the next set of files it sets the delimiter to :, and, similar to grep -A3, looks within a 3-line distance of the first pattern for the second pattern, does the computation, and prints what you want.

In awk you also have more control over which column your pattern matches; here I just replicated the grep example.
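The `c&&c--` window trick can be checked in isolation (toy input of my own): setting c=4 on the first pattern covers the matching line plus the next 3 lines, mirroring grep -A3.

```shell
printf '%s\n' x sometext a 'somemoretext near' b 'somemoretext far' |
awk '/sometext/{c=4} c&&c--&&/somemoretext/{print}'
# prints only: somemoretext near
```

The second `somemoretext` line falls outside the 4-line window, so it is dropped.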

Here is another, simplified example to focus on the core of the problem:

awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' | 
awk 'NR==FNR{lookup[$1]; next} 
     $1 in lookup' perfect.numbers -  

The first process creates 1000 random records, and the second one filters those whose first field is in the lookup table.

28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929

For the lookup file:

$ head perfect.numbers
6
28
496
8128

The piped data is substituted as the second file at -.
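A deterministic variant of the same idea (replacing the random generator with fixed input, and a small perfect.numbers built on the spot) shows the `-` substitution:

```shell
printf '6\n28\n496\n8128\n' > perfect.numbers
printf '%s\n' '6 a' '7 b' '28 c' |
awk 'NR==FNR{lookup[$1]; next} $1 in lookup' perfect.numbers -
# prints:
# 6 a
# 28 c
```

awk finishes reading perfect.numbers (NR==FNR), then `-` hands it the pipe, where only first fields present in the lookup array survive.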

karakfa

Using the associative-array approach from the previous question, include a hyphen in place of the first file to direct awk to read the input stream.

Example:

grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' \
    'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}
     NR==FNR {
         query[$4*10^10+$6];
         out[$4*10^10+$6]=$4 OFS $6 OFS ($4*10^10+$6) OFS $8;
         next
     }
     ($3+0) in query {
         print out[$3+0]
     }' - FS=',' another_file.csv > output.csv

More info on the merging process is in the answer cited in the question:

Using AWK to Process Input from Multiple Files

  • isnt it creating a key from grepped file ?(I basically grep stdout from a process). If that is the case, it will have memory issues, since it will be very very large. Intent is grepping smaller file entries each time before deciding to print or not print the stream grepped. I hope my explanation is clear enough. Thanks – pythonRcpp Apr 07 '17 at 14:21

You can pipe your grep or awk output into a while read loop, which gives you some degree of freedom. There you can decide whether to forward a line:

grep -A3 "sometext" | grep "somemoretext" | while read LINE; do
    COMPUTED=$(echo $LINE | awk -F '[:|]' 'BEGIN{OFS=","}{print $4,$6,$4*10^10+$6,$8}')
    if grep $COMPUTED /the/file/to/search &>/dev/null; then
        echo $LINE
    fi
done | cat -
Robin479
  • no. never do this. see [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) for some of the reasons. also google unquoted variables, all upper case variable names, UUOC,... – Ed Morton Apr 07 '17 at 15:41