normalizing a column in awk

Question

I'm quite new to awk.

I am trying to write a script that takes an input file, finds the sum of the third column, and then prints columns 1, 2, and then the normalized third column. However, when I do this, I only seem to be doing this for the last row of my input file. I think I am missing something about how 'END' works. Any tips?

Thanks!

BEGIN {
     col= ARGV[2]
     ARGV[2] = ""
}

{s1 += $3}

END {  if (NR > 0){
                print s1;
                print $1, $2, $3/s1
            }
}

INPUT:

     0          2   8.98002e-05
     1          0   5.66203e-05
     2          2   2.20586e-05
     3          2   5.31672e-05
     4          2   2.17192e-07
     5         26   3.67908e-06
     6          1   1.0385e-05
     7          1   7.78022e-05
     8          0   5.47272e-05
     9          1   6.34726e-05
    10          1   0.000105879
    11          1   4.77847e-05
    12          0   3.05258e-05
    13          0   5.53268e-05
    14          1   7.8916e-05
    15          1   3.02601e-05
    16          1   3.81807e-05

s1: 0.000818803

OUTPUT:
0.000818803
0 2 0.109673
0.000818803
1 0 0.0691501
0.000818803
2 2 0.0269401
0.000818803
3 2 0.0649328
0.000818803
4 2 0.000265256
0.000818803
5 26 0.00449324
0.000818803
6 1 0.0126831
0.000818803
7 1 0.0950194
0.000818803
8 0 0.0668381
0.000818803
9 1 0.0775188
0.000818803
10 1 0.129309
0.000818803
11 1 0.0583592
0.000818803
12 0 0.037281
0.000818803
13 0 0.0675703
0.000818803
14 1 0.0963797
0.000818803
15 1 0.0369565
0.000818803
16 1 0.0466299

Welcome to SO. So you want to divide the third column of every row, not only the last one, right? If possible, please also add a sample of the input and the expected output to your post. — RavinderSingh13, Nov 17 '18 at 11:51
Also, I'm not sure why you are setting `ARGV[2]` to an empty string in the `BEGIN` section. — RavinderSingh13, Nov 17 '18 at 11:57
Thanks for the welcome! I want to divide every value in column three by the total value of column three. — n00bu, Nov 17 '18 at 12:05
Yes, that is what I tried, but you said it is not working, so please add samples to your post and let us know. — RavinderSingh13, Nov 17 '18 at 12:06
Sorry, it is not clear yet. Could you please explain the logic behind the expected output? — RavinderSingh13, Nov 17 '18 at 12:40
@RavinderSingh13 even without print s1 every other line, I would be very happy. I had added that as a check to my script. I am trying to normalise the third column by its sum. — n00bu, Nov 17 '18 at 12:44
Do you mean that you save the 1st row's 3rd column in a variable, add the 2nd row's 3rd column to it, and then print `$3/(current $3+previous $3)`? Is that right? — RavinderSingh13, Nov 17 '18 at 12:54
Sorry for the confusion. I want to take the sum of the third column of my input, set it equal to s1, and create a new column (or replace the old, either way) where each value is the value of the third column divided by s1. — n00bu, Nov 17 '18 at 12:57
Sorry, I am not able to get the same result :( I will delete my answer for now. — RavinderSingh13, Nov 17 '18 at 13:12
What is your *expected output* for the input you've provided? — ghoti, Nov 17 '18 at 15:08

jas · Answer 1 · 2018-11-17T14:53:04.653

For this, one way or another, you'll have to make two passes through the records. One way is to read the file itself twice as in the first method shown below.

The first pass simply accumulates the total of column 3 in s1. The second pass prints the first two columns with the normalized third.

Note that you have to provide the file twice on the command line so that awk processes it twice!

$ awk 'NR == FNR {s1 += $3; next} {print $1, $2, $3/s1}' file file
0 2 0.109673
1 0 0.0691501
2 2 0.0269401
3 2 0.0649329
4 2 0.000265256
5 26 0.00449324
6 1 0.0126832
7 1 0.0950195
8 0 0.0668381
9 1 0.0775188
10 1 0.12931
11 1 0.0583592
12 0 0.037281
13 0 0.0675704
14 1 0.0963798
15 1 0.0369565
16 1 0.0466299
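To try the two-pass idea on something small enough to check by hand, here is a minimal sketch. The three-row sample and the temporary file are made up for illustration (they are not from the question); column 3 sums to 8, so the normalized values should come out as 1/8, 3/8 and 4/8.

```shell
# Hypothetical three-row sample; column 3 sums to 8.
tmp=$(mktemp)
printf '0 2 1\n1 0 3\n2 2 4\n' > "$tmp"

# Pass 1 (NR == FNR): accumulate the total of column 3, then skip to the next record.
# Pass 2: print columns 1 and 2 followed by the normalized column 3.
two_pass=$(awk 'NR == FNR { s1 += $3; next } { print $1, $2, $3 / s1 }' "$tmp" "$tmp")
echo "$two_pass"
# prints:
# 0 2 0.125
# 1 0 0.375
# 2 2 0.5

rm -f "$tmp"
```

The file name appears twice on the awk command line on purpose: the first read only builds the sum, and the second read does the printing.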

Another way, which is closer to where you were headed with your attempt, is to only read the file once, keeping all the row information in memory while you simultaneously sum column 3.

Then in the END block, which is run once after all records are read and the sum is fully accumulated, you iterate through the arrays to print out the results.

 awk '    { s1 += $3; a[NR] = $1 OFS $2; b[NR] = $3 }
      END { for (i=1; i<=NR; ++i) print a[i], b[i] / s1 }' file

This second method has the obvious downside of using much more memory; in fact, with a very large file this approach may not even be feasible.
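As a rough side-by-side, the sketch below runs a stripped-down version of the attempt from the question next to the buffered version, on a made-up three-row sample (the sample and the temporary file are assumptions for illustration). The point is that END runs only once, so a single print there can produce at most one row; in the awk implementations I know of, that row is built from the last record read.

```shell
# Hypothetical three-row sample; column 3 sums to 8.
tmp=$(mktemp)
printf '0 2 1\n1 0 3\n2 2 4\n' > "$tmp"

# Shape of the original attempt: the sum is correct, but END runs once,
# so this prints a single line instead of one line per input row.
awk '{ s1 += $3 } END { if (NR > 0) print $1, $2, $3 / s1 }' "$tmp"

# Buffered shape: remember every row while summing, then replay the rows in END.
one_pass=$(awk '{ s1 += $3; a[NR] = $1 OFS $2; b[NR] = $3 }
    END { for (i = 1; i <= NR; ++i) print a[i], b[i] / s1 }' "$tmp")
echo "$one_pass"
# prints:
# 0 2 0.125
# 1 0 0.375
# 2 2 0.5

rm -f "$tmp"
```

The arrays `a` and `b` grow with the number of rows, which is where the extra memory goes.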

If you're not already familiar with the NR == FNR construct, see the question 'What is "NR==FNR" in awk?'. Also see the section on "Two-file processing" at https://backreference.org/2010/02/10/idiomatic-awk/.
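For a quick feel of how the two counters behave, here is a small sketch that prints both while the same file is read twice. The two-line sample file is made up for illustration. NR keeps counting across all input, while FNR restarts at 1 for each file argument, so NR == FNR holds only during the first read.

```shell
# Hypothetical two-line sample file.
tmp=$(mktemp)
printf 'a\nb\n' > "$tmp"

# Print NR, FNR, and which pass the condition NR == FNR selects.
counters=$(awk '{ print NR, FNR, (NR == FNR ? "first pass" : "second pass") }' "$tmp" "$tmp")
echo "$counters"
# prints:
# 1 1 first pass
# 2 2 first pass
# 3 1 second pass
# 4 2 second pass

rm -f "$tmp"
```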

Thank you very much! It works exactly as I was hoping! =D — n00bu, Nov 17 '18 at 15:28
