1

I have a tabular file with 6 columns. What I need to do, is to add a 7th column that counts the occurrence of the value from the column 3. I did it with Excel, adding the formula

=countif(C:C,$C1)

But the files are huge, and I have lots of them

For example:

My input is this one:

0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2
0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2
0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1
0   SL3.0ch03   7675649 21M CTCACTCCAAACTCATCATAC   NM:i:2
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1

And I need an output like this one:

0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2  3
0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2  3
0   SL3.0ch03   7675648 21M GATCACTCCAAACTCATCATA   NM:i:2  3
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1  5
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1  5
0   SL3.0ch03   7675649 21M CTCACTCCAAACTCATCATAC   NM:i:2  5
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1  5
0   SL3.0ch03   7675649 21M ATCACTCCAAACTCATCATAC   NM:i:1  5
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1  4
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1  4
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1  4
0   SL3.0ch03   7675650 21M TCACTCCAAACTCATCATACT   NM:i:1  4

I've tried a few things that I found:

awk '{h[$3]++}; END { for(k in h) print k, h[k] }' input.tab

That actually displays the 7th column, but not the rest. I also found that this code:

awk '{print $1,$2,$3,$4,$5,$6}'

prints all the columns, so I thought "this should work":

awk '{print $1,$2,$3,$4,$5,$6,$7};{h[$3]++}; END { for(k in h) print k, h[k] }' input.tab > output.tab

but it obviously didn't. The best thing I could achieve was to print all 6 original columns and the output I need at the bottom of the file, but I need it as a 7th column.

I'm familiar with basic shell commands, but not with AWK language.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Agu_S
  • 13
  • 4

2 Answers2

1

Unfortunately, you'll have to read the file twice for this to work, which depending on size, can be pretty inefficient.

Either way, you were close: awk '{h[$3]++}; END { for(k in h) print k, h[k] }' input.tab is building a map in h[] keyed on the value of $3 and storing the count of it's occurrences, then after all lines are read, printing out that map.

What you want seems like this:

awk 'FNR==NR{h[$3]++;next} {$7=h[$3]; print}' input.tab input.tab

The first read through the file we build the map of $3 to it's count like you were, then the second read (when FNR!=NR) we add that value as $7, then print the resulting line out.

e.g.

$awk 'FNR==NR{h[$3]++;next} {$7=h[$3]; print}' input.tab input.tab
0 SL3.0ch03 7675648 21M GATCACTCCAAACTCATCATA NM:i:2 3
0 SL3.0ch03 7675648 21M GATCACTCCAAACTCATCATA NM:i:2 3
0 SL3.0ch03 7675648 21M GATCACTCCAAACTCATCATA NM:i:2 3
0 SL3.0ch03 7675649 21M ATCACTCCAAACTCATCATAC NM:i:1 5
0 SL3.0ch03 7675649 21M ATCACTCCAAACTCATCATAC NM:i:1 5
0 SL3.0ch03 7675649 21M CTCACTCCAAACTCATCATAC NM:i:2 5
0 SL3.0ch03 7675649 21M ATCACTCCAAACTCATCATAC NM:i:1 5
0 SL3.0ch03 7675649 21M ATCACTCCAAACTCATCATAC NM:i:1 5
0 SL3.0ch03 7675650 21M TCACTCCAAACTCATCATACT NM:i:1 4
0 SL3.0ch03 7675650 21M TCACTCCAAACTCATCATACT NM:i:1 4
0 SL3.0ch03 7675650 21M TCACTCCAAACTCATCATACT NM:i:1 4
0 SL3.0ch03 7675650 21M TCACTCCAAACTCATCATACT NM:i:1 4

Also, if these are truly tab delimited,you'll want to add BEGIN{FS=OFS="\t"} to set the delimiter and output delimiter to tab, as it defaults to space as my output above shows.

awk 'BEGIN{FS=OFS="\t"} FNR==NR{h[$3]++;next} {$7=h[$3]; print}' input.tab input.tab

zzevannn
  • 3,414
  • 2
  • 12
  • 20
  • 1
    `{print $0. h[$3]}` or `{$7=h[$3]}1` might be better. Also `input.tab{,}` eliminates double typing the filename. – karakfa Feb 27 '18 at 22:10
  • Thanks a lot zzevannn, this worked just fine! And thanks karakfa for the suggestion, as well! – Agu_S Feb 27 '18 at 22:42
0

I assume that records with the same 3rd field are consecutive:

awk 'b!=$3{for(j=0;j<i;j++){print a[j],i};delete a;b=$3;i=0;}{a[i++]=$0}END{for (j=0;j<i;j++){print a[j],i}}' file

details:

awk '
    b!=$3 { # when the stored 3rd field is different from the current
        for(j=0;j<i;j++){ print a[j],i }; # display stored records
        delete a; # delete the array
        b=$3; # store the current "new" field
        i=0;
    }
    { a[i++]=$0 } # store the current record and increment the index
    # display the last stored records
    END {for (j=0;j<i;j++){print a[j],i}}
' file
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125