
I have data like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK

I want to count how many of each letter there is; for a single sequence, I count like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR

cat input.txt | grep -v ">" | fold -w1 | sort | uniq -c

   6 A
   9 C
  10 D
   1 E
   7 F
  18 G
   5 H
   4 I
   7 K
  21 L
  15 N
   7 P
   6 Q
  11 R
  16 S
  18 T
   7 V
   8 W
   7 Y

However, I want to calculate this for all sequences in a better and more efficient way, especially when the data is huge.

Learner
  • are you reporting counts line by line or once for the whole file? – karakfa Feb 15 '19 at 15:50
  • @karakfa for the entire file – Learner Feb 15 '19 at 15:58
  • Possible duplicate of [Bash script to find the frequency of every letter in a file](https://stackoverflow.com/questions/3966820/bash-script-to-find-the-frequency-of-every-letter-in-a-file) – Marcus Feb 15 '19 at 17:22

4 Answers

5

Counting characters in strings can easily be done with awk. To do this, you make use of the function gsub:

gsub(ere, repl[, in]) Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.

sub(ere, repl[, in ]) Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.

source: Awk Posix Standard
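
Because gsub returns the number of substitutions it made, deleting a character also counts its occurrences as a side effect. A minimal illustration (the input string here is just a made-up example):

$ echo "banana" | awk '{ n = gsub(/a/, ""); print n, $0 }'
3 bnn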

The following two functions perform the counting in this way:

function countCharacters(str) {
    while(str != "") { c=substr(str,1,1); a[toupper(c)]+=gsub(c,"",str) }
}

Or, if long runs of identical consecutive characters are expected, the following variant might shave off a couple of seconds:

function countCharacters2(str) {
    n=length(str)
    while(str != "") { c=substr(str,1,1); gsub(c"+","",str);
       m=length(str); a[toupper(c)]+=n-m; n=m
    }
}
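
For completeness, here is a minimal sketch showing how either function can be wired into a full program; the !/^>/ rule and the END block are assumptions added here for illustration, not part of the original answer. As in the functions themselves, each character is used as a regular expression, which is safe for plain amino-acid letters:

awk 'function countCharacters(str) {
         while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c,"",str) }
     }
     !/^>/ { countCharacters($0) }      # skip FASTA headers, count sequence lines
     END   { for (c in a) print c, a[c] }' file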

Below you find four implementations based on the first function. The first two run on a standard awk; the latter two use bioawk, an awk version optimized for FASTA files:

1. Read the sequences and process them line by line:

awk '!/^>/{s=$0; while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
     END {for(c in a) print c,a[c]}' file

2. Concatenate all sequences and process them at the end:

awk '!/^>/{s=s $0 }
     END {while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
         for(c in a) print c,a[c]}' file

3. Same as 1 but use bioawk:

bioawk -c fastx '{while ($seq!=""){ c=substr($seq,1,1);a[c]+=gsub(c,"",$seq) } }
                 END{ for(c in a) print c,a[c] }' file

4. Same as 2 but use bioawk:

bioawk -c fastx '{s=s $seq}
                 END { while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
                       for(c in a) print c,a[c]}' file

Here are some timing results based on this FASTA file:

OP            : grep,sort,uniq : 47.548 s
EdMorton 1    : awk            : 39.992 s
EdMorton 2    : awk,sort,uniq  : 53.965 s
kvantour 1    : awk            : 18.661 s
kvantour 2    : awk            :  9.309 s
kvantour 3    : bioawk         :  1.838 s
kvantour 4    : bioawk         :  1.838 s
karakfa       : awk            : 38.139 s
stack0114106 1: perl           : 22.754 s
stack0114106 2: perl           : 13.648 s
stack0114106 3: perl (zdim)    :  7.759 s

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

kvantour
  • Thanks. Unfortunately, this is not really optimal: this is more an O(n^2) solution than O(n) (due to `gsub`). – kvantour Feb 15 '19 at 16:41
  • `gsub` counting is a smart idea. – karakfa Feb 15 '19 at 17:10
  • Wow, this `bioawk` is really efficient. Thanks for all of these :) – Til Feb 15 '19 at 17:17
  • It just occurred to me that for `1`, saving $0 in a string and then match()/gsub()-ing on THAT might be more efficient than operating directly on $0, since every time you change $0 you might invoke field splitting again (depending on your awk implementation - some only do field splitting if a field is explicitly mentioned in the script). – Ed Morton Feb 18 '19 at 13:27
  • @EdMorton I just ran the test. I did not see a difference in time, but you are correct. With some awk implementations, this will be different. (updated the implementation) – kvantour Feb 18 '19 at 13:37
  • I wish everyone on StackExchange explained their answers this well! – Michael Scheper Sep 01 '22 at 06:08
4

With any awk in any shell on any UNIX box:

$ cat tst.awk
!/^>/ {
    for (i=1; i<=length($0); i++) {
        cnt[substr($0,i,1)]++
    }
}
END {
    for (char in cnt) {
        print char, cnt[char]
    }
}

$ awk -f tst.awk file
A 107
N 67
P 107
C 41
Q 88
D 102
E 132
R 101
F 65
S 168
G 140
T 115
H 52
I 84
V 101
W 27
K 114
Y 30
L 174
M 39

or if you prefer:

$ awk -v ORS= '!/^>/{gsub(/./,"&\n"); print}' file | sort | uniq -c
    107 A
     41 C
    102 D
    132 E
     65 F
    140 G
     52 H
     84 I
    114 K
    174 L
     39 M
     67 N
    107 P
     88 Q
    101 R
    168 S
    115 T
    101 V
     27 W
     30 Y
Ed Morton
  • @Ed Morton `awk: syntax error at source line 1 source file tst.awk context is cat >>> tst. <<< awk awk: bailing out at source line 11` – Learner Feb 15 '19 at 16:24
  • @Learner Look again at my answer, you're not doing what it shows. Seems like you might have copied the `cat` line into the awk script. – Ed Morton Feb 15 '19 at 16:28
  • I was thinking about this solution when I saw this question, but I thought multiple `substr` calls would be a lot less efficient than a single split. However, I'm not sure about this; am I thinking right this time? – Til Feb 15 '19 at 16:28
  • @Tiw splitting on a null char isn't portable - it'll split into chars in some awks, leave the whole line as-is in others, and can do anything else and still be POSIX compliant since that's undefined behavior (see the quick test below). I really don't know if a single split() would be more efficient than multiple substr()s or not, but IMHO the portability is more important in this case. – Ed Morton Feb 15 '19 at 16:30
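
A quick way to check which behavior a given awk implements (a test added here for illustration; gawk prints 3, while an implementation that leaves the string unsplit prints 1):

awk 'BEGIN { n = split("ABC", chars, ""); print n }'
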
2

Try this Perl solution for better performance.

$ perl -lne ' 
            if( ! /^>/ ) { while(/./g) { $kv{$&}++} }  
        END { for ( sort keys %kv) { print "$_ $kv{$_}" }} 
' learner.txt
A 107
C 41
D 102
E 132
F 65
G 140
H 52
I 84
K 114
L 174
M 39
N 67
P 107
Q 88
R 101
S 168
T 115
V 101
W 27
Y 30

$

One more solution using Perl, optimized for performance.

$ time perl -lne ' 
     if( ! /^>/ ) { for($i=0;$i<length($_);$i++) 
     { $x=substr($_,$i,1); $kv{$x}++ } }  
    END { for ( sort keys %kv) { print "$_ $kv{$_}" }} 
' chrY.fa
A 2994088
C 1876822
G 1889305
N 30812232
T 3002884
a 4892104
c 3408967
g 3397589
n 140
t 4953284

real    0m15.922s
user    0m15.750s
sys     0m0.108s

$

Edit with further performance optimizations

All timings reported below are averages over 3-5 runs on a desktop, done at around the same time but swapped around to avoid pronounced caching effects.

Changing the C-style for loop to for my $i (0..length($_)) speeds the second solution from 9.2 seconds to 6.8 seconds.

Then, also removing a scalar ($x) at each operation, with

if (not /^>/) { for $i (0..length($_)) { ++$kv{ substr($_,$i,1) } } }

speeds this up to 5.3 seconds.

Further reducing variable use, by copying $_ and thus freeing up the loop to use $_

if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }

only helps a little, running at 5.2 seconds.

This compares with the awk solution, listed as kvantour 2 in the timing comparisons in kvantour's answer, which runs at 6.5 seconds (on this system).

Of course none of this can be compared to the optimized bioawk (C-code?) program. For that we'd need to write this in C (which is not very hard using Inline C).

Note that removing a sub call (to substr) for every character by using

if (not /^>/) { ++$kv{$_} for split //; }

results in "only" a 6.4 seconds average, not as good as the above tweaks; this was a surprise.

These times are on a desktop with v5.16. On v5.24, on the same machine, the best-case (substr with no extra variables in the loop) time is 4.8 seconds while the one without the substr (but with split) is 5.8 seconds. It's nice to see that newer versions of Perl perform better, at least in these cases.

For reference and easy timing by others, here is the complete code for the best run:

time perl -lne'
    if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
    END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
zdim
stack0114106
  • added your solution to the timing table: 22.745 seconds – kvantour Feb 18 '19 at 13:57
  • Thanks @kvantour. I added one more solution by optimizing it; on my machine it took 15.92 secs for your big file. `bioawk` is great! – stack0114106 Feb 18 '19 at 14:31
  • added your second solution (13sec) – kvantour Feb 18 '19 at 14:35
  • @kvantour Note further improvements. – zdim Feb 20 '19 at 20:24
  • @stack0114106 I took the liberty of adding a section to your answer. If that's not to your liking I apologize, please let me know and I'll split it into a separate answer. I thought that it best fits here since it simply tweaks your code. – zdim Feb 20 '19 at 20:31
  • @zdim I'll add the timing to the list tomorrow. You might have added it as a separate solution though. Nice explanation – kvantour Feb 20 '19 at 21:22
  • @zdim.. great research and optimizations..learning a lot.. you are welcome to add this.. thank you! – stack0114106 Feb 21 '19 at 04:27
  • @kvantour "_might have added it as a separate solution_" -- it does mostly tweak the code in this answer so I felt it best fit here. (A reminder to add it to your list, if you wish.) – zdim Mar 01 '19 at 18:21
  • @zdim added your solution – kvantour Mar 11 '19 at 11:18
1

Not sure how much faster this would be, but if you try it, please post your timings.

$ awk '!/^>/ {n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++} 
       END   {for(k in c) print k,c[k]}' file | sort

A 6
C 9
D 10
E 1
F 7
G 18
H 5
I 4
K 7
L 21
N 15
P 7
Q 6
R 11
S 16
T 18
V 7
W 8
Y 7

This reports counts for the whole file, not line by line. As noted in the comments below, not all awks support splitting on an empty string.

Here are the timings of the three approaches:

$ time grep -v ">" filey | fold -w1 | sort | uniq -c >/dev/null

real    0m11.470s
user    0m11.746s
sys     0m0.260s

$ time awk '{n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]++]} END{for(k in c) print k,c[k]}' filey >/dev/null 

real    0m7.441s
user    0m7.334s
sys     0m0.060s

$ time awk '{n=length($0); for(i=1;i<=n;i++) c[substr($0,i,1)]++} END{for(k in c) print k,c[k]}' filey >/dev/null

real    0m5.055s
user    0m4.979s
sys     0m0.047s

for the test file

$ wc filey

  118098   649539 16828965 filey

It surprised me that substr is faster than split, perhaps due to array allocation.
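
One way to isolate the two operations outside the counting problem is a rough micro-benchmark along these lines (a sketch, not from the original answer; it assumes an awk where splitting on an empty string splits into characters, e.g. GNU awk):

$ time awk 'BEGIN { for (i=0; i<10^6; i++) n = split("ABCDEFGHIJ", a, "") }'
$ time awk 'BEGIN { for (i=0; i<10^6; i++) for (j=1; j<=10; j++) c = substr("ABCDEFGHIJ", j, 1) }'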

karakfa
  • You should mention that splitting on a null char is undefined behavior per POSIX, so it'll only work in some awks, e.g. GNU awk. – Ed Morton Feb 15 '19 at 15:59
  • @Tiw You misunderstood my comment. In `split($0,a,"")` the problem isn't with `$0` being a null string, it's with `""` being a null string. Using a null string as the "regexp" to split on is undefined behavior per POSIX. That's true whether it's the value of FS or the 3rd arg to split(). – Ed Morton Feb 15 '19 at 16:52
  • Oh, you mean that... I'm sorry, I did misunderstand. I tried it in FreeBSD's `awk` and it worked well, though. @EdMorton – Til Feb 15 '19 at 16:54
  • @Tiw Yes, some awks will do what you want but others won't. See [the POSIX awk spec](http://pubs.opengroup.org/onlinepubs/9699919799/) - when discussing FS: `If FS is a null string, the behavior is unspecified.` and later in the `split(...,fs)` synopsis: `The effect of a null string as the value of fs is unspecified.` – Ed Morton Feb 15 '19 at 16:58
  • Also added some timings to my answer. – kvantour Feb 15 '19 at 17:14