I am trying to find out the frequency of appearance of every letter of the English alphabet in an input file. How can I do this in a bash script?
5 Answers
My solution using grep, sort and uniq:
grep -o . file | sort | uniq -c
Ignore case:
grep -o . file | sort -f | uniq -ic
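For example, on a small test file containing just the word hello, the output looks something like this (the exact width of the count column depends on the uniq implementation):
$ printf 'hello\n' > file
$ grep -o . file | sort | uniq -c
      1 e
      1 h
      2 l
      1 o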

@SkypeMeSM to get frequency of each character, just divide by the total number of characters (which is given by `wc -c file`). – Antoine Pinsard Apr 17 '16 at 09:02
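A minimal sketch of that suggestion, assuming the input is named file (note that wc -c counts every byte, including spaces and newlines, so this gives each character's frequency relative to all characters, not just letters):
total=$(wc -c < file)
grep -o . file | sort | uniq -c | awk -v total="$total" '{printf "%s %.4f\n", $2, $1/total}'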
Just one awk command
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
If you want it to be case-insensitive, add tolower():
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
And if you want only letters:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.
If you do not want multibyte Unicode characters counted separately, run export LC_ALL=C first.
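Note that for (i in w) visits the letters in an unspecified order; if you want the most frequent letters first, pipe the output through sort (a small addition, not part of the original one-liner):
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file | sort -k2,2nr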

I am sorry I am not very familiar with awk. The solution works but I am getting all characters instead of just alphanumeric characters. awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++ sum++ } END{for(i in w) print i,w[i],w[i]/sum}' – SkypeMeSM Oct 19 '10 at 10:10
Thanks again. I am wondering why I get results like ü 2 and é 2, when the regex is [a-zA-Z]. – SkypeMeSM Oct 19 '10 at 10:21
A solution with sed, sort and uniq:
sed 's/\(.\)/\1\n/g' file | sort | uniq -c
This counts all characters, not only letters. You can filter out non-letters with:
sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c
If you want to consider uppercase and lowercase as the same, just add a translation:
sed 's/\(.\)/\1\n/g' file | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c

Thanks. This considers uppercase and lowercase characters as separate. How can I calculate the frequencies where we consider A and a as same? – SkypeMeSM Oct 19 '10 at 09:42
Yes this works great as well. I am wondering how can I calculate the probabilities i.e. frequency/total sum. We will need to pipe the output again to sed again but I cannot figure out the regex involved? – SkypeMeSM Oct 19 '10 at 11:22
You can add some `wc`, `cut`, `dc`, `tee` and other commands but it would be more juggling with plates than a maintainable work. I think that adding more features would be easier with a perl script. – mouviciel Oct 19 '10 at 11:43
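One way to get frequency/total without leaving the shell is to post-process the uniq -c output with a short awk step; this is only a sketch, not part of the original answer:
sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c |
  awk '{count[$2]=$1; total+=$1} END{for(l in count) printf "%s %d %.4f\n", l, count[l], count[l]/total}'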
Here is a suggestion:
while read -n 1 c
do
    echo "$c"
done < "$INPUT_FILE" | grep '[[:alpha:]]' | sort | uniq -c | sort -nr

Similar to mouviciel's answer above, but more portable to the Bourne and Korn shells used on BSD systems: when you don't have GNU sed, which supports \n in the replacement text, you can backslash-escape a literal newline instead:
sed -e's/./&\
/g' file | sort | uniq -c | sort -nr
Or, to avoid the visual split on the screen, insert a literal newline by typing CTRL+V CTRL+J:
sed -e's/./&\^J/g' file | sort | uniq -c | sort -nr
