0

I have a text file, and I would like to count the number of occurrences of each character in it.

Below is an example of what my file looks like:

#1=DBD?BFHH=FIIIHIIGIHGHHIIIIIIIIGG?CHIIIAGGGHIGHEEHB@BDBCEDDDDD@CCA>?A>@C>:<?CCDDDDD@CD@DCBD9?CCDCB@
#1=DDFFFHFDHHIIIIJJIGHJIJGIIIIEGHGHJJBFGFHEIEEG@FFHJ.=EHHHABDDDBCCECEEEEDCBDEDDDDDDDDCDD?B9B:A:@?CCCD

So the output would be:

E - 10, C - 20, (#) - 10, 3 - 9
etc etc...

I hope I was clear enough in what I want.

Thanks!

BMW
Sinan
  • I am somewhat new to awk and have spent quite some time reading up on it and searching for solutions to what I was looking for. There was more to it, but I got the first half on my own, and the question I posted was the second half. – Sinan Dec 07 '14 at 18:40

4 Answers

1
$ awk '{for (i=1; i<=NF; i++){a[$i]++}}END{for (i in a){print i, a[i]}}' FS= file
A 5
B 13
C 20
D 36
E 14
9 2
F 10
: 3
G 14
. 1
H 21
< 1
I 29
J 7
= 4
# 2
> 3
1 2
? 7
@ 8
Gilles Quénot
  • IMO, it's somewhat cleaner to put the `FS=""` in a `BEGIN` block. It avoids dealing with shell details. – D.Shawley Dec 07 '14 at 20:46
  • @D.Shawley you would be in the minority. If you are writing an awk script it makes sense, but on the command line all it does is add 8 characters to do the same thing. – Zombo Dec 07 '14 at 22:46
  • @StevenPenny fair enough. I also add newlines when I write commands on the command line so my "one liner" awk command lines tend to span multiple lines ;) – D.Shawley Dec 08 '14 at 01:42
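
As a sketch of the `BEGIN` variant discussed in these comments, the same count can be written with the field separator set inside the program. The function name `count_chars`, the `count` array, and the two-line sample input are illustrative only. The trailing `sort` is an addition to make the order predictable, since `for (ch in count)` visits keys in an unspecified order. This assumes an awk that splits a record into single characters when `FS` is empty (gawk, mawk and BusyBox awk do; POSIX leaves it unspecified).

```shell
#!/bin/sh
# Count every character in the given file(s), or in stdin when no file is given.
# Prints one "char count" pair per line.
count_chars() {
    awk '
        BEGIN { FS = "" }                           # empty FS: each character becomes a field
        { for (i = 1; i <= NF; i++) count[$i]++ }   # tally every field of every line
        END { for (ch in count) print ch, count[ch] }
    ' "$@" | LC_ALL=C sort                          # byte order, so the output is stable
}

# Example with a tiny stand-in for the real file
printf 'AAB\nB#\n' | count_chars
```

With that sample input, the function should print `# 1`, `A 2` and `B 2`, one pair per line.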
1

If you need to count the characters across all lines:

sed 's/\(.\)/\1\n/g' infile|sort |uniq -c |sort -n

      1 .
      1 <
      2
      2 #
      2 1
      2 9
      3 :
      3 >
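
Note that `\n` in the replacement part of `s///` is a GNU sed extension, and that the command also leaves one empty line per input line, which appears to be where the blank entry with count 2 comes from. A possible alternative, sketched here under the assumption that the file holds single-byte characters and no tabs, is to split the lines with `fold`. The function name `count_chars_fold` and the sample input are hypothetical.

```shell
#!/bin/sh
# Split each line into one-character pieces, then count identical pieces.
# Assumes single-byte characters and no tabs in the input.
count_chars_fold() {
    fold -w 1 "$@" |        # one character per output line
        LC_ALL=C sort |     # group identical characters together
        uniq -c |           # prefix each character with its count
        LC_ALL=C sort -n    # order by count, lowest first
}

# Example with a tiny stand-in for the real file
printf 'AAB\nB#\n' | count_chars_fold
```

The width of the count column printed by `uniq -c` differs between implementations, so scripts that parse this output should not rely on the exact padding.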

If you need to count the characters on each line:

awk -v FS="" '{delete a;for (i=1;i<=NF;i++) a[$i]++;for (i in a) printf "%s - %s, ",i,a[i];printf RS}' infile

A - 3, B - 7, C - 12, D - 17, E - 3, 9 - 1, F - 2, : - 1, G - 8, H - 10, < - 1, I - 18, = - 2, # - 1, > - 3, 1 - 1, ? - 5, @ - 6,
A - 2, B - 6, C - 8, D - 19, E - 11, 9 - 1, F - 8, : - 2, G - 6, . - 1, H - 11, I - 11, J - 7, = - 2, # - 1, 1 - 1, ? - 2, @ - 2,
BMW
0

Perl is very good for this kind of thing. Read the file as a single string, remove newlines, count the letters, output the results sorted by letter.

perl -0777 -nE 's/\n//g; $c{$_}++ for split //; say "$_ $c{$_}" for sort keys %c' file
# 2
. 1
1 2
9 2
: 3
< 1
= 4
> 3
? 7
@ 8
A 5
B 13
C 20
D 36
E 14
F 10
G 14
H 21
I 29
J 7
glenn jackman
0

GNU awk 4.1

awk -iwalkarray '{for (;NF;NF--) b[$NF]++} END {walk_array(b)}' FS=
[A] = 5
[B] = 13
[C] = 20
[D] = 36
[E] = 14
[F] = 10
[9] = 2
[G] = 14
[:] = 3
[.] = 1
[H] = 21
[I] = 29
[<] = 1
[J] = 7
[#] = 2
[=] = 4
[1] = 2
[>] = 3
[?] = 7
[@] = 8

If you have an earlier version of GNU awk, you can use `for (c in b) print c, b[c]` instead. I noticed that walk_array had never been used on Stack Overflow, so I did it for fun. I found my awk files at /usr/share/awk and /usr/lib/gawk.

awk save modifications in place

Community
Zombo
  • I was just about to ask about that. I cannot find any documentation about `walk_array` in the `gnu awk` manual. Can you point me in the right direction? I'd like to learn :) – Jotne Dec 07 '14 at 09:33
  • Can you try this and see if it works: `{while(--NF) z[$NF]++}`? – Jotne Dec 07 '14 at 09:45
  • @Jotne that cannot work because you will lose the last character on each line – Zombo Dec 07 '14 at 09:48
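
The difference between the two loops in this exchange can be checked with a small sketch. It uses the plain `for (c in b)` printing mentioned in the answer instead of `walk_array`, so it does not need GNU awk 4.1, but it still assumes an awk that splits on an empty `FS` and allows `NF` to be decremented. The function names `count_for` and `count_while` and the input `ABC` are illustrative only.

```shell
#!/bin/sh
# Loop from the answer: test NF, count the last field, then drop it.
count_for() {
    awk '{ for (; NF; NF--) b[$NF]++ } END { for (c in b) print c, b[c] }' FS= "$@" |
        LC_ALL=C sort
}

# Loop from the comment: NF is decremented before anything is counted,
# so the last character of each line is never seen.
count_while() {
    awk '{ while (--NF) z[$NF]++ } END { for (c in z) print c, z[c] }' FS= "$@" |
        LC_ALL=C sort
}

printf 'ABC\n' | count_for      # counts A, B and C
printf 'ABC\n' | count_while    # counts A and B only
```

An empty input line would drive `--NF` below zero in the second loop, which is another reason to prefer the first form.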