
This is my current code which prints out the frequency of each character in the input file.

from collections import defaultdict

counters = defaultdict(int)
with open("input.txt") as content_file:
    content = content_file.read()
    for char in content:
        counters[char] += 1

for letter in counters.keys():
    print letter, (round(counters[letter]*100.00/1234,3)) 

I want it to print the frequency of bigrams of letters only (aa, ab, ac ... zy, zz), ignoring punctuation. How can I do this?


3 Answers


You can build on your current code to handle pairs as well. Keep track of two characters instead of just one by adding another variable, and use a check to skip non-alphabetic characters.

from collections import defaultdict

counters = defaultdict(int)
paired_counters = defaultdict(int)
with open("input.txt") as content_file:
    content = content_file.read()
    prev = ''  # keeps track of the last seen character
    for char in content:
        counters[char] += 1
        if prev and (prev + char).isalpha():  # checks that both characters are letters
            paired_counters[prev + char] += 1
        prev = char  # current char becomes prev for the next iteration

for letter in counters.keys():  # you could instead iterate over key/value pairs with .items() in Python 3 or .iteritems() in Python 2
    print letter, round(counters[letter]*100.00/1234, 3)

for pairs, values in paired_counters.iteritems():  # use .items() in Python 3; I'm guessing this is Python 2
    print pairs, values

(Disclaimer: I do not have Python 2 on my system; if there is an issue in the code, let me know.)
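For anyone on Python 3, the same approach can be sketched like this (using an inline sample string in place of reading input.txt, so the snippet runs standalone):

```python
from collections import defaultdict

content = "ab, ba ab!"  # inline sample standing in for the contents of input.txt
counters = defaultdict(int)
paired_counters = defaultdict(int)

prev = ''  # last seen character
for char in content:
    counters[char] += 1
    if prev and (prev + char).isalpha():  # only count purely alphabetic pairs
        paired_counters[prev + char] += 1
    prev = char

print(dict(paired_counters))  # {'ab': 2, 'ba': 1}
```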


There is a more efficient way of counting bigraphs: with a Counter. Start by reading the text (assuming it is not too large):

from collections import Counter
with open("input.txt") as content_file:
    content = content_file.read()

Filter out non-letters:

letters = list(filter(str.isalpha, content))

You probably should convert all letters to lower case, too, but it's up to you. Note that letters is a list at this point, so lower each element rather than calling .lower() on the list itself:

letters = [letter.lower() for letter in letters]

Build a zip of the remaining letters with itself, shifted by one position, and count the bigraphs:

cntr = Counter(zip(letters, letters[1:]))

Normalize the dictionary:

total = sum(cntr.values())
{''.join(k): v / total for k, v in cntr.most_common()}
#{'ow': 0.1111111111111111, 'He': 0.05555555555555555...}

The solution can be easily generalized to trigraphs, etc., by changing the counter:

cntr = Counter(zip(letters, letters[1:], letters[2:]))
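The filtering and zipping steps can also be wrapped into a single helper (a sketch; the name ngraph_counts is made up here):

```python
from collections import Counter

def ngraph_counts(text, n):
    """Count n-graphs of letters only, case-folded."""
    letters = [ch.lower() for ch in text if ch.isalpha()]
    # zip the letter list against n-1 shifted copies of itself
    return Counter(zip(*(letters[i:] for i in range(n))))

print(ngraph_counts("Hello, world!", 2))
```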

If you're using nltk:

from nltk import ngrams
list(ngrams('hello', n=2))

[out]:

[('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]

To do a count:

from collections import Counter
Counter(list(ngrams('hello', n=2)))
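The same pairs and counts can be reproduced with just the standard library, in case nltk isn't available:

```python
from collections import Counter

text = 'hello'
bigrams = list(zip(text, text[1:]))  # same pairs as ngrams(text, n=2)
print(bigrams)           # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print(Counter(bigrams))
```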

If you want a python native solution, take a look at:
