I want to count the occurrences of every bigram (pair of adjacent words) in a file using Python. The files I am dealing with are very large, so I am looking for an efficient approach. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it was not efficient.
For example, say I want to count the bigrams in a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped "
For the above file, the bigrams and their counts would be:
(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did,not) = 1
(not,realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person,bumped) = 1
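For reference, the counts listed above can be reproduced with a minimal sketch like the one below. It assumes the whole text fits in memory and that a "word" is whatever `\w+` matches; the helper name `count_bigrams` is only illustrative and not part of any library.

```python
import re
from collections import Counter

def count_bigrams(text):
    """Count adjacent word pairs in text (illustrative helper)."""
    words = re.findall(r'\w+', text)
    # zip(words, words[1:]) pairs each word with the word that follows it
    return Counter(zip(words, words[1:]))

sample = "the quick person did not realize his speed and the quick person bumped "
bigram_counts = count_bigrams(sample)

# Print each bigram with its count, most frequent first
for (first, second), count in bigram_counts.most_common():
    print("(%s,%s) = %d" % (first, second, count))
```

Because the pairs overlap (each word appears once as the second element and once as the first), 13 words yield 12 bigrams in total.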
I came across an example that uses Counter objects in Python to count unigrams (single words). It also uses a regex approach.
The example goes like this:
>>> # Count the words in a.txt, most common first
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]
I was wondering whether the Counter object can be used to count bigrams. Any approach that does not rely on Counter or regex would also be appreciated.
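To make the large-file requirement concrete, here is a rough sketch of the kind of streaming approach that might work. It is only a sketch under stated assumptions: a line break is treated as ordinary whitespace, so a bigram may span two lines, and the names `iter_bigrams` and `count_bigrams_in_file` are hypothetical. The file is read line by line, so only the Counter of distinct bigrams has to fit in memory, not the text itself.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')

def iter_bigrams(lines):
    """Yield (previous_word, word) pairs from an iterable of lines."""
    previous = None  # last word seen, carried across line boundaries
    for line in lines:
        for match in WORD_RE.finditer(line):
            word = match.group()
            if previous is not None:
                yield (previous, word)
            previous = word

def count_bigrams_in_file(path):
    """Count bigrams in the file at path without reading it all at once."""
    with open(path) as handle:
        # A file object yields one line at a time
        return Counter(iter_bigrams(handle))

# Small demonstration on in-memory lines instead of a real file
lines = ["the quick person did not realize his speed\n",
         "and the quick person bumped \n"]
counts = Counter(iter_bigrams(lines))
print(counts.most_common(2))
```

For a real file, the call would be something like `count_bigrams_in_file('a.txt')`.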