
I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way to do this. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.

e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:

"the quick person did not realize his speed and the quick person bumped "

For the above file, the bigrams and their counts will be:

(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did, not) = 1
(not, realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person, bumped) = 1

I have come across an example of the Counter object in Python, which is used to count unigrams (single words). It also uses a regex approach.

The example goes like this:

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> Counter(words).most_common(10)

The output of the above code is:

[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
 ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]

I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than the Counter object or regex will also be appreciated.

swap310

6 Answers


Some itertools magic:

>>> import re
>>> from collections import Counter
>>> from itertools import islice, izip
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print Counter(izip(words, islice(words, 1, None)))

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
  ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
  ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
  ('realize', 'his'): 1})

Bonus

Get the frequency of any n-gram:

from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))  # take the next n items as a candidate n-gram
        if len(l) == n:
            yield l
            next(b)              # slide the window forward by one item
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
  ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
  ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
  ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
  ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

This works with lazy iterables and generators too, so you can write a generator that reads a file line by line, yields words, and pass it to ngrams to count n-grams lazily without reading the whole file into memory; see the sketch below.
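For example, here is a minimal sketch of such a lazy pipeline, assuming the file is a.txt and using a hypothetical helper named lazy_words; it reuses the ngrams function defined above:

import re
from collections import Counter

def lazy_words(path):
    # Yield words one at a time instead of reading the whole file into memory.
    with open(path) as f:
        for line in f:
            for word in re.findall(r"\w+", line):
                yield word

bigram_counts = Counter(ngrams(lazy_words('a.txt'), 2))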

Abhinav Sarkar
  • The itertools ngram function is great! However, if you need to perform additional text-analyses it might be worth checking out [TextBlob](https://textblob.readthedocs.io/en/dev/). It also has a TextBlob.ngrams() function which basically does the same thing. I have tested both the itertools and the TextBlob function and they seem to perform with comparable speed and results (a very minor advantage for the itertools function). – Montmons Mar 24 '17 at 11:47
  • Oops, I forgot to include counting the ngrams in my comparison, the TextBlob function does not do that by itself. I've tried to write a function with Counter for it, but overall that makes it a much slower option. So.. itertools wins. – Montmons Mar 24 '17 at 12:17
  • This is quite clever. FWIW what it does is the following: L1 is `words`, L2 is `islice(words, 1, None)` which slices the sentence into individual words beginning with the second word. `izip(words, islice(words, 1, None))` then zips L1 up with L2 so that "the" from L1 gets matched with "quick" from L2, "quick" from L1 gets matched with "person" from L2, etc. Counter then counts the pairs. And for Python3 you no longer need to import `izip`, just use `zip`. The answer below from @st0le actually does the same thing. – Casey L Mar 02 '19 at 20:13

How about zip()?

import re
from collections import Counter
words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words,words[1:])))
st0le

You can simply use Counter for any n_gram like so:

from collections import Counter
from nltk.util import ngrams 

text = "the quick person did not realize his speed and the quick person bumped "
n_gram = 2
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the'): 1,
         ('did', 'not'): 1,
         ('his', 'speed'): 1,
         ('not', 'realize'): 1,
         ('person', 'bumped'): 1,
         ('person', 'did'): 1,
         ('quick', 'person'): 2,
         ('realize', 'his'): 1,
         ('speed', 'and'): 1,
         ('the', 'quick'): 2})

For 3-grams, just change the n_gram to 3:

n_gram = 3
Counter(ngrams(text.split(), n_gram))
>>>
Counter({('and', 'the', 'quick'): 1,
         ('did', 'not', 'realize'): 1,
         ('his', 'speed', 'and'): 1,
         ('not', 'realize', 'his'): 1,
         ('person', 'did', 'not'): 1,
         ('quick', 'person', 'bumped'): 1,
         ('quick', 'person', 'did'): 1,
         ('realize', 'his', 'speed'): 1,
         ('speed', 'and', 'the'): 1,
         ('the', 'quick', 'person'): 2})
Kristada673
    this is fine but is missing an import - you need to add `from nltk.util import ngrams`. FWIW it appears to run a little faster than the accepted solution. – Casey L Mar 02 '19 at 20:07

Starting in Python 3.10, the new pairwise function provides a way to slide through pairs of consecutive elements, such that your use-case simply becomes:

from itertools import pairwise
import re
from collections import Counter

# text = "the quick person did not realize his speed and the quick person bumped "
Counter(pairwise(re.findall(r'\w+', text)))
# Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('realize', 'his'): 1, ('his', 'speed'): 1, ('speed', 'and'): 1, ('and', 'the'): 1, ('person', 'bumped'): 1})

Details for intermediate results:

re.findall(r'\w+', text)
# ['the', 'quick', 'person', 'did', 'not', 'realize', 'his', ...]
pairwise(re.findall('\w+', text))
# [('the', 'quick'), ('quick', 'person'), ('person', 'did'), ...]
Xavier Guihot

It has been a long time since this question was asked and successfully answered. I benefited from the responses to create my own solution, and I would like to share it:

    import regex
    bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)

This will provide all bigrams that are not interrupted by punctuation.
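To turn these matches into counts, the list can be passed to Counter, as in the other answers. A sketch reusing the myfile placeholder above; note that the keys are strings such as 'the quick' rather than tuples:

    import regex
    from collections import Counter

    bigrams_tst = regex.findall(r"\b\w+\s\w+", open(myfile).read(), overlapped=True)
    bigram_counts = Counter(bigrams_tst)  # e.g. Counter({'the quick': 2, ...})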

hurrial

One can use CountVectorizer from scikit-learn (pip install scikit-learn) to generate the bigrams (or, more generally, any n-gram).

Example (tested with Python 3.6.7 and scikit-learn 0.24.2).

import sklearn.feature_extraction.text

ngram_size = 2
train_set = ['the quick person did not realize his speed and the quick person bumped']

vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vectorizer.fit(train_set) # build ngram dictionary
ngram = vectorizer.transform(train_set) # get ngram
print('ngram: {0}\n'.format(ngram))
print('ngram.shape: {0}'.format(ngram.shape))
print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))

Output:

>>> print('ngram: {0}\n'.format(ngram)) # Shows the bi-gram count
ngram:   (0, 0) 1
  (0, 1)        1
  (0, 2)        1
  (0, 3)        1
  (0, 4)        1
  (0, 5)        1
  (0, 6)        2
  (0, 7)        1
  (0, 8)        1
  (0, 9)        2

>>> print('ngram.shape: {0}'.format(ngram.shape))
ngram.shape: (1, 10)
>>> print('vectorizer.vocabulary_: {0}'.format(vectorizer.vocabulary_))
vectorizer.vocabulary_: {'the quick': 9, 'quick person': 6, 'person did': 5, 'did not': 1, 
'not realize': 3, 'realize his': 7, 'his speed': 2, 'speed and': 8, 'and the': 0, 
'person bumped': 4}
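
If a plain mapping from each bigram to its count is more convenient than the sparse-matrix output above, the vocabulary and the counts can be combined; a small sketch building on the ngram and vectorizer objects from the example:

counts = ngram.toarray()[0]  # dense counts for the single training document
bigram_counts = {bigram: int(counts[idx]) for bigram, idx in vectorizer.vocabulary_.items()}
print(bigram_counts)
# {'the quick': 2, 'quick person': 2, 'person did': 1, ...}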
Franck Dernoncourt