
What is the best way to count the number of matches between a list and a string in Python?

For example, if I have this list:

list = ['one', 'two', 'three']

and this string:

line = "some one long. two phrase three and one again"

I want to get 4, because the string contains:

one 2 times
two 1 time
three 1 time

I tried the code below, based on the answers to this question. It worked, but I get an error when I add many words (about 4000) to the list:

import re
word_list = ['one', 'two', 'three']
line = "some one long. two phrase three and one again"
words_re = re.compile("|".join(word_list))
print(len(words_re.findall(line)))

This is my error:

words_re = re.compile("|".join(word_list))
  File "/usr/lib/python2.7/re.py", line 190, in compile
b24
  • I tried your list times a million with `re.compile("|".join(word_list * 1000000))` using Python 2.7.6 and I get no such error. The problem might be your word_list that needs `re.escape` for each word. – cr3 Dec 25 '15 at 16:04
  • Thanks for your attention. I used the .split() function to create my word list. Please give more detail about `re.escape` if possible. – b24 Dec 25 '15 at 16:07
  • The error that appears to be caused by the size of the list might actually be caused by a word in the 4000-word list that is not a valid regular expression. So each word should be escaped like this: `words_re = re.compile("|".join([re.escape(word) for word in word_list]))` – cr3 Dec 25 '15 at 16:08
  • @cr3 Your comment code worked. Please post it as an answer, and please compare the regex-based solution (your answer) with Malik Brahimi's answer. Thanks – b24 Dec 25 '15 at 16:35

1 Answer


If you want case-insensitive matching of whole words, ignoring punctuation, split the string, strip the punctuation from each word, and use a dict to store the counts of the words you care about:

lst = ['one', 'two', 'three']
from string import punctuation
cn = dict.fromkeys(lst, 0)
line = "some one long. two phrase three and one again"

for word in line.lower().split():
    word = word.strip(punctuation)
    if word in cn:
        cn[word] += 1


print(cn)

{'three': 1, 'two': 1, 'one': 2}

If you just want the total, use a set with the same logic:

from string import punctuation

st = {'one', 'two', 'three'}
line = "some one long. two phrase three and one again"

print(sum(word.strip(punctuation) in st for word in line.lower().split()))

This does a single pass over the words after they are split, and each set lookup is O(1) on average, so it is substantially more efficient than using list.count.
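For comparison, here is a rough sketch of the regex route suggested in the comments, with each word passed through `re.escape`. The function name `count_matches`, the `\b` word boundaries, and the `re.IGNORECASE` flag are my own additions to make it behave like the split-based version; they are not part of the original comment.

```python
import re

def count_matches(word_list, line):
    # Escape each word so characters like '(' or '*' are treated literally
    # instead of being parsed as regex syntax.
    pattern = "|".join(re.escape(word) for word in word_list)
    # \b ... \b limits matches to whole words; IGNORECASE plays the role
    # of line.lower() in the split-based version.
    words_re = re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    return len(words_re.findall(line))

word_list = ['one', 'two', 'three']
line = "some one long. two phrase three and one again"
print(count_matches(word_list, line))  # 4
```

One caveat with this sketch: `\b` only behaves as a whole-word boundary when the listed words start and end with a letter, digit, or underscore, so the set-based version is probably the safer choice if the list contains words with leading or trailing punctuation.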

Padraic Cunningham