Best way to count the number of matches between a list and a string in Python

Question

What is the best way to count the number of matches between a list and a string in Python?

For example, if I have this list:

list = ['one', 'two', 'three']

and this string:

line = "some one long. two phrase three and one again"

I want to get 4 because I have

one 2 times
two 1 time
three 1 time

I tried the code below, based on the answers to this question, and it worked, but I get an error if I add many words (4000 words) to the list:

import re
word_list = ['one', 'two', 'three']
line = "some one long. two phrase three and one again"
words_re = re.compile("|".join(word_list))
print(len(words_re.findall(line)))

This is my error:

words_re = re.compile("|".join(word_list))
  File "/usr/lib/python2.7/re.py", line 190, in compile

I tried your list times a million with `re.compile("|".join(word_list * 1000000))` using Python 2.7.6 and I get no such error. The problem might be your word_list that needs `re.escape` for each word. — cr3, Dec 25 '15 at 16:04
Thanks for your attention. I used the .split() function to create my word list. Please give more detail about `re.escape` if possible. — b24, Dec 25 '15 at 16:07
The error that appears to be caused by the size of the list might actually be caused by a word in the 4000-word list that is not a valid regular expression. So each word should be escaped like this: `words_re = re.compile("|".join([re.escape(word) for word in word_list]))` — cr3, Dec 25 '15 at 16:08
@cr3 your comment code worked. Please post it as an answer, and please compare the regex-based solution (your answer) with Malik Brahimi's answer. Thanks — b24, Dec 25 '15 at 16:35
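
For readers following the comments above, here is a minimal sketch of the `re.escape` fix. It assumes Python 3; the function name `count_regex_matches` and the extra words `'c++'` and `'(maybe'` are made up for illustration and stand in for whatever entries in the 4000-word list contain regex metacharacters. Without escaping, either of those two words would make `re.compile` raise an error.

```python
import re

def count_regex_matches(word_list, line):
    # Hypothetical helper: escape each word so characters such as
    # "+" or "(" are matched literally instead of as regex syntax.
    pattern = re.compile("|".join(re.escape(word) for word in word_list))
    return len(pattern.findall(line))

# 'c++' and '(maybe' are illustrative words that are invalid as raw regex.
word_list = ['one', 'two', 'three', 'c++', '(maybe']
line = "some one long. two phrase three and one again"
print(count_regex_matches(word_list, line))  # 4
```

Note that this pattern matches substrings, so a word such as "phone" would also count as a match for "one".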

Padraic Cunningham · Answer 1 · 2015-12-25T16:45:15.320

If you want case-insensitive matching of whole words, ignoring punctuation, split the string and strip the punctuation from each word, using a dict to store the counts of the words you care about:

from string import punctuation

lst = ['one', 'two', 'three']
cn = dict.fromkeys(lst, 0)  # one counter per word, starting at 0
line = "some one long. two phrase three and one again"

for word in line.lower().split():
    word = word.strip(punctuation)  # "long." -> "long"
    if word in cn:
        cn[word] += 1


print(cn)

{'three': 1, 'two': 1, 'one': 2}

If you just want the total, use a set with the same logic:

from string import punctuation

st = {'one', 'two', 'three'}
line = "some one long. two phrase three and one again"

print(sum(word.strip(punctuation) in st for word in line.lower().split()))

This does a single pass over the words after they are split. The set lookup is O(1) on average, so it is substantially more efficient than list.count.
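
As a rough sketch, the set approach above could be wrapped in a reusable function. The name `count_whole_words` is hypothetical, and lowercasing the word list is an added assumption so that mixed-case entries still match. Unlike a regex alternation, this counts whole words only, so "phone" does not count as "one".

```python
from string import punctuation

def count_whole_words(words, line):
    # Hypothetical helper: build the set once; each membership test
    # is then O(1) on average.
    wanted = {w.lower() for w in words}
    # Single pass over the tokens; True counts as 1 in sum().
    return sum(token.strip(punctuation) in wanted
               for token in line.lower().split())

line = "some one long. two phrase three and one again"
print(count_whole_words(['one', 'two', 'three'], line))  # 4
print(count_whole_words(['one'], "Phone one ONE!"))      # 2
```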
