
I'm trying to write a function that finds the number of occurrences of (whole) word(s), case-insensitive, in a text.

Example:

>>> text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

assert( 2 == Occs("Antoine", text) )
assert( 2 == Occs("ANTOINE", text) )
assert( 0 == Occs("antoin", text) )
assert( 1 == Occs("true", text) )    
assert( 0 == Occs("connor", text) )
assert( 1 == Occs("you like Python", text) )
assert( 1 == Occs("Name", text) )

Here is a basic attempt:

def Occs(word,text):
    return text.lower().count(word.lower())

This one doesn't work because it matches substrings rather than whole words.
The function must be fast; the text can be very big.
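For instance, the substring count matches inside larger words:

```python
def Occs(word, text):
    # Naive substring count: case-folds both sides,
    # but has no notion of word boundaries.
    return text.lower().count(word.lower())

# "antoin" is found inside "Antoine", so the result is 1
# instead of the expected 0 for whole-word matching:
print(Occs("antoin", "Antoine is my name"))  # prints 1
```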

Should I split the text into an array?
Is there a simple way to write this function?

Edit (python 2.3.4)

gary
Pierre de LESPINAY

5 Answers

from collections import Counter
import re

Counter(re.findall(r"\w+", text))

or, for the case-insensitive version

Counter(w.lower() for w in re.findall(r"\w+", text))

In Python < 2.7, use defaultdict instead of Counter:

from collections import defaultdict

freq = defaultdict(int)
for w in re.findall(r"\w+", text):
    freq[w.lower()] += 1
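For the original single-word Occs, the same frequency table gives the count directly (a minimal sketch; note it only handles single words, not phrases like "you like Python"):

```python
import re
from collections import defaultdict

def Occs(word, text):
    # Build a lowercase word-frequency table, then look the word up;
    # defaultdict returns 0 for words that never occur.
    freq = defaultdict(int)
    for w in re.findall(r"\w+", text):
        freq[w.lower()] += 1
    return freq[word.lower()]
```

On the question's example this gives Occs("ANTOINE", text) == 2 and Occs("antoin", text) == 0.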
Fred Foo
  • For a case-insensitive version why not just use the `re.IGNORECASE` flag? http://docs.python.org/library/re.html#re.IGNORECASE – David Webb Jan 05 '12 at 13:12
  • @DaveWebb: `IGNORECASE` will ignore case during matching, but will not lowercase the results of `findall`. – Fred Foo Jan 05 '12 at 13:14
  • The question is asking for a count of a specific word rather than for all words; I guess in that case `IGNORECASE` makes more sense. – David Webb Jan 05 '12 at 13:26
  • @DaveWebb: since counting all words can be done in O(n) time in a single line of code, there's hardly a need. – Fred Foo Jan 05 '12 at 13:48
  • This is the best answer but I can't use just "\w+" because I should be able to match also multiple words (see 2nd edit). Also lower() will slow the function I think – Pierre de LESPINAY Jan 05 '12 at 16:13
2

Here is a non-pythonic way - I'm assuming this is a homework question anyway...

def count(word, text):
    result = 0
    text = text.lower()
    word = word.lower()
    index = text.find(word, 0)
    while index >= 0:
        result += 1
        index = text.find(word, index + 1)
    return result

Of course, for really large files, this is going to be slow mainly due to the text.lower() invocation. But you can always come up with a case-insensitive find and fix that!

Why did I do it this way? Because I think it captures what you are trying to do best: Go through text, counting how many times you find word in it.

Also, this method sidesteps some nasty issues with punctuation: split will leave punctuation attached to the words, and then you won't match them, will you?

Daren Thomas
  • Can this match `NumberOfOccurencesOfWordInText("antoin",text)` ? It shouldn't. Anyway +1 for the lower() performance issue. – Pierre de LESPINAY Jan 05 '12 at 16:17
  • @Glide right, my bad. Nevertheless, the technique will work; you just need to check the matches (beginning and end) for word boundaries. There is no simple way to do this. You'll just have to scan the text. Consider building a specialized scanner at runtime to zip through your text checking for word. Something like `grep`. – Daren Thomas Jan 06 '12 at 08:27
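The boundary check suggested in that comment might look like this (a rough sketch: a find() hit counts only when the characters on both sides are not letters):

```python
def count_whole(word, text):
    # Case-insensitive scan with str.find(), counting a hit only when
    # it is not embedded inside a larger word.
    text = text.lower()
    word = word.lower()
    result = 0
    index = text.find(word)
    while index >= 0:
        before = text[index - 1] if index > 0 else " "
        end = index + len(word)
        after = text[end] if end < len(text) else " "
        if not before.isalpha() and not after.isalpha():
            result += 1
        index = text.find(word, index + 1)
    return result
```

This still lowercases the whole text up front, so the performance caveat above applies.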

Thank you for your help.
Here is my solution:

import re

starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the number of occurrences of whole word(s) (case insensitive) in a text"""
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte\
              + word\
              + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))
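To see the apostrophe guards in action, the edge cases from the example text can be checked directly (the definitions above are repeated so the snippet runs standalone; note that findall returns one group tuple per match, so len still counts matches):

```python
import re

starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the number of occurrences of whole word(s) (case insensitive) in a text"""
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte\
              + word\
              + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))

# "true" sits between plain word boundaries; "connor" is
# shielded by the single apostrophe in "O'connor".
print(NumberOfOccurencesOfWordInText("true", "Yes it is true"))              # 1
print(NumberOfOccurencesOfWordInText("connor", "his name is John O'connor")) # 0
```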
Pierre de LESPINAY
  • works for me and lets 'word' have quotes and spaces. Have you found any other solution? Isn't that regex too expensive? – olanod Jun 25 '12 at 18:26

I was given exactly the same problem to solve, so I searched around a lot. That's why I thought I'd share my solution here. Though my solution takes a while to execute, I guess its internal processing time is a little better than findall's. I might be wrong. Anyway, here goes the solution:

def CountOccurencesInText(word, text):
    """Number of occurrences of word (case insensitive) in text"""

    acceptedChar = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '-', ' ')

    # Sentence-level punctuation becomes a space; the rest is stripped outright.
    for x in ",!?;_\n«»():\".":
        if x in "\n«»()\":.":
            text = text.replace(x, " ")
        else:
            text = text.replace(x, "")

    # This specifically handles the input "I am attaching my c.v. to this e-mail."
    if len(word) == 32:
        word = word.replace(".", " ")

    punc_Removed_Text = ""
    text = text.lower()

    for i in range(len(text)):
        if text[i] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        # This specifically handles the input "Do I have to take that as a 'yes'"
        elif text[i] == "'" and text[i-1] == 's':
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] in acceptedChar and text[i+1] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] == " " and text[i+1] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] in acceptedChar and text[i+1] == " ":
            punc_Removed_Text = punc_Removed_Text + text[i]

    frequency = 0
    splitedText = punc_Removed_Text.split(word.lower())

    for y in range(len(splitedText) - 1):
        element = splitedText[y]

        if len(element) == 0:
            # A piece can be empty on both sides when the word sits at an edge.
            if len(splitedText[y+1]) == 0 or splitedText[y+1][0] == " ":
                frequency += 1

        elif len(splitedText[y+1]) == 0:
            if element[-1] == " ":
                frequency += 1

        elif element[-1] == " " and splitedText[y+1][0] == " ":
            frequency += 1

    return frequency

And here is the profile:

128006 function calls in 7.831 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.831    7.831 :0(exec)
    32800    0.062    0.000    0.062    0.000 :0(len)
    11200    0.047    0.000    0.047    0.000 :0(lower)
        1    0.000    0.000    0.000    0.000 :0(print)
    72800    0.359    0.000    0.359    0.000 :0(replace)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     5600    0.078    0.000    0.078    0.000 :0(split)
        1    0.000    0.000    7.831    7.831 <string>:1(<module>)
        1    0.000    0.000    7.831    7.831 ideone-gg.py:225(doit)
     5600    7.285    0.001    7.831    0.001 ideone-gg.py:3(CountOccurencesInText)
        1    0.000    0.000    7.831    7.831 profile:0(doit())
        0    0.000             0.000          profile:0(profiler)
Abu Shumon

See this question.

One realization is that if your file is line-oriented, then reading it line by line and using a plain split() on each line won't be very expensive. This of course assumes that words don't somehow span line breaks (no hyphenation).
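A rough sketch of that line-oriented idea (assuming whitespace-delimited, punctuation-free words; a real version would still need the boundary handling discussed above):

```python
def count_word_in_lines(word, lines):
    # Stream line by line so a huge file never has to sit in memory at once;
    # `lines` can be an open file object or any iterable of strings.
    word = word.lower()
    total = 0
    for line in lines:
        total += sum(1 for w in line.lower().split() if w == word)
    return total
```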

unwind