
I'm trying to write a function that finds the number of occurrences of (whole) word(s), case-insensitive, in a text.

Example:

>>> text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

assert( 2 == Occs("Antoine", text) )
assert( 2 == Occs("ANTOINE", text) )
assert( 0 == Occs("antoin", text) )
assert( 1 == Occs("true", text) )    
assert( 0 == Occs("connor", text) )
assert( 1 == Occs("you like Python", text) )
assert( 1 == Occs("Name", text) )

Here is a basic attempt:

def Occs(word,text):
    return text.lower().count(word.lower())

This one doesn't work because it matches substrings rather than whole words.
The function must be fast; the text can be very big.
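For instance, the substring count matches inside larger words:

```python
def Occs(word, text):
    # Naive substring count: case-folds both sides,
    # but has no notion of word boundaries.
    return text.lower().count(word.lower())

# "antoin" is found inside "Antoine", so the result is 1
# instead of the expected 0 for whole-word matching:
print(Occs("antoin", "Antoine is my name"))  # prints 1
```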

Should I split the text into an array?
Is there a simple way to write this function?

Edit (python 2.3.4)

gary
Pierre de LESPINAY

5 Answers

from collections import Counter
import re

Counter(re.findall(r"\w+", text))

or, for the case-insensitive version

Counter(w.lower() for w in re.findall(r"\w+", text))

In Python < 2.7, use defaultdict instead of Counter:

from collections import defaultdict

freq = defaultdict(int)
for w in re.findall(r"\w+", text):
    freq[w.lower()] += 1
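For the original single-word Occs, the same frequency table gives the count directly (a minimal sketch; note it only handles single words, not phrases like "you like Python"):

```python
import re
from collections import defaultdict

def Occs(word, text):
    # Build a lowercase word-frequency table, then look the word up;
    # defaultdict returns 0 for words that never occur.
    freq = defaultdict(int)
    for w in re.findall(r"\w+", text):
        freq[w.lower()] += 1
    return freq[word.lower()]
```

On the question's example this gives Occs("ANTOINE", text) == 2 and Occs("antoin", text) == 0.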
Fred Foo
  • For a case-insensitive version why not just use the `re.IGNORECASE` flag? http://docs.python.org/library/re.html#re.IGNORECASE – David Webb Jan 05 '12 at 13:12
  • @DaveWebb: `IGNORECASE` will ignore case during matching, but will not lowercase the results of `findall`. – Fred Foo Jan 05 '12 at 13:14
  • The question is asking for a count of a specific word rather than for all words; I guess in that case `IGNORECASE` makes more sense. – David Webb Jan 05 '12 at 13:26
  • @DaveWebb: since counting all words can be done in O(n) time in a single line of code, there's hardly a need. – Fred Foo Jan 05 '12 at 13:48
  • This is the best answer but I can't use just "\w+" because I should be able to match also multiple words (see 2nd edit). Also lower() will slow the function I think – Pierre de LESPINAY Jan 05 '12 at 16:13
2

Here is a non-pythonic way - I'm assuming this is a homework question anyway...

def count(word, text):
    result = 0
    text = text.lower()
    word = word.lower()
    index = text.find(word, 0)
    while index >= 0:
        result += 1
        index = text.find(word, index + 1)
    return result

Of course, for really large files, this is going to be slow mainly due to the text.lower() invocation. But you can always come up with a case-insensitive find and fix that!

Why did I do it this way? Because I think it captures what you are trying to do best: Go through text, counting how many times you find word in it.

Also, this method sidesteps some nasty issues with punctuation: split will leave punctuation attached to the words, and then you won't match them, will you?

Daren Thomas
  • Can this match `NumberOfOccurencesOfWordInText("antoin",text)` ? It shouldn't. Anyway +1 for the lower() performance issue. – Pierre de LESPINAY Jan 05 '12 at 16:17
  • @Glide right, my bad. Nevertheless, the technique will work; you just need to check the matches (beginning and end) for word boundaries. There is no simple way to do this. You'll just have to scan the text. Consider building a specialized scanner at runtime to zip through your text checking for word. Something like `grep`. – Daren Thomas Jan 06 '12 at 08:27
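The boundary check suggested in that comment might look like this (a rough sketch: a find() hit counts only when the characters on both sides are not letters):

```python
def count_whole(word, text):
    # Case-insensitive scan with str.find(), counting a hit only when
    # it is not embedded inside a larger word.
    text = text.lower()
    word = word.lower()
    result = 0
    index = text.find(word)
    while index >= 0:
        before = text[index - 1] if index > 0 else " "
        end = index + len(word)
        after = text[end] if end < len(text) else " "
        if not before.isalpha() and not after.isalpha():
            result += 1
        index = text.find(word, index + 1)
    return result
```

This still lowercases the whole text up front, so the performance caveat above applies.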

Thank you for your help.
Here is my solution:

import re

starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the number of occurrences of whole word(s) (case insensitive) in a text"""
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte\
              + word\
              + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))
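To see the apostrophe guards in action, the edge cases from the example text can be checked directly (the definitions above are repeated so the snippet runs standalone; note that findall returns one group tuple per match, so len still counts matches):

```python
import re

starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the number of occurrences of whole word(s) (case insensitive) in a text"""
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte\
              + word\
              + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))

# "true" sits between plain word boundaries; "connor" is
# shielded by the single apostrophe in "O'connor".
print(NumberOfOccurencesOfWordInText("true", "Yes it is true"))              # 1
print(NumberOfOccurencesOfWordInText("connor", "his name is John O'connor")) # 0
```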
Pierre de LESPINAY
  • works for me and lets 'word' have quotes and spaces. Have you found any other solution? Isn't that regex too expensive? – olanod Jun 25 '12 at 18:26

I was given exactly the same problem to solve, so I searched around a lot. That's why I thought I'd share my solution here. Though my solution takes a while to execute, I guess its internal processing time is a little better than findall's. I might be wrong. Anyway, here goes the solution:

def CountOccurencesInText(word, text):
    """Number of occurrences of word (case insensitive) in text"""

    acceptedChar = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '-', ' ')

    # Sentence-level punctuation becomes a space; the rest is stripped outright.
    for x in ",!?;_\n«»():\".":
        if x in "\n«»()\":.":
            text = text.replace(x, " ")
        else:
            text = text.replace(x, "")

    # This specifically handles the input "I am attaching my c.v. to this e-mail."
    if len(word) == 32:
        word = word.replace(".", " ")

    punc_Removed_Text = ""
    text = text.lower()

    for i in range(len(text)):
        if text[i] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        # This specifically handles the input "Do I have to take that as a 'yes'"
        elif text[i] == "'" and text[i-1] == 's':
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] in acceptedChar and text[i+1] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] == " " and text[i+1] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == "'" and text[i-1] in acceptedChar and text[i+1] == " ":
            punc_Removed_Text = punc_Removed_Text + text[i]

    frequency = 0
    splitedText = punc_Removed_Text.split(word.lower())

    for y in range(len(splitedText) - 1):
        element = splitedText[y]

        if len(element) == 0:
            # A piece can be empty on both sides when the word sits at an edge.
            if len(splitedText[y+1]) == 0 or splitedText[y+1][0] == " ":
                frequency += 1

        elif len(splitedText[y+1]) == 0:
            if element[-1] == " ":
                frequency += 1

        elif element[-1] == " " and splitedText[y+1][0] == " ":
            frequency += 1

    return frequency

And here is the profile:

128006 function calls in 7.831 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.831    7.831 :0(exec)
    32800    0.062    0.000    0.062    0.000 :0(len)
    11200    0.047    0.000    0.047    0.000 :0(lower)
        1    0.000    0.000    0.000    0.000 :0(print)
    72800    0.359    0.000    0.359    0.000 :0(replace)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     5600    0.078    0.000    0.078    0.000 :0(split)
        1    0.000    0.000    7.831    7.831 <string>:1(<module>)
        1    0.000    0.000    7.831    7.831 ideone-gg.py:225(doit)
     5600    7.285    0.001    7.831    0.001 ideone-gg.py:3(CountOccurencesInText)
        1    0.000    0.000    7.831    7.831 profile:0(doit())
        0    0.000             0.000          profile:0(profiler)
Abu Shumon

See this question.

One realization is that if your file is line-oriented, then reading it line by line and using a plain split() on each line won't be very expensive. This of course assumes that words don't somehow span line breaks (no hyphenation).
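A rough sketch of that line-oriented idea (assuming whitespace-delimited, punctuation-free words; a real version would still need the boundary handling discussed above):

```python
def count_word_in_lines(word, lines):
    # Stream line by line so a huge file never has to sit in memory at once;
    # `lines` can be an open file object or any iterable of strings.
    word = word.lower()
    total = 0
    for line in lines:
        total += sum(1 for w in line.lower().split() if w == word)
    return total
```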

unwind