
This is the simple version of my code.

    for i in range(len(holdList)):
        foundTerm = re.findall(r"\b" + self._searchTerm +
            r"\b", holdList[i][5], flags=re.IGNORECASE)
        # count the occurrence
        storyLen = len(foundTerm)
        holdList[i] += (storyLen,)
        if foundTerm:
            # Stores each found word as a list of strings
            # etc
            holdList[i] += (self.sentences_to_quote(holdList[i][5]), )

The last line of the loop calls a separate method that goes through each sentence and returns the first one containing the search term. holdList comes from a MySQL query; each entry is a tuple.

    def sentences_to_quote(self, chapter):
        """
        Separates the chapter into sentences
        Returns the first sentence that contains the word, with the word highlighted
        """

        # Separate the chapter into sentences
        searchSentences = sent_tokenize.tokenize(chapter, realign_boundaries=True)
        findIt = r"\b" + self._searchTerm + r"\b"
        for word in searchSentences:
            regex = (re.sub(findIt,
                "**" + self._searchTerm.upper() + "**",
                word, flags=re.IGNORECASE))
            # A changed string means the term was found in this sentence
            if regex != word:
                return regex

What can I do to speed this up? Is there anything I can do? The program is going through 10MB of text. Through profiling I found these two areas to be the bottleneck. I hope I provided enough info to make it clear.

Display_Here
  • is tokenization fast and/or each row in `holdList` contains different chapter? if it's slow and you do it for the same chapter over and over, have a look at [memoization](http://stackoverflow.com/questions/1988804/what-is-memoization-and-how-can-i-use-it-in-python) techniques.. – Aprillion May 26 '14 at 10:19
  • This is too vague, your performance problems could be in using `+` for string concatenation, using uncompiled regular expressions, for using too many attribute lookups (`.`) in a loop and so on... How did you profile this and how exactly do you know that its the regexp that is slow? – Davor Lucic May 26 '14 at 10:26
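
The memoization idea from the first comment can be sketched as follows. This is a minimal illustration, assuming the same chapter text is tokenized more than once; `split_sentences` and its regex are hypothetical stand-ins for the NLTK tokenizer used in the question, not part of the original code.

```python
from functools import lru_cache
import re

# Hypothetical stand-in for the sentence tokenizer used in the question:
# split after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


@lru_cache(maxsize=None)
def split_sentences(chapter):
    """Split a chapter into sentences; the result is cached per chapter text."""
    return tuple(s for s in _SENTENCE_END.split(chapter) if s)


chapter = "The cat sat. The dog ran! Did the cat return?"
first = split_sentences(chapter)   # computed
second = split_sentences(chapter)  # served from the cache
```

Caching only pays off if identical chapters really do recur; if every row holds a different chapter, it adds memory use without saving time.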

2 Answers


I'm not sure whether your self._searchTerm will consist of phrases or single words, but in general you will get much better results from using sets and dicts rather than regex. You don't need the regex machinery in this case, since all you want is to count/match complete words. To check whether a sentence contains a certain word, for example, you can replace the regex with:

    search_sentence = set(sent_tokenize.tokenize(...))
    if self._search_term in search_sentence:
        # yay

(I made your code PEP8 compliant.)

If you're worried about capitalization then convert everything to lower case:

    self._search_term = self._search_term.lower()
    search_sentence = set(word.lower() for word in sent_tokenize.tokenize(...))
    if self._search_term in search_sentence:
        # yay
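
For the membership test to work, the set has to hold individual words rather than whole sentences. A minimal self-contained sketch of that idea, assuming a simple `\w+` regex as a stand-in for a real word tokenizer (the names `WORD` and `contains_word` are illustrative, not from the original code):

```python
import re

# Simple stand-in for a real word tokenizer (an assumption for this sketch).
WORD = re.compile(r"\w+")


def contains_word(sentence, term):
    """Return True if `term` occurs as a whole word in `sentence`, ignoring case."""
    words = {w.lower() for w in WORD.findall(sentence)}
    return term.lower() in words


sentences = ["The Whale surfaced.", "Nothing happened.", "Whales are mammals."]
matches = [s for s in sentences if contains_word(s, "whale")]
```

Only the first sentence matches here: "Whales" is a different word from "whale", which mirrors what the `\b...\b` regex in the question would do.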

You can also count occurrences of words using a collections.Counter or a collections.defaultdict(int).
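
A short sketch of the counting approach, assuming single-word search terms and, again, a simple `\w+` regex in place of a real tokenizer (`count_words` is a hypothetical helper name):

```python
from collections import Counter
import re


def count_words(text):
    """Count case-insensitive occurrences of every word in `text`."""
    return Counter(w.lower() for w in re.findall(r"\w+", text))


counts = count_words("The whale saw the other whale. Whale!")
whale_count = counts["whale"]      # number of occurrences of the search term
missing_count = counts["missing"]  # a Counter returns 0 for absent keys
```

Building the counter once per chapter lets you look up any number of search terms afterwards without rescanning the text.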

If you must use regex because you want to match words that follow a specific pattern rather than matching entire words then I suggest you compile the pattern once and then pass that pattern to the other methods, e.g.,

    self.search_pattern = re.compile(r"\b{term}\b".format(term=self._search_term), re.I)
    found_term = self.search_pattern.findall(hold_list[i][5])
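
If the search term is meant literally rather than as a pattern, it may also be worth escaping it before compiling. A minimal sketch under that assumption (`build_pattern` is a hypothetical helper, not part of the original code):

```python
import re


def build_pattern(term):
    """Compile a whole-word, case-insensitive pattern for a literal term."""
    # re.escape keeps characters such as "." from acting as regex operators.
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


pattern = build_pattern("mr. smith")  # compile once, outside the loop
text = "Mr. Smith met Mrs Smith and mr. smith."
found = pattern.findall(text)
count = len(found)
```

Without the escaping step, the "." would match any character and "Mrs Smith" would be counted as well.
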
Midnighter
  • +1 for replacing `re.sub` by `in` set operator, note this might be faster - `search_sentence = set(sent_tokenize.tokenize(chapter.lower(), ...))` – Aprillion May 26 '14 at 10:23
  • Ah, thank you for answering the obvious; I had wondered how to use re.compile before assignment. Will sets and dicts still work, though, if the searchTerm is a phrase? Searches are not limited in character count or number of words. – Display_Here May 26 '14 at 11:04
  • Sets and dictionaries work for phrases too, since any string can be a dictionary key or set element, it's just not as straightforward to search for them. If you are searching for a three-word phrase, for example, you might have to build your set from overlapping three-word phrases in order to check whether the set contains the search phrase. There are probably smarter ways to do that sort of search. – Midnighter May 26 '14 at 11:24
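
The overlapping-phrase idea from the last comment can be sketched like this. It is only an illustration under stated assumptions: words are extracted with a simple `\w+` regex, and `ngram_set` and `contains_phrase` are hypothetical helper names.

```python
import re


def ngram_set(text, n):
    """Return the set of all overlapping n-word phrases in `text`, lower-cased."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def contains_phrase(text, phrase):
    """Check whether `phrase` occurs in `text` as a run of whole words."""
    phrase_words = [w.lower() for w in re.findall(r"\w+", phrase)]
    if not phrase_words:
        return False
    return " ".join(phrase_words) in ngram_set(text, len(phrase_words))


text = "The great white whale swam away."
has_phrase = contains_phrase(text, "White Whale")
has_reversed = contains_phrase(text, "whale white")
```

The n-gram set has to be rebuilt for each phrase length, so for arbitrary-length phrases a compiled regex may well be simpler; which is faster would need measuring on the real data.
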

re.sub replaces every match of the regex in the string. Your task here is only to find out whether a match exists, so using re.search instead should give you a performance boost; re.search returns the first match.
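
One way to combine this with the highlighting the question needs is to test each sentence with a compiled pattern and run the substitution only on the sentence that matches. A minimal sketch, assuming a literal search term and an already tokenized list of sentences (`first_highlighted_sentence` is a hypothetical name):

```python
import re


def first_highlighted_sentence(sentences, term):
    """Return the first sentence containing `term`, with the term highlighted."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    highlighted = "**" + term.upper() + "**"
    for sentence in sentences:
        # search() stops at the first match and builds no new string
        if pattern.search(sentence):
            # A function replacement avoids backslash handling in `highlighted`
            return pattern.sub(lambda match: highlighted, sentence)
    return None


sentences = ["No match here.", "The whale and the Whale.", "whale again"]
quote = first_highlighted_sentence(sentences, "whale")
```

Whether this is measurably faster depends on the data; the comments below report no difference for the original case, so it is worth profiling before and after.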

DhruvPathak
  • I see, would it be that noticeable though? Right now I do it that way so that the returned sentence has the word bolded, but of course if it's a big difference then it's worth it. – Display_Here May 26 '14 at 09:10
  • Sorry, stupid of me to ask. The performance was the same though. – Display_Here May 26 '14 at 09:18