This is a simplified version of my code:
for i in range(len(holdList)):
    foundTerm = re.findall(r"\b" + self._searchTerm + r"\b",
                           holdList[i][5], flags=re.IGNORECASE)
    # Count the occurrences of the search term
    storyLen = len(foundTerm)
    holdList[i] += (storyLen,)
    if foundTerm:
        # Append the first sentence that contains the term
        holdList[i] += (self.sentences_to_quote(holdList[i][5]),)
In the last line of the loop I call a separate method that looks through each sentence and returns the first sentence containing the word. holdList comes from a MySQL query (a list of row tuples).
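For context, here is a minimal standalone sketch of what that loop does. The function name, the stand-in `quote_func` callback, and the pattern being compiled once up front are my additions for illustration, not part of the original code:

```python
import re

def count_and_quote(hold_list, search_term, quote_func):
    # Hypothetical standalone version of the loop above.
    # Compiling the pattern once avoids re-parsing it on every iteration.
    pattern = re.compile(r"\b" + re.escape(search_term) + r"\b", re.IGNORECASE)
    for i, row in enumerate(hold_list):
        found = pattern.findall(row[5])   # row[5] holds the chapter text
        row += (len(found),)              # append the occurrence count
        if found:
            row += (quote_func(row[5]),)  # append the quoted sentence
        hold_list[i] = row
    return hold_list
```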
def sentences_to_quote(self, chapter):
    """
    Separates the chapter into sentences.
    Returns the first sentence that contains the search term,
    with the term highlighted.
    """
    # Separate the chapter into sentences
    searchSentences = sent_tokenize.tokenize(chapter, realign_boundaries=True)
    findIt = r"\b" + self._searchTerm + r"\b"
    for word in searchSentences:
        regex = re.sub(findIt,
                       "**" + self._searchTerm.upper() + "**",
                       word, flags=re.IGNORECASE)
        if regex != word:
            return regex
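To make the intent of that method clearer, here is a sketch of an equivalent helper that tests each sentence with `pattern.search` instead of running `re.sub` and comparing the result. The function name and the precompiled-pattern approach are my assumptions for illustration:

```python
import re

def first_sentence_with_term(sentences, search_term):
    # Hypothetical helper: scan pre-tokenized sentences and highlight
    # the term only in the first sentence that actually contains it.
    pattern = re.compile(r"\b" + re.escape(search_term) + r"\b", re.IGNORECASE)
    for sentence in sentences:
        if pattern.search(sentence):
            return pattern.sub("**" + search_term.upper() + "**", sentence)
    return None
```

This avoids building a substituted copy of every non-matching sentence just to compare it with the original.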
What can I do to speed this up? Is there anything I can do? The program is going through about 10 MB of text, and profiling shows these two areas are the bottlenecks. I hope I have provided enough information to make the problem clear.