Using pandas in Python 2.7, I am trying to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. The first part works with the following code:
    for row in df_book.itertuples():
        index, text = row
        # Strip punctuation, lowercase, and trim before matching
        normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()
        for row in df_phrase.itertuples():
            index, phrase = row
            count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
            file.write("%s," % (count))
However, I don't want to count the phrase if it's preceded by a different phrase (e.g., "it is not"). Therefore I used a negative lookbehind assertion:
        for row in df_phrase.itertuples():
            index, phrase = row
            for row in df_negations.itertuples():
                index, negation = row
                # Escape the negation too, so it is always matched literally
                count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (re.escape(negation), re.escape(phrase)), normed))
The problem with this approach is that it records a separate value for every negation pulled from the df_negations dataframe, rather than one value per phrase. So, if finditer doesn't find a match for the combination of "it was not" and "very good", it records a 0, and the same happens for every other negation on the list.
What I really want is a single overall count of the number of times a phrase was used without a preceding negation. In other words, I want to count every time "very good" occurs, but only when it's not preceded by any negation (e.g., "it was not") on my list of negations.
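One way to express "not preceded by any negation" in a single pattern is to chain one negative lookbehind per negation in front of the phrase. Python's `re` module requires each lookbehind to have a fixed width, so putting negations of different lengths into one lookbehind with `|` would raise an error, whereas separate lookbehinds are each fixed-width. The following is a minimal sketch of that idea, not a tested solution for the full data set; `build_pattern` and `count_unnegated` are hypothetical helper names, and the leading `\b` inside each lookbehind is an added assumption so that a negation such as "not" does not match the tail of a longer word.

```python
import re

def build_pattern(phrase, negations):
    # Hypothetical helper: one fixed-width negative lookbehind per negation.
    # The \b inside the lookbehind is an assumption added here so that
    # "not " does not match the end of a longer word such as "knot ".
    guards = "".join(r"(?<!\b%s )" % re.escape(n) for n in negations)
    return re.compile(guards + r"\b%s\b" % re.escape(phrase))

def count_unnegated(normed, pattern):
    # Count occurrences of the phrase not directly preceded by any negation.
    return sum(1 for _ in pattern.finditer(normed))

negations = ["it is not", "not"]
pattern = build_pattern("very good", negations)
normed = "it was very good but it is not very good and not very good very good"
count = count_unnegated(normed, pattern)
```

With this shape there is one count per phrase and text, regardless of how many negations are on the list.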
Also, I'm more than happy to hear suggestions on making the process run quicker. I have 100+ phrases, 100+ negations, and 1+ million pieces of text.
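On the speed side, one likely saving is to build and compile each phrase's pattern once, before looping over the text, instead of formatting a new pattern string for every row. The sketch below assumes the phrases and negations are available as plain lists of strings (for example, pulled out of the dataframes beforehand); `compile_patterns`, `count_row`, and `NON_WORD` are hypothetical names, and whether this is fast enough for a million rows is untested.

```python
import re

# Compiled once and reused for every piece of text.
NON_WORD = re.compile(r"[^\sa-zA-Z0-9]")

def compile_patterns(phrases, negations):
    # Hypothetical helper: one compiled pattern per phrase, each guarded by
    # a fixed-width negative lookbehind for every negation.
    guards = "".join(r"(?<!\b%s )" % re.escape(n) for n in negations)
    return [re.compile(guards + r"\b%s\b" % re.escape(p)) for p in phrases]

def count_row(text, patterns):
    # Normalize the same way as the original loop, then count each phrase.
    normed = NON_WORD.sub("", text).lower().strip()
    # The patterns contain no capturing groups, so findall returns one
    # string per match and len() gives the number of matches.
    return [len(p.findall(normed)) for p in patterns]

patterns = compile_patterns(["very good", "bad"], ["it is not", "not"])
counts = count_row("It was VERY good! Not bad, not very good.", patterns)
```

Each call to `count_row` returns one count per phrase, which could be written out as a single CSV row per piece of text.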