Using pandas in Python 2.7, I am trying to count the number of times a phrase (e.g., "very good") appears in pieces of text stored in a CSV file. I have multiple phrases and multiple pieces of text. The first part works with the following code:

for row in df_book.itertuples():
    index, text = row
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', text).lower().strip()

    # Nested so that every piece of text is checked, not only the last one
    for row in df_phrase.itertuples():
        index, phrase = row
        count = sum(1 for x in re.finditer(r"\b%s\b" % (re.escape(phrase)), normed))
        file.write("%s," % (count))

However, I don't want to count the phrase if it's preceded by a different phrase (e.g., "it is not"). Therefore I used a negative lookbehind assertion:

for row in df_phrase.itertuples():
    index, phrase = row
    for row in df_negations.itertuples():
        index, negation = row
        count = sum(1 for x in re.finditer(r"(?<!%s )\b%s\b" % (negation, re.escape(phrase)), normed))

The problem with this approach is that it records a value for each and every negation as pulled from the df_negations dataframe. So, if finditer doesn't find "it was not 'very good'", then it will record a 0. And so on for every single possible negation.

What I really want is just an overall count for the number of times a phrase was used without a preceding phrase. In other words, I want to count every time "very good" occurs, but only when it's not preceded by a negation ("it was not") on my list of negations.

Also, I'm more than happy to hear suggestions on making the process run quicker. I have 100+ phrases, 100+ negations, and 1+ million pieces of text.

Matt
  • I believe you should read this: [Regex Pattern to Match, Excluding when… / Except between](http://stackoverflow.com/questions/23589174/regex-pattern-to-match-excluding-when-except-between/23589204#23589204) – Mariano Sep 11 '15 at 03:29
  • That looks right up my alley. Do you have a suggestion on how I can use that approach with a separate CSV file with all of my negations stored in each row? – Matt Sep 11 '15 at 04:06

1 Answer

I don't really do pandas, but this cheesy non-Pandas version gives some results with the data you sent me.

The primary complication is that the Python re module does not allow variable-width negative look-behind assertions. So this example looks for matching phrases, saving the starting location and text of each phrase, and then, if it found any, looks for negations in the same source string, saving the ending locations of the negations. To make sure that negation ending locations are the same as phrase starting locations, we capture the whitespace after each negation along with the negation itself.
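
To make the limitation and the workaround concrete, here is a minimal sketch on made-up data. The negations, the phrase, and the helper name `count_unnegated` are illustrative assumptions, not taken from the asker's CSV files. It first shows that `re` rejects a look-behind whose alternatives have different lengths, then counts matches by comparing negation end positions with phrase start positions.

```python
import re

# Hypothetical sample data, not taken from the asker's files.
negations = ["not", "it was not"]
phrase = "very good"

# A look-behind with alternatives of different widths is rejected by `re`.
try:
    re.compile(r"(?<!(?:%s) )\b%s\b" % ("|".join(negations), re.escape(phrase)))
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Workaround: match negations (plus trailing non-word characters) and
# phrases separately, then compare positions.
negation_re = re.compile(r"\b(?:%s)\W+" % "|".join(re.escape(n) for n in negations))
phrase_re = re.compile(r"\b%s\b" % re.escape(phrase))

def count_unnegated(text):
    """Count phrase matches that do not start where a negation ends."""
    negated_starts = set(m.end() for m in negation_re.finditer(text))
    return sum(1 for m in phrase_re.finditer(text)
               if m.start() not in negated_starts)
```

Calling `count_unnegated("it was very good but then it was not very good")` counts only the first occurrence, because the second one starts exactly where the match for "it was not " ends.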

Repeatedly calling functions in the re module is fairly costly. If you have a lot of text as you say, you might want to batch it up, e.g. by using 'non-matching-string'.join() on some of your source strings.

import re
from collections import defaultdict
import csv

def read_csv(fname):
    with open(fname, 'r') as csvfile:
        result = list(csv.reader(csvfile))
    return result

df_negations = read_csv('negations.csv')[1:]
df_phrases = read_csv('phrases.csv')[1:]
df_book = read_csv('test.csv')[1:]

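# Convert to str before escaping so non-string cells do not break re.escape
negations = [re.escape(str(row[0])) for row in df_negations]
phrases = [re.escape(str(row[1])) for row in df_phrases]

# Capture the non-word characters (whitespace) after each negation so
# that its end position lines up with the start of a following phrase.
negation_pattern = r"\b((?:%s)\W+)" % '|'.join(negations)
phrase_pattern = r"\b(%s)\b" % '|'.join(phrases)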

counts = defaultdict(int)

for row in df_book:
    normed = re.sub(r'[^\sa-zA-Z0-9]', '', row[0]).lower().strip()

    # Find the start location and text of every matching phrase
    found = [(x.start(), x.group()) for x in
                    re.finditer(phrase_pattern, normed)]
    if not found:
        continue

    # If we had matches, collect the end locations of all matching
    # negations (each includes its trailing whitespace)
    negated = set(x.end() for x in re.finditer(negation_pattern, normed))

    for start, text in found:
        # A phrase is negated when it starts exactly where a negation ends
        if start not in negated:
            counts[text] += 1
        else:
            print("%r negated and ignored" % text)

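# Print one line per phrase; formatting the string keeps the output
# identical under Python 2 and Python 3
for phrase, count in sorted(counts.items()):
    print("%d %s" % (count, phrase))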
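
The batching idea mentioned before the listing could look roughly like the sketch below. The separator string, the function name `count_batch`, and the sample patterns are assumptions made for illustration. The separator deliberately contains word characters: if it were made only of whitespace or punctuation, the `\W+` at the end of the negation pattern could stretch across it and wrongly negate the first phrase of the next text.

```python
import re
from collections import defaultdict

# Hypothetical separator, assumed never to occur in any text, phrase, or
# negation. It contains word characters so \W+ cannot bridge two texts.
SEPARATOR = " zqxsepzqx "

def count_batch(texts, phrase_re, negation_re):
    """Count un-negated phrase matches across several texts in one pass."""
    joined = SEPARATOR.join(texts)
    negated_starts = set(m.end() for m in negation_re.finditer(joined))
    counts = defaultdict(int)
    for m in phrase_re.finditer(joined):
        if m.start() not in negated_starts:
            counts[m.group()] += 1
    return counts

# Illustrative patterns and texts, not the asker's data.
phrase_re = re.compile(r"\b(very good|quite bad)\b")
negation_re = re.compile(r"\b(?:it was not|not)\W+")
batch = ["it was very good", "it was not very good", "not",
         "very good and quite bad"]
result = count_batch(batch, phrase_re, negation_re)
```

This returns totals for the whole batch, which matches the goal of one overall count per phrase. If per-text counts are needed, the match positions would have to be mapped back to the individual texts.
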
Patrick Maupin
  • Since this is my first time posting, what else can I provide that would be helpful? I tried to run the code but I get the error: 'Traceback (most recent call last): File "C:\...\Extract.py", line 28, in phrase_pattern = ' |'.join(phrases) File "C:\...\Extract.py", line 26, in adjectives = (re.escape(row[1]) for row in df_phrases.itertuples()) File "C:\Python27\lib\re.py", line 210, in escape s = list(pattern) TypeError: 'numpy.int64' object is not iterable' – Matt Sep 11 '15 at 04:04
  • You have a number in one of your rows, not a string, I suppose. Try wrapping `row[1]` as `str(row[1])` – Patrick Maupin Sep 11 '15 at 04:40