Modifying corpus by inserting codewords using Python

Question

I have a corpus of about 30,000 customer reviews in a CSV (or plain-text) file, with one review per line. Some examples:

This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad smell from the ice maker
The food was awesome, but the water was very rude

I want to change these texts to the following:

This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE, the price is very reasonable POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE

I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:

amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind

And another text file contains negative words such as:

rude
poor
worst
dirty
slow
bad

So I want a Python script that reads each customer review and, whenever one of the positive words is found, inserts "POSITIVE" after that word; whenever one of the negative words is found, it inserts "NEGATIVE" after that word.

Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.

Specifically, my_escaper works (it finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about a thousand positive or negative words. What I want is for the code to read the word lists from the lexicons, search for those words in the corpus, and replace them (for example, "good" with "good POSITIVE" and "bad" with "bad NEGATIVE").

#adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings

import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# This my_escaper works for the hard-coded pairs below, but I need the pairs
# to come from the two lexicon files (about a thousand words each).

my_escaper = multiple_replacer(('cheap','cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt","r") as file:
    for line in file:
        review = line.strip()
        d.append(review) 

for line in d:
    print(my_escaper(line))

I have added an explanation about what works and what needs more. Hope this makes sense to you. Thanks. — kevin, Apr 22 '15 at 19:07

Matthew Nizol · Accepted Answer · 2015-04-22T20:18:31.690

A straightforward way to code this is to load the positive and negative words from your lexicons into two separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets; checking set membership is O(1) in the average case. Append the sentiment label (if any) after the word in the list, then join the list to build the final string.

Example:

import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
    ]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):  # raw string so \W is not treated as a string escape
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))

While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
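
One possible way around the lost punctuation, offered as a sketch rather than as part of the original answer, is to leave the sentence intact and let re.sub() append the label after each match. The helper name build_tagger is hypothetical, the small lexicons are sample data, and the sketch assumes every lexicon entry begins and ends with a letter or digit so that the \b word boundaries behave as expected.

```python
import re

# Sample lexicons for illustration; real ones would come from the lexicon files.
positive_words = {'amazing', 'great', 'awesome', 'reasonable', 'very cool'}
negative_words = {'poor', 'bad', 'rude'}

def build_tagger(positive, negative):
    """Return a function that inserts POSITIVE/NEGATIVE after lexicon entries."""
    labels = {}
    for word in negative:
        labels[word.lower()] = 'NEGATIVE'
    for word in positive:
        # If an entry is in both lexicons, POSITIVE wins here (arbitrary choice).
        labels[word.lower()] = 'POSITIVE'
    if not labels:
        # Nothing to tag: return the sentence unchanged.
        return lambda sentence: sentence
    # Try longer entries first so "very cool" is preferred over any shorter
    # entry that starts at the same position.
    entries = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(e) for e in entries) + r')\b',
        re.IGNORECASE)

    def tag(sentence):
        # Keep the matched text as-is and append its label after a space.
        return pattern.sub(
            lambda m: m.group(0) + ' ' + labels[m.group(0).lower()],
            sentence)
    return tag

tag = build_tagger(positive_words, negative_words)
print(tag("This bike is amazing, but the brake is very poor"))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
```

Because the pattern is compiled once, the same tag function can be reused for every review. The word boundaries keep "bad" from matching inside "badly", and multi-word entries such as "very cool" are matched only when the words are separated by a single space.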

wow! any suggestion to generate the output file in either csv or txt? thanks so much for your insight! — kevin, Apr 22 '15 at 20:48
To write the resulting sentence to a text file, you can either use the print() function or the write() method of a file object. See http://stackoverflow.com/questions/6159900/correct-way-to-write-line-to-file-in-python. — Matthew Nizol, Apr 22 '15 at 20:57
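
Putting the pieces together, a minimal sketch of reading the lexicons from files and writing the tagged reviews with write() might look like the following. It is written for Python 3; the function names load_lexicon, tag_sentence and tag_file are hypothetical, and the UTF-8 encoding and the one-entry-per-line file layout are assumptions.

```python
import re

def load_lexicon(path):
    """Read one lexicon entry per line into a lowercase set, skipping blanks."""
    with open(path, encoding='utf-8') as f:  # UTF-8 is an assumption
        return set(line.strip().lower() for line in f if line.strip())

def tag_sentence(sentence, positive_words, negative_words):
    """Word-by-word set lookup, as in the answer above (punctuation is dropped)."""
    tagged = []
    for word in re.split(r'\W+', sentence):
        if not word:
            continue  # re.split can yield empty strings at the edges
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append('POSITIVE')
        elif word.lower() in negative_words:
            tagged.append('NEGATIVE')
    return ' '.join(tagged)

def tag_file(in_path, out_path, positive_words, negative_words):
    """Write one tagged review per line to out_path."""
    with open(in_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            tagged = tag_sentence(line.strip(), positive_words, negative_words)
            dst.write(tagged + '\n')
```

With hypothetical file names, a call such as tag_file('review.txt', 'tagged_reviews.txt', load_lexicon('positive.txt'), load_lexicon('negative.txt')) would produce the output file. Since this version looks words up one at a time, multi-word entries such as "very cool" would not be matched.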

score 0 · Answer 2 · answered Apr 22 '15 at 20:01

If I understood correctly, you need something like:

# Append the matching label after the word; no regex substitution is needed here
if word in POSITIVE_LIST:
    word = word + " POSITIVE"
elif word in NEGATIVE_LIST:
    word = word + " NEGATIVE"

Is it OK with you?

wanderlust
