
I have a list of product reviews/descriptions in Excel and I am trying to classify them using Python based on words that appear in the reviews.

I import both the reviews and a list of words that would indicate the product falling into a certain classification into Python using Pandas, and then count the number of occurrences of the classification words.

This all works fine for single classification words, e.g. 'computer', but I am struggling to make it work for phrases, e.g. 'laptop case'.

I have looked through a few answers, but none were successful for me, including:

using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2, but because the text needs to be split up that does not work (and I think text.count does not work for lists either? See the quick repro below.)

Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve splitting the body of text into individual words?
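
For reference, a quick repro of the text.count attempt on a toy sentence:

text = 'I have a laptop case and a laptop bag'
print(text.count('laptop case'))                  # 1 - a single phrase works on the unsplit text
print(text.count(['laptop case', 'laptop bag']))  # raises TypeError - count expects a string, not a list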

The code I currently have (that works for individual terms) is:

pool = []  # one count per review
for i in df1.index:
    descriptions = df1['detaileddescription'][i]
    if type(descriptions) is str:
        # split the review into words and count the occurrences
        # of each classification word from df2['laptop_bag']
        descriptions = descriptions.split()
        pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
    else:
        pool.append(0)
print(pool)
henryjgilroy

2 Answers


You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.

To do that, check out the NLTK package:

from nltk import ngrams
sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
    print(gram)
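
To get back to counting the phrases, here is a minimal sketch building on the above (it assumes, as in the question, that the phrases of interest are two words long; the sample sentence and the df2['laptop_bag'] column are taken from the question purely for illustration):

from collections import Counter
from nltk import ngrams

sentence = 'I have a laptop case and a laptop bag'
phrases = ['laptop case', 'laptop bag']  # e.g. the values in df2['laptop_bag']

# count every bigram in the text, then sum the counts of the phrases of interest
bigram_counts = Counter(' '.join(gram) for gram in ngrams(sentence.split(), 2))
print(sum(bigram_counts[phrase] for phrase in phrases))  # prints 2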
David Stevens
    You also might want to look into the Spacy library, which has n-gram tokenizers as well, and has been fun to work with in my experience. – matisetorm Feb 20 '18 at 12:33

Sklearn's CountVectorizer is the standard way:

from sklearn.feature_extraction import text

# descriptions should be an iterable of strings (e.g. a list or a pandas Series)
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)

And if you want to see the counts as a dict:

# counts for the first document; get_feature_names() returns the learned vocabulary
count_dict = {k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0}
print(count_dict)

The default is unigrams; you can use bigrams or higher n-grams with the ngram_range parameter.
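
For example, a minimal sketch (the sample sentence is just an illustration; in newer scikit-learn versions get_feature_names is named get_feature_names_out):

vectorizer = text.CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vec = vectorizer.fit_transform(['I have a laptop case and a laptop bag'])
print({k: v for k, v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v > 0})
# 'laptop case' and 'laptop bag' each appear once alongside the single-word counts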

Uri Goren