
The data I have is stored in a pandas dataframe - please see a reproducible example below. The real dataframe will have more than 10k lines and a lot more words / phrases per line. I'd like to count the number of times each two-word phrase appears in column ReviewContent. If this were a text file and not the column of a dataframe I would use NLTK's Collocations module (something along the lines of answers here or here). My question is: how can I transform column ReviewContent into a single corpus text?

import numpy as np
import pandas as pd

data = {'ReviewContent' : ['Great food',
'Low prices but above average food',
'Staff was the worst',
'Great location and great food',
'Really low prices',
'The daily menu is usually great',
'I waited a long time to be served, but it was worth it. Great food']}

df = pd.DataFrame(data)

Expected output:

[(('great', 'food'), 3), (('low', 'prices'), 2), ...]

or

[('great food', 3), ('low prices', 2)...]
BogdanC

3 Answers


As a sequence/iterable, df["ReviewContent"] is structured exactly like the result of applying nltk.sent_tokenize() to a file of text: a list of strings containing one sentence each. So just use it the same way.

import collections
import nltk

counts = collections.Counter()
for sent in df["ReviewContent"]:
    words = nltk.word_tokenize(sent)
    counts.update(nltk.bigrams(words))

If you aren't sure what to do next, that has nothing to do with using a dataframe. For counting bigrams you don't need the collocations module, just nltk.bigrams() and a counting dictionary.
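The same idea also works without NLTK, which may help if you just want to check the counting logic. The sketch below uses a plain lowercase-and-split tokenizer as a crude stand-in for nltk.word_tokenize (so punctuation sticks to words), and zip() in place of nltk.bigrams():

```python
import collections

# The question's example reviews
reviews = [
    'Great food',
    'Low prices but above average food',
    'Staff was the worst',
    'Great location and great food',
    'Really low prices',
    'The daily menu is usually great',
    'I waited a long time to be served, but it was worth it. Great food',
]

counts = collections.Counter()
for sent in reviews:
    words = sent.lower().split()          # crude stand-in for nltk.word_tokenize
    counts.update(zip(words, words[1:]))  # the bigrams of this sentence

print(counts.most_common(2))
# [(('great', 'food'), 3), (('low', 'prices'), 2)]
```

counts.most_common() then gives exactly the [((word1, word2), n), ...] format from the question's expected output.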

alexis

I suggest using join:

corpus = ' '.join(df.ReviewContent)

Here's the result:

In [102]: corpus
Out[102]: 'Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food'
Adrienne
    That would work but it would create "artificial" phrases - the last word of a review joined with the first word of the next review. I could probably work around this somehow - if I don't receive a better answer I'll certainly choose this one. – BogdanC May 16 '17 at 12:43
  • Hopefully my answer gets at your question "how can I transform column ReviewContent into a single corpus text?" I agree about the downside of artificial phrases and wonder how others handle this. In the past, I've tried joining the text with an indicator symbol like `~` instead of a space, and then use `finder = BigramCollocationFinder.from_words(corpus)` followed by a filter to remove the artificial phrases: `finder.apply_word_filter(lambda w: w == '~')`, based on this example code: http://www.nltk.org/howto/collocations.html. – Adrienne May 16 '17 at 12:58
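The sentinel trick from the comment above can be sketched without the collocations module too. The `~` separator is the commenter's idea; this version uses a plain split instead of NLTK tokenization and simply drops every bigram that touches the sentinel, so no phrase spans two reviews:

```python
import collections

reviews = ['Great food',
           'Low prices but above average food',
           'Great location and great food']

# Join reviews with a sentinel token that can never be a real word...
tokens = ' ~ '.join(reviews).lower().split()

# ...then discard any bigram containing the sentinel: these are exactly
# the "artificial" phrases formed across review boundaries.
bigrams = [b for b in zip(tokens, tokens[1:]) if '~' not in b]
counts = collections.Counter(bigrams)
```

With these three reviews, counts[('great', 'food')] is 2, and cross-review pairs such as ('food', 'low') never appear because the sentinel sits between them.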

Using Pandas version 0.20.1+, you can create a SparseDataFrame directly from sparse matrices:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(2,2))

r = pd.SparseDataFrame(cv.fit_transform(df.ReviewContent), 
                       columns=cv.get_feature_names(),
                       index=df.index,
                       default_fill_value=0)

Result:

In [52]: r
Out[52]:
   above average  and great  average food  be served  but above  but it  daily menu  great food  great location  \
0              0          0             0          0          0       0           0           1               0
1              1          0             1          0          1       0           0           0               0
2              0          0             0          0          0       0           0           0               0
3              0          1             0          0          0       0           0           1               1
4              0          0             0          0          0       0           0           0               0
5              0          0             0          0          0       0           1           0               0
6              0          0             0          1          0       1           0           1               0

   is usually    ...     staff was  the daily  the worst  time to  to be  usually great  waited long  was the  was worth  \
0           0    ...             0          0          0        0      0              0            0        0          0
1           0    ...             0          0          0        0      0              0            0        0          0
2           0    ...             1          0          1        0      0              0            0        1          0
3           0    ...             0          0          0        0      0              0            0        0          0
4           0    ...             0          0          0        0      0              0            0        0          0
5           1    ...             0          1          0        0      0              1            0        0          0
6           0    ...             0          0          0        1      1              0            1        0          1

   worth it
0         0
1         0
2         0
3         0
4         0
5         0
6         1

[7 rows x 29 columns]
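Note that later library versions changed this API: pandas 1.0 removed SparseDataFrame, and scikit-learn renamed get_feature_names to get_feature_names_out. To get the per-bigram totals the question asks for, you can skip the dataframe entirely and sum the columns of the sparse matrix - a sketch that handles both scikit-learn APIs:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'ReviewContent': [
    'Great food',
    'Low prices but above average food',
    'Staff was the worst',
    'Great location and great food',
    'Really low prices',
    'The daily menu is usually great',
    'I waited a long time to be served, but it was worth it. Great food',
]})

cv = CountVectorizer(ngram_range=(2, 2))
X = cv.fit_transform(df.ReviewContent)  # sparse document-by-bigram matrix

# Column sums = total occurrences of each bigram across all reviews
totals = np.asarray(X.sum(axis=0)).ravel()

# get_feature_names() was renamed get_feature_names_out() in scikit-learn 1.0
names = (cv.get_feature_names_out() if hasattr(cv, 'get_feature_names_out')
         else cv.get_feature_names())

counts = dict(zip(names, totals))  # e.g. counts['great food'] == 3
```

One tokenization caveat: CountVectorizer's default token pattern drops single-character words, which is why the table above contains a "waited long" column ("waited a long" with the "a" removed).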

If you simply want to concatenate the strings from all rows into a single one, use Series.str.cat() method:

text = df.ReviewContent.str.cat(sep=' ')

Result:

In [57]: print(text)
Great food Low prices but above average food Staff was the worst Great location and great food Really low prices The daily menu is usually great I waited a long time to be served, but it was worth it. Great food
MaxU - stand with Ukraine