22

I want to count the number of times a word occurs in each review string.

I am reading the CSV file and storing it in a pandas DataFrame using the line below:

reviews = pd.read_csv("amazon_baby.csv")

The code below works when I apply it to a single review:

print reviews["review"][1]
a = reviews["review"][1].split("disappointed")
print a
b = len(a)
print b

The output of the above lines was:

it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.
['it came early and was not ', '. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.']
2

When I apply the same logic to the entire DataFrame using the line below, I receive an error message:

reviews['disappointed'] = len(reviews["review"].split("disappointed"))-1

Error message:

Traceback (most recent call last):
  File "C:/Users/gouta/PycharmProjects/MLCourse1/Classifier.py", line 12, in <module>
    reviews['disappointed'] = len(reviews["review"].split("disappointed"))-1
  File "C:\Users\gouta\Anaconda2\lib\site-packages\pandas\core\generic.py", line 2360, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'split'
goutam

4 Answers

25

You're trying to split the entire review column of the data frame (which is the Series mentioned in the error message). What you want to do is apply a function to each row of the data frame, which you can do by calling apply on the data frame:

f = lambda x: len(x["review"].split("disappointed")) -1
reviews["disappointed"] = reviews.apply(f, axis=1)
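
As a minimal sketch of the same idea, the function can also be applied to the review column alone rather than to whole rows. The small frame below is a made-up stand-in for amazon_baby.csv, and `count_disappointed` is a hypothetical helper name. The `fillna("")` step assumes the real column may contain missing reviews, which would otherwise fail on `.split`.

```python
import pandas as pd

# Made-up stand-in for amazon_baby.csv (assumption: one text column named "review")
reviews = pd.DataFrame({
    "review": [
        "very disappointed, truly disappointed",
        "it came early and was not disappointed.",
        "highly recommend it.",
        None,  # a missing review, to show why fillna is used below
    ]
})

def count_disappointed(text):
    # Splitting on the word yields one more piece than there are occurrences
    return len(text.split("disappointed")) - 1

# Series.apply calls the function once per element, so each call receives a plain str
reviews["disappointed"] = reviews["review"].fillna("").apply(count_disappointed)
print(reviews["disappointed"].tolist())
```

Applying to the single column avoids building a row object for every call, which is usually cheaper than `DataFrame.apply` with `axis=1`.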
Hirak Sarkar
hoyland
  • Do I need to add any other lines to this code other than reading the data into the reviews variable? Because the above two lines did not work. – goutam Mar 19 '16 at 23:21
  • I think it should work as written, but I didn't test it. What went wrong? – hoyland Mar 19 '16 at 23:29
  • File "Classifier.py", line 18, in reviews["disappointed"] = reviews.apply(f, axis=1) File "pandas\core\frame.py", line 3972, in apply return self._apply_standard(f, axis, reduce=reduce) File "pandas\core\frame.py", line 4064, in _apply_standard results[i] = func(v) File "Classifier.py", line 17, in f = lambda x: len(reviews["review"].split("disappointed")) -1 File "pandas\core\generic.py", line 2360, in __getattr__ (type(self).__name__, name)) AttributeError: ("'Series' object has no attribute 'split'", u'occurred at index 0') – goutam Mar 19 '16 at 23:34
  • Oops. It should be `lambda x: len(x["review"].split("disappointed")) -1`. `x` is then the row being passed to the function not the whole data frame itself. – hoyland Mar 19 '16 at 23:35
16

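pandas (as of 0.20.3, at least) has pandas.Series.str.split(), which acts on every string in the Series and does the split. So you can simply split and then count the pieces in each row, minus one. Note that calling len() directly on the result would return the number of rows, not a per-row count, so use .str.len() instead:

reviews['review'].str.split('disappointed').str.len() - 1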

pandas.Series.str.split
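
A short sketch of the vectorized approach on a made-up frame (the data is an assumption, not the real CSV). It compares the split-based count with `Series.str.count`, which counts pattern matches directly. `str.count` treats its argument as a regular expression, so a word containing special characters would need escaping first.

```python
import pandas as pd

# Made-up stand-in for the review column
reviews = pd.DataFrame({
    "review": [
        "very disappointed, truly disappointed",
        "it came early and was not disappointed.",
        "highly recommend it.",
    ]
})

# Split every string, then take the length of each resulting list, minus one
by_split = reviews["review"].str.split("disappointed").str.len() - 1

# Count occurrences directly (the argument is interpreted as a regex)
by_count = reviews["review"].str.count("disappointed")

reviews["disappointed"] = by_count
print(by_split.tolist(), by_count.tolist())
```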

Austin
  • 1
    I think this is the most Pandas-y solution, and probably faster. Wonder if the OP got a chance to do performance testing on it. – rajan Sep 24 '19 at 06:56
2

Well, the problem is with:

reviews["review"]

The above is a Series. In your first snippet, you are doing this:

reviews["review"][1].split("disappointed")

That is, you are indexing into the Series to get a single review, which is a plain string. You could instead loop over all rows of the column and perform the desired action on each one. For example:

for index, row in reviews.iterrows():
    # number of pieces minus one is the number of occurrences
    print(len(row['review'].split("disappointed")) - 1)
cidetto
Hossain Muctadir
2

You can use .str to use string methods on series of strings:

reviews["review"].str.split("disappointed")
Stop harming Monica
  • str won't solve the problem. reviews["review"] gives a series of strings not one string. – Ozgur Ozturk Feb 02 '17 at 18:39
  • @OzgurOzturk It does solve the problem of applying `split` to each row. It does not solve the problem of computing the lengths because I thought that was easy enough. And I know that `reviews["review"]` is a series of strings. Why did you think I don't? – Stop harming Monica Feb 02 '17 at 22:39
  • This sort of worked for my problem....how would you get only the value after the split for the series? – santma Jun 03 '20 at 05:00
  • @santma I don't know what you mean by "the value" here. You might want to ask a new question, including a [mcve]. – Stop harming Monica Jun 03 '20 at 10:19
  • @StopharmingMonica When I use .split(), I get both sides of the split() returned. For example, I have a list of URLs in a dataframe that follow this format: 'https://www.domain.co/product/product-name/'. I'd like to get just "product-name". When I used .split("product/") I get ["https://www.domain.co/","product-name/"]. – santma Jun 03 '20 at 18:08
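
For the URL case in the last comment, one possible sketch is to index into the split result with `.str[1]` to keep only the part after the separator. The second URL below is made up for illustration, and the approach assumes every value contains "product/" exactly once.

```python
import pandas as pd

# First URL is from the comment; the second is a hypothetical extra row
urls = pd.Series([
    "https://www.domain.co/product/product-name/",
    "https://www.domain.co/product/other-item/",
])

# .str[1] picks the second element of each split list; rstrip drops the trailing slash
names = urls.str.split("product/").str[1].str.rstrip("/")
print(names.tolist())
```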