[screenshot of the DataFrame]

Hi all. I am working on a DataFrame (pictured above) with over 18,000 observations. What I'd like to do is take the text in the column 'review' row by row and then run a word count on it later. At the moment I have been trying to iterate over it, but I keep getting errors like "TypeError: 'float' object is not iterable". Here is the code I used:

def tokenize(text):
    for row in text:
        for i in row:
            if i is not None:
                words = i.lower().split()
                return words
            else:
                return None

data['review_two'] = data['review'].apply(tokenize)

Now my question is: how do I iterate effectively and efficiently over the column 'review' so that I can preprocess each row one after the other before performing a word count on it?

asked by Tunde; edited by nnnmmm
  • Please post real data here. You can use `data.head(10).to_dict()` to retrieve the first 10 rows and turn them into a dictionary that people can easily process. – Tai Jan 08 '18 at 15:09
  • @Tai the data is a csv file on my machine. I wish there was a way to post it here. – Tunde Jan 09 '18 at 15:13

3 Answers


My hypothesis for the error is that you have missing data, which shows up as NaN (a float) and makes the tokenize function fail. You can check it with pd.isnull(df["review"]), which gives you a boolean Series indicating whether each entry is NaN. If any(pd.isnull(df["review"])) is True, then there is a missing value in the column.
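
A minimal sketch of that check, using made-up data where the second review is missing:

```python
import numpy as np
import pandas as pd

# hypothetical data: the second review is missing (NaN, which is a float)
df = pd.DataFrame({"review": ["Good stuff.", np.nan, "Bad."]})

mask = pd.isnull(df["review"])
print(mask.tolist())   # [False, True, False]
print(mask.any())      # True -> at least one missing review
```

mask.any() is equivalent to wrapping the Series in the built-in any().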

I cannot reproduce the error since I don't have the data, but I think your goal can be achieved with this.

import pandas as pd
from collections import Counter

df = pd.DataFrame([{"name": "A", "review": "No it is not good.", "rating": 2},
                   {"name": "B", "review": "Awesome!", "rating": 5},
                   {"name": "C", "review": "This is fine.", "rating": 3},
                   {"name": "C", "review": "This is fine.", "rating": 3}])

# first .lower, then .replace to strip punctuation, and finally .split to get lists
df["splitted"] = df.review.str.lower().str.replace(r'[^\w\s]', '').str.split()

# wrap each list in a Counter, then sum them (Counters support addition)
df["splitted"].transform(lambda x: Counter(x)).sum()

Counter({'awesome': 1,
         'fine': 2,
         'good': 1,
         'is': 3,
         'it': 1,
         'no': 1,
         'not': 1,
         'this': 2})

The str.replace part removes punctuation; see the answer Replacing punctuation in a data frame based on punctuation list from @EdChum.
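
As a standalone illustration of what that pattern does, applied to a plain string with the re module rather than the pandas string accessor:

```python
import re

s = "No, it is not good!"
# [^\w\s] matches anything that is neither a word character nor whitespace,
# i.e. punctuation, which gets replaced with the empty string
cleaned = re.sub(r'[^\w\s]', '', s.lower())
print(cleaned.split())   # ['no', 'it', 'is', 'not', 'good']
```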

answered by Tai
  • actually I checked if the data had null values but it returned false. I'm actually doing feature engineering. My aim is to first iterate over df['review'], then take the reviews one by one and split them before removing the punctuation. All of these will be stored in another column in the dataframe. – Tunde Jan 09 '18 at 15:20
  • @TundeOre You can still check where it goes wrong by printing it out. Maybe the problem lies in the reading data part. – Tai Jan 09 '18 at 15:22
  • @TundeOre If that is your purpose, remove `str.replace('[^\w\s]','')` part. – Tai Jan 09 '18 at 15:23
  • I was able to do this: data["splits"] = data.review.str.lower().str.replace('[^\w\s]',' ').str.split(). It got me a new column called splits. However the second part of the code you produced above gave me an error : "TypeError: unhashable type: 'list' " – Tunde Jan 09 '18 at 15:34
  • @TundeOre There are some missing data, I suppose. Try `df.review.fillna("", inplace=True)` before you start the whole process. – Tai Jan 09 '18 at 15:40
  • @TundeOre Did you fix it? If so, can you accept the answer by clicking the check mark? I would appreciate it. – Tai Jan 09 '18 at 15:54
  • No. I tried running "df["splitted"].transform(lambda x: Counter(x)).sum()" and my jupyter notebook was frozen for a long time (maybe due to the number of observations: over 180,000). I had to interrupt the kernel. No result. I was wondering why you didn't use "apply" and instead used "transform". What advantages does transform have over "apply"? Isn't there a way to convert this into a function? Someone thought the dataset should be in a pandas dataframe for it to function well but I don't think so. Still waiting for more tips from you though. Thanks so far. – Tunde Jan 11 '18 at 09:03
  • My guess is that your dataset might be (1) not small and (2) in some cells the reviews might be long. I think both `apply` and `transform` might cause some overhead, but sure, I think you can try `apply`. Also, check whether you have enough memory, I think. – Tai Jan 11 '18 at 15:07
  • you are right. the observation is over 180K and some reviews are long too. Will try other machines though. However assuming the data was in pandas dataframe, will the approach you used differ? – Tunde Jan 13 '18 at 19:25
  • @TundeOre I think you should look into pyspark, perhaps. The work you're trying to do can be done with Spark's map-reduce methods in parallel, I suppose. Keyword: word count. You might first preprocess the data with pandas and then pass it into pyspark. – Tai Jan 13 '18 at 19:50

I'm not sure what you're trying to do, especially with for i in row. In any case, apply already iterates over the rows of your DataFrame/Series, so there's no need to do it in the function that you pass to apply.

Besides, your code does not return a TypeError for a DataFrame such as yours where the columns contain strings. See here for how to check if your 'review' column contains only text.
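
A sketch of that point, using a made-up three-row frame: the function you hand to apply on a Series receives one cell at a time, so it only needs to guard against non-string cells (such as NaN, which is a float):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"review": ["No it is not good.", np.nan, "Awesome!"]})

def tokenize(text):
    # apply passes a single cell, not a row, so no inner loop is needed
    if isinstance(text, str):
        return text.lower().split()
    return None  # NaN cells come through as floats and are skipped

df["review_two"] = df["review"].apply(tokenize)
print(df["review_two"].tolist())
# [['no', 'it', 'is', 'not', 'good.'], None, ['awesome!']]
```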

answered by nnnmmm

Maybe something like this, which gives you the word count; the rest of what you want I did not understand.

import pandas as pd

a = ['hello friend', 'a b c d']
b = pd.DataFrame(a)

print(b[0].str.split().str.len())

>> 0    2
   1    4
answered by SamuelNLP