I have a data frame with 201279 entries. The last column, labeled "text", holds customer reviews. The problem is that most of its values are missing and show up as NaN.

I read some interesting information from this question: Python numpy.nan and logical functions: wrong results

and I tried applying it to my problem:

    df1.columns

Index(['id', 'sku', 'title', 'reviewCount', 'commentCount', 'averageRating',
       'date', 'time', 'ProductName', 'CountOfBigTransactions', 'ClassID',
       'Weight', 'Width', 'Depth', 'Height', 'LifeCycleName', 'FinishName',
       'Color', 'Season', 'SizeOrUtility', 'Material', 'CountryOfOrigin',
       'Quartile', 'display-name', 'online-flag', 'long-description', 'text'],
      dtype='object')

I tried experimenting by doing this: df['firstName'][202360] == np.nan

which returns False, even though that index does contain np.nan.

So I looked for an answer, read through the question I linked, and saw that

np.bool(df1['text'][201279])==True

is a true statement. I thought, okay, I can run with this.

So, here's my code so far:

from textblob import TextBlob
import string

def remove_num_punct(aText):
    p = string.punctuation
    d = string.digits
    j = p + d
    table = str.maketrans(j, len(j)* ' ')
    return aText.translate(table)

#Process text
aList = []
for text in df1['text']:
    if np.bool(df1['text'])==True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(text)
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)

Then I would just convert aList with the sentiment to pd.DataFrame and join it to df1, then impute the missing values with K-nearest neighbors.

My problem is that this little routine raises a ValueError:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

So I'm not really sure what else to try. Thanks in advance!

EDIT: I have tried this:

i = 0
aList = []
for txt in df1['text'].isnull():
    i += 1
    if txt == True:
        aList.append(np.nan)

which correctly populates the list with NaN.

But this gives me a different error:

i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
        i+=1

AttributeError: 'float' object has no attribute 'translate'

Which doesn't make sense, since if it is not NaN, then it contains text, right?

Jabernet
  • Okay I may get df.isnull() to work but it throws an error if used on an individual index of df1['text'] – Jabernet Mar 07 '19 at 05:09

2 Answers

1
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'), pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

df1 = df['toy']
for i in range(len(df1)):
    if not df1[i]:
        df2 = df1.drop(i)

df2

You can try handling the null text this way.
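One caveat with the truthiness test in the loop above: "not value" catches None and empty strings, but not NaN, because NaN is truthy and never compares equal to anything, including itself. A minimal sketch of the difference, assuming a small made-up list of values rather than the asker's data, and using pd.isna-style checks instead:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration: None, NaN, an empty string, real text.
values = [None, np.nan, "", "Batmobile"]

# NaN never compares equal to anything, not even to itself.
nan_equals_nan = np.nan == np.nan

# NaN is truthy, so "not value" misses it; None and "" are falsy.
flagged_by_not = [not value for value in values]

# isna() flags both None and NaN and leaves the empty string alone.
toy = pd.Series(values, dtype=object)
flagged_by_isna = toy.isna().tolist()

# Keep only the rows that actually hold a value.
present = toy[toy.notna()]

print(nan_equals_nan, flagged_by_not, flagged_by_isna, present.tolist())
```

So for a column where the gaps are NaN, as in the question, isna()/notna() is likely the safer test.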

Tom.chen.kang

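I fixed it. With the i += 1 inside the else block, i only advanced on non-null rows, so df1['text'][i] fell out of step with the loop and eventually pointed at a NaN. I had to move the i += 1 out of the else block and into the body of the for loop, so that it runs on every iteration: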

i = 0
aList = []
for txt in df1['text'].isnull():
    if txt == True:
        aList.append(np.nan)
    else:
        b = remove_num_punct(df1['text'][i])
        pol = TextBlob(b).sentiment.polarity
        aList.append(pol)
    i+=1
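
The same logic can probably be written without a manual counter, which removes the chance of the index drifting. A minimal sketch, with some assumptions: score_reviews and word_count are hypothetical names, and word_count is only a stand-in scorer so the example runs without TextBlob installed. With TextBlob, the scorer would be something like lambda text: TextBlob(text).sentiment.polarity.

```python
import string

import numpy as np
import pandas as pd

# Same mapping as remove_num_punct: punctuation and digits become spaces.
_CHARS = string.punctuation + string.digits
_TABLE = str.maketrans(_CHARS, " " * len(_CHARS))


def score_reviews(reviews, scorer):
    """Return one score per review; anything that is not a string stays NaN."""
    def score_one(text):
        if not isinstance(text, str):
            return np.nan
        return scorer(text.translate(_TABLE))
    return reviews.map(score_one)


def word_count(text):
    # Hypothetical stand-in scorer, NOT a sentiment measure.
    return float(len(text.split()))


# Made-up sample data with a missing review in the middle.
sample = pd.Series(["Great desk, 10/10!", np.nan, "Too small for me."], dtype=object)
result = score_reviews(sample, word_count)
print(result)
```

Because the result keeps the index of the original column, it can be assigned straight back, for example df1['polarity'] = score_reviews(df1['text'], scorer), where the polarity column name is also just an assumption.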
Jabernet