
Problem

How do I handle NaN values when almost every row contains them, so that dropna is not a viable option? I would like to run similarity against a more complete data frame of reviews; currently dropna removes all but 1 row, which makes the result useless.

In the past the original file had only a few NaN values, so dropna worked. In this case 2800 rows contain NaN and only 1 does not. I assume spaCy cannot compute similarity on NaN or zeros.

I am trying to run spaCy similarity on a data frame that has several NaN fields. How do I handle the NaN values without dropping the rows?

I tried 3 things without success: dropna, filling with zeros, and filling with the median.

With dropna, and also with filling in zeros, nothing is left to compare, because only one review row is complete.

I am new to NLP and got stuck here.

Import

import csv
import pandas as pd
import nltk
import numpy as np
from nltk.tokenize import PunktSentenceTokenizer,RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tempfile import NamedTemporaryFile
import shutil
import warnings

import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

warnings.filterwarnings("ignore", category=DeprecationWarning)
#reading the attributes file
#check into the "attributes.txt" file for the proper format
#each attribute has to be listed in a new line.
attributes=list(line.strip() for line in open('attributes.txt'))
attributes=" ".join(attributes)

attributes.txt

This is just a text file with one word per line, see below:

Airy
Bright
Child-like
Effervescent
Graceful
Hymn-like
Innocent
Light
Naive
Optimistic
Poignant
Precious
Pure
Rhapsodic
Sacred
Shimmering
Sugary
Sweet
Tender
Thoughtful
Transparent/Translucent
Whimsical

# current pandas versions of read_excel take no encoding/errors arguments
reviews_df = pd.read_excel('adult_contemporary_reviews(A-B).xlsx')
reviews_df.head()

The original data-frame

[screenshot: dataframe]

reviews_df.shape

output (159, 32)

I tried to remove nan by filling with zeros

reviews_df=reviews_df.fillna(0)
reviews_df

[screenshot: dataframe after filling NaN with zeros]

This is where I got confused. I believe only one of these methods worked:

reviews_df['similarity'] = -1

# for index, row in reviews_df.iterrows():
# reviews_df.loc[index,'similarity'] = nlp(row["product_review"]).similarity(nlp(attributes))

for i in reviews_df.columns:
   reviews_df.loc[1,i] = nlp(reviews_df.loc[0,i]).similarity(nlp(attributes))

#     # Iterate over the sequence of column names
# for column in reviews_df:
#     reviews_df.loc[index,'0'] = nlp(column["aaronneville-reviews"]).similarity(nlp(attributes))

This raises a TypeError. I assume this is because I replaced the NaN values with zeros, which are ints:

TypeError: object of type 'numpy.int64' has no len()


Lastly, I write the output, which I assume works but not as planned, because everything was dropped:

#writing to an output file
reviews_df.to_excel(r"C:\Users\Name\nlp\sim\Similarity_output.xlsx", index=False)


Because the imported file has so many NaN values, the output contains only 1 review and its similarity score. It should instead compare all reviews by handling the NaN values somehow.

[screenshot: output xlsx file]

john taylor

1 Answer


You are getting the error because nlp() expects a string, and the cells it receives here are numeric: np.float64 for NaN, or np.int64 for the zeros left by fillna(0). So your idea to use .fillna() is a step in the right direction. However, you also have to ensure that all columns are of dtype str/object.

You can do this by

for i in reviews_df.columns:
   # cast the whole column to str so that nlp() always receives text
   reviews_df[i] = reviews_df[i].astype(str)
   reviews_df.loc[1,i] = nlp(reviews_df.loc[0,i]).similarity(nlp(attributes))

Check out this SO post for more information about astype.
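
One caveat with a plain string cast: depending on the pandas version, astype(str) can turn a missing value into the literal text "nan", which would then be scored like any other word. Below is a minimal sketch of a per-cell alternative, written in plain Python so that it runs without pandas or spaCy. The name to_text and the sample cells are made up for illustration and are not part of either library.

```python
import math


def to_text(cell):
    """Return a cell value as text; missing values become an empty string."""
    if cell is None:
        return ""
    # NaN is a float (numpy.float64 is a float subclass), so check it explicitly.
    if isinstance(cell, float) and math.isnan(cell):
        return ""
    return str(cell)


# Hypothetical cells: a real review, a NaN, a zero left by fillna(0), and None.
cells = ["A tender, graceful record", float("nan"), 0, None]
texts = [to_text(c) for c in cells]
print(texts)
```

It could be applied column by column, for example with reviews_df[i].map(to_text), before calling nlp(). Note that a zero left behind by fillna(0) becomes the text "0", so it is probably better to skip fillna(0) altogether.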

This will (probably) solve the TypeError, but I am not sure if it leads to meaningful results. You may have to drop some of the columns, i.e. clean the dataframe, before applying any metrics. But I don't understand your code well enough to make any suggestions.
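
One possible way to clean the frame without losing reviews, sketched under assumptions: if the sheet holds one column per artist and the columns have different lengths, almost every row will contain a NaN, so a row-wise dropna removes nearly everything. Reshaping to one review per row with melt lets dropna remove only the empty cells. The toy frame, the column names source and review, and the overlap_score function are all made up for illustration. overlap_score is a simple word-overlap placeholder standing in for spaCy, so the sketch runs without a model.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the wide review sheet: one column per artist,
# shorter columns padded with NaN.
wide = pd.DataFrame({
    "artist_a": ["airy and bright", np.nan, np.nan],
    "artist_b": ["a tender ballad", "sugary but sweet", "dark and heavy"],
})
attributes = "Airy Bright Sweet Tender"


def overlap_score(text, reference):
    """Placeholder similarity: Jaccard overlap of lowercase word sets."""
    a, b = set(text.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# One review per row; only the empty cells are dropped.
long_df = (
    wide.melt(var_name="source", value_name="review")
        .dropna(subset=["review"])
        .reset_index(drop=True)
)
long_df["review"] = long_df["review"].astype(str)
long_df["similarity"] = [overlap_score(r, attributes) for r in long_df["review"]]
print(long_df)
```

With spaCy loaded as in the question, the placeholder could be swapped for nlp(review).similarity(nlp(attributes)), ideally building nlp(attributes) once outside the loop.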

above_c_level
Cool, thanks so much, my brain was hurting. I will convert the floats, fillna with blanks, and make sure that all columns are of type str. The original attribute file was from a different project, so it gave horrible results, but I will update it to get better results, in theory. Anyway, it is one step closer, which is awesome in itself. – john taylor Jun 07 '20 at 20:04