Problem
I am trying to run spaCy similarity on a data frame where almost every row contains NaN fields, so dropna is not a valid option: of roughly 2,800 rows only one is complete, and dropping the incomplete rows leaves a single review to compare, which is not useful. In the past the original file had only a few NaN values, so dropna worked, but not here. I assume spaCy will not handle NaN or zeros.
How do I handle the NaN values without dropping the rows?
I tried three things with no success: dropna, filling with zeros, and filling with the median. dropna and filling with zero both leave nothing useful, because all but one review row is incomplete.
New to NLP and got stuck here.
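For reference, dropna with its defaults removes every row that contains any NaN, which is why a mostly-empty frame collapses. A minimal sketch with made-up data (hypothetical column names, not the real review file):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the review data: most cells are NaN (hypothetical values).
df = pd.DataFrame({
    "artist": ["a", "b", "c"],
    "review_1": ["great album", np.nan, "sweet and airy"],
    "review_2": [np.nan, np.nan, np.nan],
})

# Every row has at least one NaN, so dropna() keeps nothing.
print(df.dropna().shape)  # → (0, 3)
```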
Imports
import csv
import pandas as pd
import nltk
import numpy as np
from nltk.tokenize import PunktSentenceTokenizer,RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tempfile import NamedTemporaryFile
import shutil
import warnings
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
warnings.filterwarnings("ignore", category=DeprecationWarning)
#reading the attributes file
#check the "attributes.txt" file for the proper format:
#each attribute has to be listed on a new line.
with open('attributes.txt') as f:
    attributes = [line.strip() for line in f]
attributes = " ".join(attributes)
attributes.txt
This just loads a text file with a list of words, one per line (see below):
Airy
Bright
Child-like
Effervescent
Graceful
Hymn-like
Innocent
Light
Naive
Optimistic
Poignant
Precious
Pure
Rhapsodic
Sacred
Shimmering
Sugary
Sweet
Tender
Thoughtful
Transparent/Translucent
Whimsical
reviews_df = pd.read_excel('adult_contemporary_reviews(A-B).xlsx')  # read_excel does not accept encoding/errors arguments
reviews_df.head()
The original data frame:
reviews_df.shape
Output: (159, 32)
I tried to remove the NaN values by filling with zeros:
reviews_df=reviews_df.fillna(0)
reviews_df
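One thing worth noting (a sketch, not part of the original attempt): filling with an empty string instead of 0 keeps every cell a str, which is a type nlp() can accept, whereas fillna(0) leaves numpy integers in the text columns:

```python
import pandas as pd
import numpy as np

# Filling NaN with "" instead of 0 keeps the column all-str, so nlp() never
# sees a numpy integer (hypothetical column name).
df = pd.DataFrame({"product_review": ["nice melody", np.nan]})
df["product_review"] = df["product_review"].fillna("")
print(df["product_review"].map(type).tolist())  # both entries are str
```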
This is where I got confused; I believe only one of these approaches worked:
reviews_df['similarity'] = -1

# for index, row in reviews_df.iterrows():
#     reviews_df.loc[index, 'similarity'] = nlp(row["product_review"]).similarity(nlp(attributes))

for i in reviews_df.columns:
    reviews_df.loc[1, i] = nlp(reviews_df.loc[0, i]).similarity(nlp(attributes))

# # Iterate over the sequence of column names
# for column in reviews_df:
#     reviews_df.loc[index, '0'] = nlp(column["aaronneville-reviews"]).similarity(nlp(attributes))
I get a TypeError, which I assume is because I replaced the NaN values with zeros, which are ints:
TypeError: object of type 'numpy.int64' has no len()
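That suspicion matches the error: nlp() expects text and calls len() on its input, and the numpy integer that fillna(0) put in the frame has no length. The same TypeError can be reproduced without spaCy:

```python
import numpy as np

# len() on the numpy integer left behind by fillna(0) raises the same
# TypeError that spaCy surfaces.
try:
    len(np.int64(0))
except TypeError as err:
    print(err)  # object of type 'numpy.int64' has no len()
```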
Lastly the output, which I assume works but not as planned, because it dropped everything:
#writing to an output file
reviews_df.to_excel(r"C:\Users\Name\nlp\sim\Similarity_output.xlsx", index=False)
Because the imported data frame has so many NaN values, the output contained only one review and its similarity score. It should instead compare all rows by handling the NaN values somehow.
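One possible direction (a sketch under assumptions, not tested against the real file): iterate row by row and skip the NaN cells instead of dropping whole rows, leaving a -1 sentinel for missing reviews. Here score is a placeholder standing in for nlp(text).similarity(nlp(attributes)), and the column name product_review is taken from the commented-out loop above:

```python
import pandas as pd

# Placeholder scorer: stands in for nlp(text).similarity(nlp(attributes)).
def score(text):
    return len(text.split()) / 10.0  # not a real similarity measure

reviews_df = pd.DataFrame(
    {"product_review": ["a bright airy song", None, "sweet and tender"]}
)
reviews_df["similarity"] = -1.0
for index, row in reviews_df.iterrows():
    text = row["product_review"]
    if pd.isna(text):
        continue  # keep the -1 sentinel for missing reviews
    reviews_df.loc[index, "similarity"] = score(str(text))
```

With the real frame, the same skip-if-NaN check avoids both dropna (which removes the rows) and fillna(0) (which feeds integers to nlp()).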