I am new to NLP and trying to get up to speed in this area. I am testing the two code samples below.
# Starting with the CountVectorizer/TfidfTransformer approach...
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec
# Calculate all the n-grams found in all documents
from itertools import islice
cvec.fit(body_list)
The last line (cvec.fit(body_list)) throws this error:
AttributeError: 'float' object has no attribute 'lower'
I am also testing this code sample:
from sklearn.feature_extraction.text import TfidfVectorizer
X = body_list
y = df['helpful_count'].tolist()
cv = TfidfVectorizer()
df_xcv = cv.fit_transform(X)
Again, the last line (cv.fit_transform(X)) throws the same error:
AttributeError: 'float' object has no attribute 'lower'
In both code samples, I am passing the same list to the vectorizer (directly as body_list in the first sample, as X in the second). The list is built like this:
body_list = df['body'].tolist()
'body' is a column in a dataframe with thousands of rows of comments from retail shoppers. The resulting list looks something like this:
['perfect product and price thank you',
'way to dark for me as i am fair this may work for darker skin tones',
'i love la rocheposay this truly balances out my skin well',
'my shoes started to tear on the outside of my mid foot be careful with them',
'perfect style and fit',
...]
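
One thing that may be worth checking, as an assumption rather than something confirmed on the real data: the error message says a float reached the vectorizer, and a missing value (NaN) in df['body'] comes out of .tolist() as a Python float. The sketch below uses a small made-up list (sample_body_list and find_non_strings are hypothetical names, not part of the code above) to show how non-string entries could be located and filtered out.

```python
# Hypothetical stand-in for body_list. float('nan') mimics what a missing
# comment looks like after df['body'].tolist() (assumption: the real column
# contains at least one missing value).
sample_body_list = [
    'perfect product and price thank you',
    float('nan'),
    'perfect style and fit',
]


def find_non_strings(texts):
    """Return (index, value) pairs for every entry that is not a str."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


# Locate the entries that would break a call to .lower()
bad_entries = find_non_strings(sample_body_list)
print("non-string entries:", bad_entries)

# Keep only real strings before handing the list to a vectorizer
cleaned = [t for t in sample_body_list if isinstance(t, str)]
print("cleaned list:", cleaned)
```

If this turns out to be the cause, note that dropping entries from the list alone would leave X and y with different lengths in the second sample. Cleaning at the dataframe level first, for example with df.dropna(subset=['body']) or df['body'].fillna(''), would keep 'body' and 'helpful_count' aligned.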