Error with TfidfVectorizer and TfidfTransformer

Question

I am new to NLP and I am trying to get up to speed with things in this area. I am testing two samples of code, as seen below.

# Starting with the CountVectorizer/TfidfTransformer approach...
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1, 2))
# Calculate all the n-grams found in all documents
cvec.fit(body_list)

The last line of code throws an error and the error message is:

AttributeError: 'float' object has no attribute 'lower'

I am also testing this code sample:

from sklearn.feature_extraction.text import TfidfVectorizer
X = body_list
y = df['helpful_count'].tolist()
cv = TfidfVectorizer()
df_xcv = cv.fit_transform(X)

Again, the last line of code throws the same error:

AttributeError: 'float' object has no attribute 'lower'

In both code samples, I am feeding in the same list, which is built like this:

body_list = df['body'].tolist()

So, 'body' is a field in a dataframe and it has thousands of rows of comments from retail shoppers, and it looks something like this:

['perfect product and price thank you',
 'way to dark for me as i am fair  this may work for darker skin tones',
 'i love la rocheposay this truly balances out my skin well',
 'my shoes started to tear on the outside of my mid foot be careful with them',
 'perfect style and fit',
 ...]

Is body_list in this format: `['Sentence 1', 'sentence 2', 'sentence 3']`? – Shubham Shaswat, Feb 07 '20 at 16:29
I see that I didn't represent that well. I just updated my original post. I have one list, with all records separated by commas. Is that the problem? Should this be converted into a list of lists? – ASH, Feb 07 '20 at 16:36
It's interesting that my code ran without any errors, although I skipped this part: `y = df['helpful_count'].tolist()` – Shubham Shaswat, Feb 07 '20 at 16:40
Oh, I see what you are saying. When I feed just those 5 lines of text into 'body_list' and run it, it does work! I did clean the text column before passing it to the X variable. Nevertheless, I think there are some special characters that are messing things up. What is the best way to move forward here? How can I figure out what's causing the issue? – ASH, Feb 07 '20 at 16:46
Check your cleaning steps one by one and test which one is causing the error. It is obvious that there are some float values in `X`, which somehow ended up there because of those steps. – Shubham Shaswat, Feb 07 '20 at 16:50
So, I ended up converting everything in that field to a string and now it works as expected. Final solution: `df['body'] = df[['body']].astype(str)` – ASH, Feb 07 '20 at 17:09
So it seems the values coming out of the cleaning process were not all strings. Glad it works. – Shubham Shaswat, Feb 07 '20 at 17:18
Does this answer your question? [AttributeError: 'float' object has no attribute 'lower'](https://stackoverflow.com/questions/34724246/attributeerror-float-object-has-no-attribute-lower) – AMC, Feb 08 '20 at 01:01
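
For anyone who wants to pin down which entries cause the error before casting everything to a string, here is a minimal sketch of the check suggested in the comments. It uses only plain Python on a list shaped like `body_list`. The helper names `find_non_strings` and `clean_docs` and the `sample` data are hypothetical, made up for illustration; they are not from the original post or from scikit-learn.

```python
import math

def find_non_strings(docs):
    """Return (index, value) pairs for every entry that is not a str."""
    return [(i, doc) for i, doc in enumerate(docs) if not isinstance(doc, str)]

def clean_docs(docs):
    """Drop missing values (None or float NaN) and cast the rest to str."""
    cleaned = []
    for doc in docs:
        if doc is None or (isinstance(doc, float) and math.isnan(doc)):
            continue  # skip missing comments instead of turning them into 'nan'
        cleaned.append(doc if isinstance(doc, str) else str(doc))
    return cleaned

# Hypothetical sample: two real comments plus the kinds of values that
# trigger "'float' object has no attribute 'lower'" in the vectorizer.
sample = ['perfect style and fit', float('nan'),
          'perfect product and price thank you', 4.5]

bad_entries = find_non_strings(sample)   # positions and values of non-strings
cleaned_sample = clean_docs(sample)      # safe to pass to a text vectorizer

print(bad_entries)
print(cleaned_sample)
```

Two caveats, assuming the floats are missing values (NaN) in the 'body' column, which is a common cause but not confirmed in the post. First, `astype(str)` turns NaN into the literal string 'nan', so those rows are vectorized as if the shopper had typed "nan". Second, dropping entries from the list alone would leave `y` with more rows than `X`; if labels are used, it is probably safer to drop the rows in the dataframe first, for example with `df.dropna(subset=['body'])`, and then build both lists from the result.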
