Vectorize list of lists uisng countvectorizer() & tfidfvectorizer()

Question

So I have the following list of lists which is tokenized:

tokenized_list = [['ALL', 'MY', 'CATS', 'IN', 'A', 'ROW'], ['WHEN', 'MY', 
                   'CAT', 'SITS', 'DOWN', ',', 'SHE', 'LOOKS', 'LIKE', 'A', 
                   'FURBY', 'TOY', '!'], ['THE', CAT', 'FROM', 'OUTER', 
                   'SPACE'], ['SUNSHINE', 'LOVES', 'TO', 'SIT', 
                   'LIKE', 'THIS', 'FOR', 'SOME', 'REASON', '.']]

When i try to vectorize it using the CountVectorizer() or TfIdfVectorizer()

 from sklearn.feature_extraction.text import CountVectorizer
 vectorizer = CountVectorizer()
 print(vectorizer.fit_transform(tokenized_list).todense()) 
 print(vectorizer.vocabulary_)

I am getting the following error:

AttributeError: 'list' object has no attribute 'lower'

And if I put a simple list inside the vectorizer.fit_transform() function it works properly.

How do I remove this error?

I just want to use the count vectorizer function in the process of building a bag of words model in nlp. What do you mean by flatten ? Sorry, new to this. — explorer_x, Apr 10 '18 at 08:28
You can just convert the inner list to string by using `tokenized_list = [' '.join(inner_list) for inner_list in tokenized_list]` — Vivek Kumar, Apr 10 '18 at 08:48
@juanpa.arrivillaga IMO this is not a duplicate of the question specified. Flattening the posted list as it is actually gives a wrong result. `@Vivek Kumar`'s comment solves the question by joining the inner lists to strings. Would you reconsider the flag? — WolfgangK, Apr 10 '18 at 15:05

Rakesh · Answer 1 · 2018-04-10T08:35:54.480

tokenized_list is a list of lists. Trying iterating the list and passing the sublist as argument or you can flatten the list.

Ex:

tokenized_list = [['ALL', 'MY', 'CATS', 'IN', 'A', 'ROW'], ['WHEN', 'MY', 
                   'CAT', 'SITS', 'DOWN', ',', 'SHE', 'LOOKS', 'LIKE', 'A', 
                   'FURBY', 'TOY', '!'], ['THE', 'CAT', 'FROM', 'OUTER', 
                   'SPACE'], ['SUNSHINE', 'LOVES', 'TO', 'SIT', 
                   'LIKE', 'THIS', 'FOR', 'SOME', 'REASON', '.']]

from itertools import chain
tokenized_list = list(chain(*tokenized_list))
print(vectorizer.fit_transform(tokenized_list).todense())

Output:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

This re-fits the data each iteration, almost certainly not what is desired. — juanpa.arrivillaga, Apr 10 '18 at 08:20
For your earlier answer - it was giving me the same error as before. — explorer_x, Apr 10 '18 at 09:04
I need to get this for each unique value, for eg in this corpus there are 26 unique variables, so i should be getting 26 lists. Not more than that. — explorer_x, Apr 10 '18 at 09:07
@Rakesh Thanks i got what i desired from the above solution. — explorer_x, Apr 10 '18 at 09:42

Vectorize list of lists uisng countvectorizer() & tfidfvectorizer()

1 Answers1