
I'm trying to cluster text comments using pandas and scikit-learn.

import pprint
import pandas
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pandas.read_csv('text.csv', encoding='utf-8')

dataset_list = dataset.values.tolist()


vectors = TfidfVectorizer()
X = vectors.fit_transform(dataset_list)

clusters_number = 20

model = KMeans(n_clusters = clusters_number, init = 'k-means++', max_iter = 300, n_init = 1)

model.fit(X)

centers = model.cluster_centers_
labels = model.labels_

clusters = {}
for comment, label in zip(dataset_list, labels):
    print('Comment:', comment)
    print('Label:', label)

    # Group each comment under its cluster label
    try:
        clusters[str(label)].append(comment)
    except KeyError:
        clusters[str(label)] = [comment]

pprint.pprint(clusters)

But I get the following error, even though I never call lower() anywhere:

File "clustering.py", line 19, in <module>
    X = vetorizer.fit_transform(dataset_list)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
self.fixed_vocabulary_)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
for feature in analyze(doc):
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/lib/python3/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

I don't understand: my text (text.csv) is already lowercase, and at no point do I call lower().

Data:

hello wish to cancel order thank you confirmation

hello would like to cancel order made today store house world

dimensions bed not compatible would like to know how to pass cancellation refund send today cordially

hello possible cancel order cordially

hello wants to cancel order request refund

hello wish to cancel this order can indicate process cordially

hello seen date delivery would like to cancel order thank you

hello wants to cancel matching order good delivery n ° 111111

hi would like to cancel this order

hello order product store cancel act doublon advance thank you cordially

hello wishes to cancel order thank you kindly refund greetings

hello possible cancel order please thank you in advance forward cordially

marin
  • try `vectors = TfidfVectorizer(lowercase=False)` – Rakesh Jul 24 '18 at 11:46
  • Can you try changing the name of your script from clustering.py to my_clustering.py ? – Chris_Rands Jul 24 '18 at 11:46
  • Didn't work, I have another error: TypeError: expected string or bytes-like object – marin Jul 24 '18 at 12:08
  • Also change the name of the csv text file to my_text.csv – Chris_Rands Jul 24 '18 at 12:14
  • can you add the data ? – seralouk Jul 24 '18 at 17:37
  • How many columns do you have in `text.csv`? If more than one column, then you cannot use TfidfVectorizer on it. – Vivek Kumar Jul 25 '18 at 07:24
  • I only have one column (a few thousand long sentences) – marin Jul 25 '18 at 07:42
  • Can you confirm that `pd.read_csv` is returning single column data. Try `print(len(dataset.columns))`. Can you show a part of your actual test.csv (not examples of sentences), which I can load the same way you have using `pd.read_csv`? – Vivek Kumar Jul 25 '18 at 11:06
  • For print(len(dataset.columns)), my result is 1 (I have 60,000 rows in my dataset). I edited my comment showing a part of my dataset (there are no white lines, I just don't know how to edit a dataset here). Thanks! – marin Jul 25 '18 at 12:08

1 Answer


The error is in this line:

dataset_list = dataset.values.tolist()

Here, dataset is a pandas DataFrame, so dataset.values returns a 2-D NumPy array of shape (n_rows, 1), even when there is only one column. Calling tolist() on that array produces a list of lists, something like this:

print(dataset_list)

[['hello wish to cancel order thank you confirmation'],
 ['hello would like to cancel order made today store house world'],
 ['dimensions bed not compatible would like to know how to pass cancellation refund send today cordially'],
 ...]

As you can see, there are two levels of square brackets here: every sentence is wrapped in its own one-element list.

TfidfVectorizer expects a flat list of sentences (strings), not a list of lists. It treats each element as a document and calls lower() on it during preprocessing, but here each element is a list, hence the AttributeError.
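To make the shape problem concrete, here is a minimal sketch that reproduces it without the CSV file. The two-row DataFrame and the column name "comment" are assumptions standing in for text.csv; only the DataFrame-to-list conversion matches the question's code.

```python
import pandas as pd

# Hypothetical stand-in for text.csv: one column, two sentences.
# The column name "comment" is an assumption, not taken from the question.
dataset = pd.DataFrame({
    "comment": [
        "hello wish to cancel order thank you confirmation",
        "hi would like to cancel this order",
    ]
})

values = dataset.values      # 2-D array, one row per sentence
nested = values.tolist()     # list of one-element lists

print(values.shape)               # (2, 1)
print(type(nested[0]).__name__)   # list

# This is effectively what the vectorizer's preprocessor attempts on each element.
try:
    nested[0].lower()
except AttributeError as exc:
    message = str(exc)

print(message)  # 'list' object has no attribute 'lower'
```

The call to lower() comes from TfidfVectorizer's default preprocessing (lowercase=True), which is why it shows up in the traceback even though the script never calls it directly.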

So you just need to do this:

# Use ravel to convert 2-d to 1-d array
dataset_list = dataset.values.ravel().tolist()

OR

# Replace `column_name` with your actual column header;
# selecting a single column turns the DataFrame into a Series
dataset_list = dataset['column_name'].values.tolist()
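As a rough end-to-end check, the sketch below applies the flattened list to the vectorizer and KMeans from the question. The in-memory DataFrame, the column name "comment", the choice of 2 clusters, and the fixed random_state are all assumptions made to keep the example small and repeatable; they are not the asker's settings.

```python
from collections import defaultdict

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for text.csv (column name "comment" is an assumption).
dataset = pd.DataFrame({
    "comment": [
        "hello wish to cancel order thank you confirmation",
        "hello would like to cancel order made today store house world",
        "hello wants to cancel order request refund",
        "hi would like to cancel this order",
    ]
})

# Selecting one column gives a Series, so tolist() yields a flat list of strings.
dataset_list = dataset["comment"].tolist()

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset_list)  # one row per sentence

# Small illustrative settings, not the question's 20 clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

# Group the sentences by their assigned cluster label.
clusters = defaultdict(list)
for comment, label in zip(dataset_list, model.labels_):
    clusters[int(label)].append(comment)

print(X.shape[0])          # number of documents that were vectorized
print(dict(clusters))
```

Using defaultdict(list) replaces the try/except grouping in the question; either approach works once dataset_list contains plain strings.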
Vivek Kumar