10

I am trying to fit a Blaze data object with scikit-learn's KMeans.

from blaze import *
from sklearn.cluster import KMeans
data_numeric = Data('data.csv')
data_cluster = KMeans(n_clusters=5)
data_cluster.fit(data_numeric)

Data Sample:

A  B  C
1  32 34
5  57 92
89 67 21

It throws an error:

[image: screenshot of the error]

I have been able to do it with a pandas DataFrame. Is there any way to feed a Blaze object to this function?

sachin saxena
  • Double check to see the size of the array that you're passing into k-means. Typically this error is thrown when a 1-D array is being passed. – jonplaca Sep 29 '16 at 15:13
  • How many samples do you have in your Blaze object? – MMF Oct 07 '16 at 15:03

4 Answers

5

I think you need to convert your data into a NumPy array before you fit.

from blaze import *
import numpy

from sklearn.cluster import KMeans
data_numeric = numpy.array(data('data.csv'))
data_cluster = KMeans(n_clusters=5)
data_cluster.fit(data_numeric)
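
The same idea can be sketched without Blaze, assuming pandas is used to load the file (the question says that route already works). The in-memory CSV below stands in for data.csv and holds only the three sample rows, and `n_clusters=2` is used only because three rows cannot support five clusters; both are assumptions for illustration, not part of the original answer.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for data.csv: the three sample rows from the question (assumption).
csv_text = "A,B,C\n1,32,34\n5,57,92\n89,67,21\n"
frame = pd.read_csv(io.StringIO(csv_text))

# KMeans expects a 2-D array of shape (n_samples, n_features).
data_numeric = frame.to_numpy(dtype=float)

# Only 3 rows here, so use 2 clusters instead of the question's 5.
data_cluster = KMeans(n_clusters=2, n_init=10, random_state=0)
data_cluster.fit(data_numeric)

print(data_numeric.shape)                   # (3, 3)
print(data_cluster.cluster_centers_.shape)  # (2, 3)
```

With the real 30000-row file, `n_clusters=5` as in the question should be fine.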
aberger
2

sklearn.cluster.KMeans doesn't support input data of type blaze.interactive._Data, which is the type of data_numeric in your code.

You can use data_cluster.fit(data_numeric.peek()) instead: peek() hands back the data as a DataFrame, a type that sklearn.cluster.KMeans supports.

mhasan
yhuang
1

I would suggest choosing the number of clusters (K) to be much smaller than the number of training examples in your data set. It does not make sense to run the K-Means algorithm when the number of clusters you want is greater than or equal to the number of training examples. The error occurs when you pass the Blaze object to the KMeans function with a shape it does not accept. Please check: https://blaze.readthedocs.io/en/latest/csv.html
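
To illustrate the point about K versus the number of samples, here is a small sketch using only the three sample rows from the question. The helper `try_fit` is a hypothetical name introduced for this example. As far as I know, scikit-learn raises a ValueError when there are fewer samples than clusters, which is the behaviour the sketch relies on.

```python
import numpy as np
from sklearn.cluster import KMeans

# The three sample rows from the question, as a 2-D float array.
X = np.array([[1, 32, 34], [5, 57, 92], [89, 67, 21]], dtype=float)


def try_fit(n_clusters):
    """Return True if KMeans fits X with n_clusters, False on ValueError."""
    try:
        KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
        return True
    except ValueError:
        return False


print(try_fit(5))  # False: 3 samples cannot fill 5 clusters
print(try_fit(2))  # True
```

Note that the asker reports about 30000 rows, so this check may rule the cause out rather than confirm it.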

PJay
  • I am passing around 30000 rows of data to the function; I have pasted only 3 sample rows here. – sachin saxena Oct 06 '16 at 11:31
  • You need to use the reshape function in the `data_cluster.fit(data_numeric)` command and reshape your array into a 2D array that scikit's K-Means will accept. – PJay Oct 06 '16 at 11:46
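
The reshape suggested in the comment above could look like the following sketch. It assumes the data arrived as a 1-D NumPy array, which the thread does not confirm; the values are made up from the sample columns for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# A 1-D array of shape (6,); KMeans rejects this shape.
values = np.array([1.0, 5.0, 89.0, 32.0, 57.0, 67.0])

# reshape(-1, 1) turns n values into n samples with one feature each.
as_column = values.reshape(-1, 1)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(as_column)

print(values.shape, as_column.shape)  # (6,) (6, 1)
```

If the array instead holds a single sample with several features, `reshape(1, -1)` is the matching call, although one sample is too few to cluster.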
0

Yes, before you fit you need to convert your data into a NumPy array. Now it works fine. I think @aberger already answered this.

thank you!

heart hacker