Blaze with Scikit Learn K-Means

Question

I am trying to fit a Blaze data object with scikit-learn's KMeans.

from blaze import *
from sklearn.cluster import KMeans
data_numeric = Data('data.csv')
data_cluster = KMeans(n_clusters=5)
data_cluster.fit(data_numeric)

Data Sample:

It's throwing an error:

I have been able to do this with a pandas DataFrame. Is there any way to feed a Blaze object to this function?

Double check to see the size of the array that you're passing into k-means. Typically this error is thrown when a 1-D array is being passed. — jonplaca, Sep 29 '16 at 15:13

score 5 · Accepted Answer · answered Oct 07 '16 at 14:53

I think you need to convert your data into a NumPy array before you fit.

from blaze import *
import numpy

from sklearn.cluster import KMeans
data_numeric = numpy.array(data('data.csv'))
data_cluster = KMeans(n_clusters=5)
data_cluster.fit(data_numeric)
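
As a minimal sketch of the input shape that `fit` expects, the following uses a synthetic NumPy array in place of the CSV contents. The array values and the `n_init` and `random_state` settings are assumptions for illustration and do not come from the question. It only shows that KMeans accepts a 2-D array of shape (n_samples, n_features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the CSV contents (assumption): 100 rows, 3 numeric columns.
rng = np.random.default_rng(0)
data_numeric = rng.random((100, 3))

# KMeans expects a 2-D array of shape (n_samples, n_features).
assert data_numeric.ndim == 2

# n_init and random_state are set only to make the run repeatable.
data_cluster = KMeans(n_clusters=5, n_init=10, random_state=0)
data_cluster.fit(data_numeric)

print(data_cluster.cluster_centers_.shape)  # one centre per cluster
print(data_cluster.labels_.shape)           # one label per input row
```

If the array built from the real data has `ndim == 1`, or a non-numeric dtype, that is worth checking before calling `fit`.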

score 2 · Answer 2 · edited Oct 10 '16 at 06:57


sklearn.cluster.KMeans doesn't support input of type blaze.interactive._Data, which is the type of data_numeric in your code.

You can use data_cluster.fit(data_numeric.peek()) instead: peek() converts data_numeric to a DataFrame, a type that sklearn.cluster.KMeans supports.

edited Oct 10 '16 at 06:57 · mhasan

answered Oct 10 '16 at 06:22 · yhuang

score 1 · Answer 3 · answered Oct 06 '16 at 09:14


I would suggest choosing a number of clusters (K) that is much smaller than the number of training examples in your data set. It does not make sense to run K-Means when the number of clusters you want is greater than or equal to the number of training examples. The error occurs when you pass a Blaze object with an unsuitable shape to KMeans. Please check: https://blaze.readthedocs.io/en/latest/csv.html
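
The rule of thumb above can be sketched as a small guard that runs before fitting. `check_cluster_count` is a hypothetical helper name, not part of scikit-learn or Blaze, and the "greater than or equal" threshold simply mirrors the advice in this answer:

```python
def check_cluster_count(n_samples, n_clusters):
    """Hypothetical helper: reject K values that do not suit the data size."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Follows the answer's advice: K should stay below the number of rows.
    if n_clusters >= n_samples:
        raise ValueError(
            f"n_clusters={n_clusters} should be smaller than n_samples={n_samples}"
        )
    return n_clusters


# Around 30000 rows with 5 clusters passes the check.
k = check_cluster_count(30000, 5)
print(k)
```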

answered Oct 06 '16 at 09:14 · PJay

I am passing around 30,000 rows of data to the function; I have pasted only a sample of 3 rows here. – sachin saxena Oct 06 '16 at 11:31
You need to reshape the array passed in the `data_cluster.fit(data_numeric)` call into a 2-D array, which scikit-learn's K-Means will accept. – PJay Oct 06 '16 at 11:46
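
A minimal sketch of the reshape suggested in the comment above, using NumPy only. `as_2d` is a hypothetical helper name, and treating a flat array as a single feature column is an assumption about what the data means:

```python
import numpy as np


def as_2d(values):
    """Hypothetical helper: return a 2-D float array of shape (n_samples, n_features)."""
    arr = np.asarray(values, dtype=float)
    if arr.ndim == 1:
        # Treat a flat array as n_samples rows with a single feature each.
        arr = arr.reshape(-1, 1)
    if arr.ndim != 2:
        raise ValueError(f"expected a 1-D or 2-D input, got {arr.ndim} dimensions")
    return arr


flat = [1.0, 2.0, 3.0, 4.0]
print(as_2d(flat).shape)  # (4, 1)
```

An input that is already 2-D passes through unchanged, so the helper is safe to call on either form.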

score 0 · Answer 4 · answered Oct 12 '16 at 06:30


Yes, before you fit, you need to convert your pandas DataFrame into a NumPy array. Now it works fine. I think @aberger already answered this.

thank you!

answered Oct 12 '16 at 06:30 · heart hacker


Converting to a DataFrame is an expensive process, but it looks like there is no other way to do it. – sachin saxena Oct 12 '16 at 07:15
