scikit-learn kmeans clustering text with jaccard distance

Question

I'm trying to use sklearn to cluster some tweets, which I store as a dictionary keyed by tweet id. I have 25 initial centroids, given as tweet ids. I implemented this with my own functions, but I don't know how to implement it with sklearn.

# {845512:'tweet id 845512', 543115:'tweet id 543115', ...}
# initial_centroids = [845512, 546318, 84632, ...] - 25 centroids

NOTE: tweets_vec needs to be built using Jaccard distance.
At the moment tweets_vec is the Jaccard distance matrix (this may be wrong, I don't know).

kmeans = KMeans(n_clusters=25, init=initial_seeds).fit(tweets_vec)

I built a 2D matrix that holds the Jaccard distances. I don't know how to set init in the KMeans call; it raises an error saying that the value is not an ndarray.

What exactly should I pass to it?

Possible duplicate with: https://stackoverflow.com/questions/5529625/is-it-possible-to-specify-your-own-distance-function-using-scikit-learn-k-means where you could find your solution in the top answer. — Yuan JI, Jul 02 '19 at 07:53
That's not what I'm looking for. I may have stated the problem vaguely. I'll edit it to make it clearer. — S Roshan, Jul 02 '19 at 08:08

score -1 · Answer 1 · answered Jul 04 '19 at 20:27

If you pass init=initial_centroids to KMeans, then initial_centroids must be an array of shape (n_clusters, n_features). If you are using only one feature, you might have to reshape your array. Try:

init_cent_array = np.asarray(initial_centroids).reshape(len(initial_centroids), -1)

Then pass init_cent_array as the init argument to KMeans. Hope this helps.
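To make the shape requirement concrete, here is a minimal sketch. It assumes, as in the question, that the rows of the Jaccard distance matrix are used as the feature vectors; the tweet ids, texts, and the helper jaccard_distance are made up for illustration. The point is that init gets one row of tweets_vec per seed tweet rather than the list of ids. Note that KMeans still measures Euclidean distance between those rows, so this does not turn it into a Jaccard-based k-means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: tweet id -> tweet text (ids and texts are made up).
tweets = {
    101: "cats are great pets",
    102: "dogs are great pets",
    103: "stock markets fell today",
    104: "markets fell sharply today",
    105: "new phone released today",
    106: "new phone camera review",
}
initial_centroids = [101, 103, 105]  # tweet ids chosen as seeds

ids = list(tweets)  # fixes the row order of the matrix
row_of = {tid: i for i, tid in enumerate(ids)}
token_sets = [set(tweets[tid].split()) for tid in ids]


def jaccard_distance(a, b):
    """1 - (size of intersection) / (size of union); 0 if both sets are empty."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


n = len(ids)
# Pairwise Jaccard distances, shape (n_tweets, n_tweets).
tweets_vec = np.array(
    [[jaccard_distance(token_sets[i], token_sets[j]) for j in range(n)]
     for i in range(n)]
)

# init must be an ndarray of shape (n_clusters, n_features):
# one row of tweets_vec per seed tweet, not the ids themselves.
init = tweets_vec[[row_of[tid] for tid in initial_centroids]]

# n_init=1 because the starting centroids are given explicitly.
kmeans = KMeans(n_clusters=len(initial_centroids), init=init, n_init=1)
labels = kmeans.fit_predict(tweets_vec)
```

With the 25 seed ids from the question, the same indexing step would give an init array with 25 rows and as many columns as tweets_vec has.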
