
I have a dataframe df with the columns id, text, lang, stemmed, and tf_idfresult, and 24 rows. From the TF-IDF result I computed a dissimilarity (distance) matrix, which gives how dissimilar each pair of rows in the dataframe is.

A sample of how the dataframe looks is:

   id   text                    lang  stemmed                     tf_idfresult
0  234  Hi this                 en    [hi, this]                  [0.0, 0.2]
1  232  elephants ruined again  en    [elephants, ruined, again]  [0.1, 0.0, 0.0]
2  441  there are palm trees    en    [there, are, palm, trees]   [0.2, 0.54, 0.0, 0.823]
3  235  so much to do           en    [so, much, to, do]          [0.1, 0.1, 0.0, 0.0]

The dissimilarity matrix dis was computed with the help of the cosine_similarity function and looks like this:

[[0.0, 0.3, 0.1, 1, 1...]
[0.1, ...]
.
.

for 24 rows and 24 columns.
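
For reference, one common way to turn cosine similarity into a dissimilarity matrix is to take 1 minus the similarity. The sketch below does this with plain NumPy rather than the cosine_similarity function mentioned above. It assumes every TF-IDF row has been padded to the same vocabulary length, which the sample rows shown earlier are not. The function name cosine_dissimilarity and the tfidf values are hypothetical and only for illustration.

```python
import numpy as np

def cosine_dissimilarity(vectors):
    """Return the pairwise cosine distance matrix (1 - cosine similarity)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
    unit = X / norms                 # rows scaled to unit length
    dis = 1.0 - unit @ unit.T        # cosine distance between every pair of rows
    np.fill_diagonal(dis, 0.0)       # a row is at distance 0 from itself
    return np.clip(dis, 0.0, 2.0)    # remove tiny negative values from rounding

# Hypothetical TF-IDF rows, all with the same length (assumption)
tfidf = [
    [0.0, 0.2, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.2, 0.54, 0.0, 0.823],
    [0.1, 0.1, 0.0, 0.0],
]
dis_example = cosine_dissimilarity(tfidf)
```

The result is a square, symmetric matrix with zeros on the diagonal, which is the shape a k-medoids implementation expects for precomputed distances.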

I used the silhouette method and found that the best value for k is 3. I tried:

pam = kmedoids(dis, initialmedoids)

but I don't know how to find the initial medoids. The expected output is the dataframe split into three clusters. I don't need any specific format for the output.

aak122114
  • Please provide a full copy and pastable sample pandas dataset as well as your expected output. Please see how to ask pandas questions here: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Dec 06 '20 at 05:17
  • @DavidErickson Okay I will edit the question – aak122114 Dec 06 '20 at 05:24

1 Answer


I've also been trying to work with k-medoids and have been so lost! I read about a handful of tools for doing it. Two of them are:

  • sklearn_extra.cluster.KMedoids. Set the keyword arguments method='pam' and metric='precomputed', then fit it on your distance matrix dis. After fitting, kmedoids.labels_ shows which cluster each sample was assigned to. You can use this tutorial as a basis for writing a program that separates the samples according to clusters.

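  • pyclustering.cluster.kmedoids. This is the one you're using, I guess? Following your code, you should:

from pyclustering.cluster.kmedoids import kmedoids

# initialmedoids is a list of k distinct row indices into dis; [0, 1, 2] is only a placeholder for k = 3
initialmedoids = [0, 1, 2]

# dis is a distance matrix rather than raw points, so pass data_type='distance_matrix'
pam = kmedoids(dis, initialmedoids, data_type='distance_matrix')
pam.process()

# one list of row indices per cluster
clusters = pam.get_clusters()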
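
On the initial medoids themselves: they are just k row indices into the distance matrix, so any k distinct indices will work as a starting point. The sketch below is one possible heuristic (greedy farthest-point seeding), not something either library prescribes. It also shows how a list of clusters, in the list-of-index-lists form that get_clusters() returns, could be turned into one label per row. The names pick_initial_medoids and clusters_to_labels and the toy data are hypothetical.

```python
import numpy as np

def pick_initial_medoids(dis, k, seed=0):
    """Pick k distinct row indices that are spread out in a distance matrix."""
    D = np.asarray(dis, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = [int(rng.integers(D.shape[0]))]   # start from a random row
    while len(medoids) < k:
        # distance from every row to its nearest already-chosen medoid
        nearest = D[:, medoids].min(axis=1)
        nearest[medoids] = -1.0                 # never pick the same row twice
        medoids.append(int(nearest.argmax()))   # take the farthest remaining row
    return medoids

def clusters_to_labels(clusters, n_rows):
    """Convert a list of index lists into one cluster id per row."""
    labels = np.full(n_rows, -1, dtype=int)     # -1 marks rows in no cluster
    for cluster_id, members in enumerate(clusters):
        labels[list(members)] = cluster_id
    return labels

# Toy distance matrix: six points on a line forming three tight pairs
positions = np.array([0.0, 0.1, 5.0, 5.1, 10.0, 10.1])
toy_dis = np.abs(positions[:, None] - positions[None, :])

toy_medoids = pick_initial_medoids(toy_dis, k=3)

# Hypothetical clustering result in the list-of-index-lists form
toy_clusters = [[0, 1], [2, 3], [4, 5]]
toy_labels = clusters_to_labels(toy_clusters, n_rows=6)
```

With the real data, the labels could be attached as a new column (for example df['cluster'] = labels, assuming the dataframe rows are in the same order as the rows of dis) and the dataframe split with df.groupby('cluster').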
Maria