I have a dataframe df with the columns id, text, lang, stemmed, and tf_idfresult. df has 24 rows. From the tf-idf result I computed a dissimilarity matrix (distance matrix), which gives how dissimilar any two rows of the dataframe are.
A sample of the dataframe:
   id   text                    lang  stemmed                     tf_idfresult
0  234  Hi this                 en    [hi, this]                  [0.0, 0.2]
1  232  elephants ruined again  en    [elephants, ruined, again]  [0.1, 0.0, 0.0]
2  441  there are palm trees    en    [there, are, palm, trees]   [0.2, 0.54, 0.0, 0.823]
3  235  so much to do           en    [so, much, to, do]          [0.1, 0.1, 0.0, 0.0]
The dissimilarity matrix dis was computed with the help of the cosine_similarity function and looks like this:
[[0.0, 0.3, 0.1, 1, 1...]
[0.1, ...]
.
.
and so on, for 24 rows and 24 columns.
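For reference, a matrix like this is commonly taken as 1 minus the cosine similarity of each pair of rows. The sketch below shows that relationship in plain Python. It assumes every tf-idf vector has the same length (one entry per vocabulary term), and the names cosine_dissimilarity, dissimilarity_matrix and toy_vectors are made up for illustration; it is not necessarily how dis above was produced.

```python
import math

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: an all-zero vector is maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def dissimilarity_matrix(vectors):
    """Build the full n x n matrix, with zeros on the diagonal."""
    n = len(vectors)
    return [
        [0.0 if i == j else cosine_dissimilarity(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]

# Toy vectors (invented, not taken from df)
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_dis = dissimilarity_matrix(toy_vectors)
```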
I used the silhouette method and found that the best value for k is 3. I tried
pam = kmedoids(dis, initialmedoids)
but I don't know how to find the initial medoids. The expected output is the dataframe split into three clusters. I don't need any specific format for the output.
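To make the initial medoids part concrete: they are usually just k row indices into the distance matrix, and a simple common choice is to draw them at random. Below is a minimal plain-Python sketch of k-medoids on a precomputed distance matrix, using the alternating assign/update variant rather than full PAM swaps. The names k_medoids, assign_clusters and toy_dis are hypothetical, the toy matrix is invented, and this does not reproduce the kmedoids call above.

```python
import random

def assign_clusters(dis, medoids):
    """Label each row with the position (0..k-1) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dis[i][medoids[c]])
        for i in range(len(dis))
    ]

def k_medoids(dis, k, seed=0, max_iter=100):
    """Alternating k-medoids on a square distance matrix (list of lists)."""
    rng = random.Random(seed)
    # Initial medoids: k distinct row indices chosen at random.
    medoids = rng.sample(range(len(dis)), k)
    for _ in range(max_iter):
        labels = assign_clusters(dis, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Empty cluster (possible with duplicate rows): keep old medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to all members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dis[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_clusters(dis, medoids)

# Toy example: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
toy_dis = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(toy_dis, k=2, seed=0)
```

With the real data this would be called as k_medoids(dis, k=3), assuming dis is a 24 x 24 list of lists (or is converted to one). The returned labels have one entry per row, so something like df["cluster"] = labels would attach them to the dataframe. Because the start is random, different seeds can give different clusterings; running a few seeds and keeping the result with the lowest total distance to the medoids is a common workaround.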