
I want to partition market-basket products into balanced clusters (clusters of equal size). I tried K-means and PAM, but I can't figure out how to make the number of elements (products) in each cluster the same. For example, if I have N products and k clusters, I want N/k elements in each cluster (assuming that N is divisible by k). Any suggestions? Thanks!

Here is an example: I have the following dataset:

df


    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
700109 0 0 0 0 0 0 0 0 0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
700174 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
700192 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1
700231 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
700534 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
700840 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
700871 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
700874 0 0 0 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
723229 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1
723243 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
723351 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
727105 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
727106 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
727121 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

It is a matrix where the rows (40 rows) stand for the products and the columns (100 columns) are the transactions. Then I applied PAM:

> library(cluster)
> pam(df, 4, diss = FALSE, metric = "manhattan")

Clustering vector:
700109 700174 700192 700231 700534 700840 700871 700874 723229 723243 723351 
1      1      1      1      1      1      1      1      2      1      1
727105 727106 727121 727122 727125 727138 727220 727300 727302 727303 727311 
1      1      1      1      1      1      2      1      3      1      4 
727314 727342 727345 727347 727406 727415 727419 727710 728016 728017 728018 
1      3      4      1      1      1      1      1      1      1      1 
728020 728085 728086 728087 728088 728132 728134 
1      1      1      1      1      1      1 

where 700109, 70014... are my products and the number of clusters is k = 4. As you can see, cluster 1 contains 34 elements, while clusters 2, 3 and 4 contain 2 elements each, so they are not balanced. I want a result in which each of the four clusters contains 10 elements (because I took 40 products in this example).
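
To make the goal concrete, here is a minimal sketch of one possible heuristic, not a definitive method: take cluster centres from any ordinary clustering, then hand out products greedily by Manhattan distance while capping every cluster at N/k members. The function name `balanced_assign` and the random toy matrix are assumptions made up for illustration. The centres come from base R's `kmeans()` only so that the example runs without extra packages; the `medoids` matrix returned by `pam()` could be passed instead. The greedy pass guarantees equal sizes but not an optimal total distance.

```r
# Hypothetical helper (not part of any package): give every cluster
# exactly nrow(x) / nrow(centers) members.
balanced_assign <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  n <- nrow(x)
  k <- nrow(centers)
  stopifnot(n %% k == 0)
  cap <- n %/% k                          # target size of each cluster

  # n x k matrix of Manhattan distances from each product to each centre
  d <- sapply(seq_len(k), function(j) {
    rowSums(abs(sweep(x, 2, centers[j, ])))
  })

  cluster <- integer(n)                   # 0 means "not assigned yet"
  size <- integer(k)
  # Visit all (product, cluster) pairs from closest to farthest
  for (idx in order(d)) {
    i <- (idx - 1L) %% n + 1L             # row of d: product
    j <- (idx - 1L) %/% n + 1L            # column of d: cluster
    if (cluster[i] == 0L && size[j] < cap) {
      cluster[i] <- j
      size[j] <- size[j] + 1L
    }
  }
  names(cluster) <- rownames(x)
  cluster
}

# Made-up data with the shape described above: 40 products x 100 transactions
set.seed(1)
x <- matrix(rbinom(40 * 100, 1, 0.05), nrow = 40,
            dimnames = list(paste0("p", 1:40), NULL))

centers <- kmeans(x, centers = 4, nstart = 5)$centers
cl <- balanced_assign(x, centers)
table(cl)   # every cluster ends up with 10 products
```

Because products are placed in order of increasing distance, the ones that arrive late may land in a cluster that is not their nearest, which is the price of forcing equal sizes on data that is not naturally balanced.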

BS.Mira
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 29 '18 at 13:56
  • Have you seen [this suggestion](https://stackoverflow.com/a/5452702/8485403) by @FredFoo to use Lloyd's algorithm? – csgroen Mar 29 '18 at 14:20
  • Check this [thread](https://stackoverflow.com/questions/37619386/clustering-algorithm-for-obtaining-equal-sized-clusters) – AshOfFire Mar 29 '18 at 14:21
  • Thank you. I found some algorithms, but they are a bit hard for me to implement because I am new to R. I am looking for solutions written as R scripts. – BS.Mira Mar 29 '18 at 16:07
  • Are these values in the first column (700874 etc) just row-labels, or anonymous identifiers, or integers that carry actual semantics (e.g. they are prices in Zimbabwean dollars)? Or are you just dealing with 0s and 1s in a sparse matrix which could be visualised as "few white pixels in a black image", and you just want to separate these pixels on distance measures between those pixels? – knb Mar 30 '18 at 08:59
  • The values in the first column are the product IDs. The first row (1, 2, 3, ...) holds the transaction IDs. If product X is found in transaction Y, then the cell d(X,Y) is 1; otherwise it is 0. For example, product 700874 is found only in transaction 10. – BS.Mira Mar 30 '18 at 09:09
  • So why do you think the natural grouping of transactions would have partitions of the same size? It's certainly not supported by the data above. – Has QUIT--Anony-Mousse Mar 31 '18 at 05:21
