Clustering many datasets in R by 2 variables

Question

I have 15 data frames that all share the same structure: (id, V1, V2). I want to cluster them based on V1 and V2.

Sample Data from df1:

ID, V1, V2
1, 0.5, 25
2, 0.3, 2

Hi! Please reference [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Consider using `dput()` to share some of your data, code that you have written and/or other resources you have used to try to get to your desired result. — OTStats, Nov 22 '19 at 20:15
Hi! Can you provide data from one of your data frames? To apply clustering appropriately, you need to know the kind of distribution and also what to expect. — StupidWolf, Nov 22 '19 at 20:37

score 1 · Accepted Answer · answered Nov 23 '19 at 16:52

If I understood you correctly, you want to group similar data.frames together. All these data.frames have the same structure, so you can "flatten" each data.frame into a vector and then cluster those vectors.

First we simulate data that looks like yours:

set.seed(100)
# 10 data frames with V1 and V2 drawn from N(0, 1)
d1 <- replicate(10,
                data.frame(id = 1:2,
                           V1 = rnorm(2, 0, 1),
                           V2 = rnorm(2, 0, 1)),
                simplify = FALSE)
names(d1) = paste("df", 1:10, sep = "")

# 5 data frames with V1 and V2 drawn from N(3, 1)
d2 <- replicate(5,
                data.frame(id = 1:2,
                           V1 = rnorm(2, 3, 1),
                           V2 = rnorm(2, 3, 1)),
                simplify = FALSE)
names(d2) = paste("df", 11:15, sep = "")

alldataframes = c(d1, d2)

I keep all 15 data frames in a list. The first 10 (df1 to df10) are drawn from a different distribution than the last 5 (df11 to df15). Next we flatten each data frame:

df_matrix = t(sapply(alldataframes,function(i)unlist(i[,-1])))

Now you have a matrix in which every row corresponds to one data.frame and every column corresponds to one cell of that data.frame (the id column is dropped).
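To make the flattening step concrete, here is a minimal sketch on a single toy data frame. The values are taken from the sample rows in the question; the object names `df` and `flat` are only illustrative.

```r
# A toy data frame with the same (id, V1, V2) layout as in the question
df <- data.frame(id = 1:2, V1 = c(0.5, 0.3), V2 = c(25, 2))

# Drop the id column; unlist() then concatenates the remaining columns,
# column by column, into one named numeric vector
flat <- unlist(df[, -1])

# The names are the column name followed by the row position:
# V11, V12, V21, V22
print(flat)
```

Each such vector becomes one row of `df_matrix`, which is why the column names in the matrix are V11, V12, V21 and V22.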

head(df_matrix)
           V11        V12         V21         V22
df1 -0.5021924  0.1315312 -0.07891709  0.88678481
df2  0.1169713  0.3186301 -0.58179068  0.71453271
df3 -0.8252594 -0.3598621  0.08988614  0.09627446
df4 -0.2016340  0.7398405  0.12337950 -0.02931671

You can run clustering on this matrix, for example with k-means:

kmeans(df_matrix,2)
K-means clustering with 2 clusters of sizes 5, 10

Cluster means:
         V11       V12        V21       V22
1  2.6399811 3.9138233  2.3044190 3.0689895
2 -0.4166189 0.2219812 -0.1535242 0.7488705

Clustering vector:
 df1  df2  df3  df4  df5  df6  df7  df8  df9 df10 df11 df12 df13 df14 df15 
   2    2    2    2    2    2    2    2    2    2    1    1    1    1    1 

Within cluster sum of squares by cluster:
[1] 26.55861 12.17958
 (between_SS / total_SS =  74.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
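If you want to use the result rather than just print it, one option is to store the fit and read the `cluster` component listed above. The following is a sketch under a few assumptions: `make_df` is a hypothetical helper that rebuilds simulated data of the same shape, `nstart = 25` is an arbitrary choice, and the cluster labels (1 or 2) are arbitrary and may be swapped between runs.

```r
set.seed(100)

# Hypothetical helper: one simulated data frame with mean mu for V1 and V2
make_df <- function(mu) {
  data.frame(id = 1:2, V1 = rnorm(2, mu, 1), V2 = rnorm(2, mu, 1))
}

# 10 data frames around 0 and 5 around 3, named df1 ... df15
alldataframes <- c(replicate(10, make_df(0), simplify = FALSE),
                   replicate(5, make_df(3), simplify = FALSE))
names(alldataframes) <- paste0("df", 1:15)

# Flatten: one row per data frame, id column dropped
df_matrix <- t(sapply(alldataframes, function(d) unlist(d[, -1])))

# nstart = 25 reruns k-means from 25 random starts and keeps the best fit
km <- kmeans(df_matrix, centers = 2, nstart = 25)

# Named integer vector: one cluster label per data frame
print(km$cluster)

# Names of the data frames, grouped by cluster label
groups <- split(names(km$cluster), km$cluster)
print(groups)
```

Note that k-means is sensitive to the scale of the columns. If V1 and V2 are on very different scales in your real data, as the sample rows in the question suggest, it may be worth standardising the columns with `scale()` before clustering.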

What if there were actually only one value column in each data frame, and all of them had the same number of rows (V1 is the ID)? The matrix should then have a column for the id (the row number of the data frame), and the values should be the V2 of each data frame. — pochi, Nov 25 '19 at 19:09
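One possible reading of this follow-up can be sketched as below. It assumes that every data frame has an id column V1 and a single value column V2, and that the ids appear in the same order in every data frame. The data and the names `dfs` and `wide` are made up for illustration.

```r
set.seed(1)

# Hypothetical input: 15 data frames, each with ids 1..4 in V1
# and one value column V2
dfs <- replicate(15, data.frame(V1 = 1:4, V2 = rnorm(4)), simplify = FALSE)
names(dfs) <- paste0("df", 1:15)

# One column per data frame, one row per id
wide <- sapply(dfs, function(d) d$V2)

# Prepend the id column (taken from the first data frame)
wide <- cbind(id = dfs[[1]]$V1, wide)
print(wide[, 1:4])

# To cluster the data frames themselves, drop the id column and
# transpose so that each data frame becomes one row
km <- kmeans(t(wide[, -1]), centers = 2)
print(km$cluster)
```

If the ids are not in the same order everywhere, the data frames would need to be sorted or merged by V1 first.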
