I have a dataset of 54,000 rows and 7 columns. The values are both numeric and alphanumeric (quantitative and qualitative variables). I want to cluster it using the hclust function in R.
Let's take an example:
X <- data.frame(rnorm(54000, sd = 0.3),
                rnorm(54000, mean = 1, sd = 0.3),
                sample(LETTERS[1:24], 54000, replace = TRUE),
                sample(letters[1:10], 54000, replace = TRUE),
                round(rnorm(54000, mean = 25, sd = 3)),
                round(runif(n = 54000, min = 1000, max = 25000)),
                round(runif(54000, 0, 200000)))
colnames(X) <- c("A", "B", "C", "D", "E", "F", "G")
If I use the hclust function like this:
hclust(dist(X), method = "ward.D")
I get this error message:
Error: cannot allocate vector of size 10.9 Gb
What is the problem? I'm trying to create a 54k × 54k distance matrix, which is too big for my PC (4 GB of RAM). I've read that since R 3.0.0, 64-bit builds of R support long vectors (so an object of about 2.916e+09 elements, as in my example, can be addressed), so the limitation comes from my computer. I've tried hclust from stats, fastcluster and flashClust and I get the same problem.
In these packages, the functions are described like this:
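As a sanity check on the error message, here is a rough sketch of the arithmetic. It assumes that dist() stores only the n(n-1)/2 distances below the diagonal, each as an 8-byte double, and that R reports sizes in units of 1024^3 bytes; the variable names are only illustrative.

```r
n <- 54000                    # number of rows in X
n_pairs <- n * (n - 1) / 2    # pairwise distances kept by dist() (lower triangle)
bytes <- n_pairs * 8          # each distance is an 8-byte double
gib <- bytes / 1024^3         # convert to units of 1024^3 bytes
round(gib, 1)                 # 10.9, the size quoted in the error message
```

So even the compact lower-triangle storage is well beyond 4 GB of RAM, before hclust does any work.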
hclust(d, method="complete", members=NULL)
flashClust(d, method = "complete", members=NULL)
d: a dissimilarity structure as produced by dist.
We always need a dist matrix to make these functions work. I've also tried to raise the memory limit of my R session with this:
memory.limit(size = 4014)
memory.size(max = TRUE)
Question:
Is it possible to run a hierarchical clustering (or a similar clustering method) on a mixed quantitative/qualitative dataset in R without computing this dist() matrix?
Edit:
About k-means:
k-means works well for a big dataset composed of numerical values. In my example, I have both numeric and alphanumeric values. I've tried to transform my qualitative variables into binary numerical variables so that I can run k-means:
First data frame (example):
  Col1     Col2  Col3
1   12 43.93145 Alpha
2   45 44.76081  Beta
3   48 45.09708 Gamma
4   31 45.42278 Alpha
5   12 46.53709 Delta
6    7 39.07841  Beta
7   78 49.60947 Alpha
If I transform this into binary variables, I get this:
  Col1     Col2 Alpha Beta Gamma Delta
1   12 44.29369     1    0     0     0
2   45 43.90610     0    1     0     0
3   48 44.82659     0    0     1     0
4   31 43.09096     1    0     0     0
5   12 42.71190     0    0     0     1
6    7 43.71710     0    1     0     0
7   78 42.24293     1    0     0     0
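For reference, the binary transformation above can be sketched with model.matrix from base R. This is only a minimal illustration on the 7-row toy data frame from the first example; the names df, dummies and binary_df are hypothetical.

```r
# Toy data frame matching the first example
df <- data.frame(
  Col1 = c(12, 45, 48, 31, 12, 7, 78),
  Col2 = c(43.93145, 44.76081, 45.09708, 45.42278, 46.53709, 39.07841, 49.60947),
  Col3 = factor(c("Alpha", "Beta", "Gamma", "Alpha", "Delta", "Beta", "Alpha"),
                levels = c("Alpha", "Beta", "Gamma", "Delta"))
)

# "~ Col3 - 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Col3 - 1, data = df)
colnames(dummies) <- levels(df$Col3)   # replace "Col3Alpha" etc. by the level names

# Keep the numeric columns and append one indicator column per level
binary_df <- cbind(df[, c("Col1", "Col2")], dummies)
binary_df
```

The result has one extra column per distinct level of Col3, which is what makes this encoding grow quickly when a variable has many levels.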
This is fine if I only have a few categories (levels), but a real dataset could have about 10,000 levels for a table of 50k rows. I don't think k-means is the right solution for this type of problem.
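To put a number on that, here is a rough estimate of the size of the indicator matrix alone. It assumes the 0/1 columns are stored as a dense numeric matrix with 8 bytes per cell; the figures 50,000 and 10,000 are the orders of magnitude mentioned above, not measurements.

```r
n_rows   <- 50000                  # rows in the table
n_levels <- 10000                  # distinct levels, i.e. indicator columns
bytes <- n_rows * n_levels * 8     # dense numeric matrix, 8 bytes per cell
gib <- bytes / 1024^3              # size in units of 1024^3 bytes
round(gib, 1)                      # about 3.7
```

That is already close to the 4 GB of RAM available, before k-means allocates anything of its own.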