I have a dataset of 54,000 rows and 7 columns. The values are both numeric and alphanumeric (quantitative and qualitative variables). I want to cluster it using the hclust function in R.

Let's take an example:

X <- data.frame(rnorm(54000, sd = 0.3),
                rnorm(54000, mean = 1, sd = 0.3),
                sample( LETTERS[1:24], 54000, replace=TRUE),
                sample( letters[1:10], 54000, replace=TRUE),
                round(rnorm(54000,mean=25, sd=3)),
                round(runif(n = 54000,min = 1000,max = 25000)),
                round(runif(54000,0,200000)))
colnames(X) <- c("A","B","C","D","E","F","G") 

If I use the hclust function like this:

hclust(dist(X), method = "ward.D")

I get this error message:

Error: cannot allocate vector of size 10.9 Gb

What is the problem? dist() is trying to build the pairwise distances for a 54k × 54k matrix, which is too big for my PC (4 GB of RAM). I've read that since R 3.0.0, 64-bit builds of R can handle objects this large (54,000 × 54,000 = 2.916e+09 cells, as in my example), so the limitation comes from my computer. I've tried hclust from stats, fastcluster and flashClust and get the same problem every time.

In these packages, the functions are documented like this:
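The size in the error message can be checked by hand. The sketch below assumes that dist() stores only the lower triangle of the distance matrix as 8-byte doubles; the variable names are illustrative.

```r
n <- 54000

# dist() keeps one value per unordered pair of rows (lower triangle only)
n_pairs <- n * (n - 1) / 2

# each distance is stored as a double (8 bytes)
bytes <- n_pairs * 8

# convert to the binary gigabytes that R reports in its error message
gib <- bytes / 2^30
round(gib, 1)
```

The result is about 10.9, which matches the allocation R refuses to make and is far more than 4 GB of RAM.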

hclust(d, method="complete", members=NULL)
flashClust(d, method = "complete", members=NULL)

d   a dissimilarity structure as produced by dist.

These functions always need a dist object as input. I've also tried raising the memory limit of my R session with this:

memory.limit(size = 4014)
memory.size(max = TRUE)

Question:

Is it possible to run a hierarchical clustering (or a similar clustering method) without building this dist() matrix, for a mixed quantitative/qualitative dataset, in R?

Edit:

About k-means:

k-means works great for a big dataset composed of numerical values. In my example, I have both numeric and alphanumeric values. I've tried to transform my qualitative variables into binary numerical variables in order to run k-means:

First data frame (example):

  Col1     Col2  Col3
1   12 43.93145 Alpha
2   45 44.76081  Beta
3   48 45.09708 Gamma
4   31 45.42278 Alpha
5   12 46.53709 Delta
6    7 39.07841  Beta
7   78 49.60947 Alpha

If I transform this into binary variables, I get this:

  Col1     Col2 Alpha Beta Gamma Delta
1   12 44.29369     1    0     0     0
2   45 43.90610     0    1     0     0
3   48 44.82659     0    0     1     0
4   31 43.09096     1    0     0     0
5   12 42.71190     0    0     0     1
6    7 43.71710     0    1     0     0
7   78 42.24293     1    0     0     0
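For reference, a minimal sketch of that binary transformation in base R, using model.matrix() on the first example data frame. The object names df, dummies and binary are illustrative, not taken from my real code.

```r
df <- data.frame(
  Col1 = c(12, 45, 48, 31, 12, 7, 78),
  Col2 = c(43.93145, 44.76081, 45.09708, 45.42278,
           46.53709, 39.07841, 49.60947),
  Col3 = factor(c("Alpha", "Beta", "Gamma", "Alpha", "Delta", "Beta", "Alpha"),
                levels = c("Alpha", "Beta", "Gamma", "Delta"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Col3 - 1, data = df)
colnames(dummies) <- levels(df$Col3)

# numeric columns followed by one indicator column per level
binary <- cbind(df[c("Col1", "Col2")], dummies)
binary
```

Each level of Col3 becomes one column, which is why the width of the table grows with the number of levels.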

That's fine with only a few levels (modalities), but a real dataset could have about 10,000 levels for a 50k-row table, which means about 10,000 extra columns. I don't think k-means is the solution for this type of problem.

ARandomUser
  • I believe in this case your only option, AFAIK, is to use `kmeans`, either directly or inside the `FactoMineR::HCPC` function as answered [here](http://stackoverflow.com/questions/27269555/r-issue-with-a-hierarchical-clustering-after-a-multiple-correspondence-analysis) – cdeterman Jul 06 '16 at 15:34
  • @cdeterman: Thanks, I appreciate your help. I tried 2 new models with k-means, but it doesn't fit my problem well. – ARandomUser Jul 07 '16 at 14:54

1 Answer

From reading your question, it seems there are 2 problems:

1. You have a fairly large number of observations for clustering
2. The categorical variables have high cardinality

My advice:

1) You can just take a sample and use fastcluster::hclust, or use clara (from the cluster package). Once 2) is sorted out you can probably use more observations, but in any case it is usually fine to work on a sample. Try to take a sample stratified by the categories.

2) You basically need to represent these categories in a numeric format without adding 10,000 more columns. You could use PCA or a discrete version of it. A few questions deal with this problem: q1, q2
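A minimal sketch of both suggestions on the example data from the question, using base R only. The 5% sampling fraction, the 5 retained components and k = 4 are arbitrary assumptions for illustration. Plain PCA on one-hot columns stands in here for the "discrete version", and stats::hclust is used so the code runs without extra packages (fastcluster::hclust is meant as a drop-in replacement).

```r
set.seed(42)
n <- 54000
X <- data.frame(
  A = rnorm(n, sd = 0.3),
  B = rnorm(n, mean = 1, sd = 0.3),
  C = factor(sample(LETTERS[1:24], n, replace = TRUE)),
  D = factor(sample(letters[1:10], n, replace = TRUE)),
  E = round(rnorm(n, mean = 25, sd = 3)),
  F = round(runif(n, min = 1000, max = 25000)),
  G = round(runif(n, 0, 200000))
)

# 1) Stratified sample: keep about 5% of the rows within each level of C
idx <- unlist(lapply(split(seq_len(n), X$C), function(rows) {
  rows[sample.int(length(rows), ceiling(0.05 * length(rows)))]
}))
S <- X[idx, ]

# 2) Numeric representation: one-hot encode the factors, then compress
#    the encoded matrix to a handful of principal components
Z <- model.matrix(~ . - 1, data = S)
pca <- prcomp(Z, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:5]

# The distance matrix is now built on a few thousand rows, not 54,000
hc <- hclust(dist(scores), method = "ward.D")
groups <- cutree(hc, k = 4)
table(groups)
```

With a few thousand sampled rows the dist object takes tens of megabytes rather than gigabytes. The rows left out of the sample still need to be assigned to a cluster afterwards, for example by nearest cluster centre in the component space.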

marbel
  • Thanks for your time, I appreciate it. I'm going to try something using your advice and I'll post my code later as an answer. – ARandomUser Jul 22 '16 at 09:09