
I have a huge data set (200,000 rows * 40 columns) where each row represents an observation and each column a variable. I would like to do hierarchical clustering on this data. Unfortunately, with this many rows it is impossible on my computer, since computing the distance matrix for all pairs of observations would require a 200,000 * 200,000 matrix.

The answer to this question suggests first using kmeans to compute a number of centers, then performing hierarchical clustering on the coordinates of these centers with the FactoMineR library.

The problem: I keep getting an error when applying the same method!

# example

library(FactoMineR)

# Data: two Gaussian blobs
MyData <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
                matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(MyData) <- c("x", "y")

# Reduce the rows to 1000 kmeans centers, then cluster the centers
kClust_MyData <- kmeans(MyData, 1000, iter.max = 20)
Hclust_MyData <- HCPC(kClust_MyData$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(Hclust_MyData, choice = "tree")

But

Error in catdes(data.clust, ncol(data.clust), proba = proba, row.w = res.sauv$call$row.w.init) : 
  object 'data.clust' not found
  • There is no need to use a third-party library for hierarchical clustering: there is the built-in function *hclust*. If you need a particularly efficient implementation, you can look at the package *fastcluster*. Have you tried either of these? – cdalitz Oct 21 '20 at 12:39
  • @cdalitz Honestly, no! I didn't try the functions of the `fastcluster` library. I understood from other posts that using hierarchical clustering directly on my `200,000 * 40` data set is not feasible because of the distance matrix, so I am looking for alternatives or indirect ways to apply hierarchical clustering. Do you have any recommendations, please? – Sophie Allan Oct 21 '20 at 13:25
  • For the size problem, you already have the workaround of applying *kmeans* first. The error message, however, comes from the *HCPC* function. I thus wondered why you used it for hierarchical clustering and did not use *hclust* (either built-in or from *fastcluster*). – cdalitz Oct 21 '20 at 13:43
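The kmeans-first workaround discussed in the comments can also be done with the built-in hclust instead of HCPC. A minimal sketch (with illustrative sizes rather than the full 200,000-row data; the number of centers and the linkage method are assumptions you would tune):

```r
# Cluster the raw data into many small kmeans groups, then run
# hierarchical clustering on the much smaller set of centers.
set.seed(42)
X <- rbind(matrix(rnorm(2000, sd = 0.3), ncol = 2),
           matrix(rnorm(2000, mean = 1, sd = 0.3), ncol = 2))

km <- kmeans(X, centers = 50, iter.max = 20)
hc <- hclust(dist(km$centers), method = "ward.D2")

# Cut the dendrogram into e.g. 2 clusters and map each original row
# to the cluster of its kmeans center.
center_cluster <- cutree(hc, k = 2)
row_cluster <- center_cluster[km$cluster]
table(row_cluster)
```

This keeps the distance matrix at 50 * 50 instead of 2000 * 2000, which is the whole point of the two-stage approach.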

1 Answer


The package fastcluster has a function hclust.vector that does not require a distance matrix as input; instead, it computes the distances itself in a more memory-efficient way. From the fastcluster manual:

The call
    hclust.vector(X, method='single', metric=[...])
is equivalent to
    hclust(dist(X, metric=[...]), method='single')
but uses less memory and is equally fast.
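A short sketch of how this looks in practice, assuming the fastcluster package is installed (the data and the choice of k = 2 are illustrative):

```r
# Memory-efficient single-linkage clustering with fastcluster:
# hclust.vector computes distances internally, so the full
# n-by-n distance matrix is never materialized.
library(fastcluster)

set.seed(1)
X <- rbind(matrix(rnorm(2000, sd = 0.3), ncol = 2),
           matrix(rnorm(2000, mean = 1, sd = 0.3), ncol = 2))

hc <- fastcluster::hclust.vector(X, method = "single", metric = "euclidean")

# The result is an ordinary hclust object, so cutree works as usual
clusters <- cutree(hc, k = 2)
table(clusters)
```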

cdalitz