I want to perform clustering on geographical data to classify types of landscape within my region.
My data consist of five variables (temperature, amplitude of temperature, precipitation, altitude, and soil type) for each field of a regular grid. I have a bit more than 1 million fields (= rows in the data frame).
Four of the variables are numeric; soil type is a categorical variable encoded as a number. (The numeric data are already standardized.) I decided to compute a Gower dissimilarity matrix and then perform PCA and hierarchical clustering on that matrix. However, the data are too big.
SOIL PREC TEMP ALT AMP
0 6 1.000 1.146 0.157 -0.579
1 6 0.948 1.224 0.154 -0.579
2 5 1.000 1.146 0.201 -0.662
3 6 1.078 1.093 0.177 -0.620
4 6 1.000 1.146 0.182 -0.620
5 6 1.000 1.146 0.186 -0.599
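For reference, here is a minimal hand-rolled sketch of the Gower dissimilarity on the sample rows above, written in Python/NumPy. This is an illustrative implementation of the metric (range-scaled absolute differences for numeric variables, 0/1 mismatch for the categorical one, averaged over variables), not the `gower.dist` function itself:

```python
import numpy as np

# Sample rows from the table above: SOIL (categorical), then PREC, TEMP, ALT, AMP (numeric)
soil = np.array([6, 6, 5, 6, 6, 6])
num = np.array([
    [1.000, 1.146, 0.157, -0.579],
    [0.948, 1.224, 0.154, -0.579],
    [1.000, 1.146, 0.201, -0.662],
    [1.078, 1.093, 0.177, -0.620],
    [1.000, 1.146, 0.182, -0.620],
    [1.000, 1.146, 0.186, -0.599],
])

def gower_matrix(num, cat):
    """Pairwise Gower dissimilarity: numeric parts are range-normalised
    absolute differences, the categorical part is a 0/1 mismatch, and the
    final distance is the mean over all variables."""
    n, p = num.shape
    rng = num.max(axis=0) - num.min(axis=0)
    rng[rng == 0] = 1.0  # avoid division by zero for constant columns
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num_part = np.abs(num[i] - num[j]) / rng
            cat_part = float(cat[i] != cat[j])
            d[i, j] = d[j, i] = (num_part.sum() + cat_part) / (p + 1)
    return d

D = gower_matrix(num, soil)
print(np.round(D, 3))
```

Note that the ranges used for normalisation come from the data passed in, which is exactly why computing the metric chunk-by-chunk is delicate: each chunk sees different ranges unless they are fixed globally.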
I don't want to sample, because the variables vary along gradients. I tried counting frequencies and computing the Gower distance on the smaller data, but it is still too big.
I think I might:
1. manually chunk the big dataset,
2. add to each chunk two extra rows holding the max and min values of each variable, as a "description" of each variable's extent for the distance analysis,
3. compute the dissimilarity matrix for each chunk with the gower.dist function,
4. remove the extra rows, and
5. merge all of the chunk dissimilarity matrices into one big dissimilarity matrix.
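The steps above could be sketched like this (a hypothetical Python/NumPy outline; the `chunk_size`, the random stand-in data, and the `gower` helper are all assumptions for illustration). One caveat visible in the sketch: stacking per-chunk matrices yields only within-chunk distances, since pairs of rows in different chunks are never compared:

```python
import numpy as np

def gower(num, cat):
    """Gower dissimilarity for one chunk (numeric: range-scaled |diff|,
    categorical: 0/1 mismatch, averaged over all variables)."""
    n, p = num.shape
    rng = num.max(axis=0) - num.min(axis=0)
    rng[rng == 0] = 1.0
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = np.abs(num[i] - num[j]) / rng
            d[i, j] = d[j, i] = (s.sum() + float(cat[i] != cat[j])) / (p + 1)
    return d

gen = np.random.default_rng(0)
num = gen.normal(size=(100, 4))         # stand-in for PREC, TEMP, ALT, AMP
cat = gen.integers(1, 7, size=100)      # stand-in for SOIL
chunk_size = 25                          # assumption for illustration

# Step (2): global min/max rows fix the same numeric ranges for every chunk.
gmin, gmax = num.min(axis=0), num.max(axis=0)
blocks = []
for start in range(0, len(num), chunk_size):   # step (1): chunk the data
    cnum = num[start:start + chunk_size]
    ccat = cat[start:start + chunk_size]
    padded_num = np.vstack([cnum, gmin, gmax])
    # categorical levels have no range, so the padding value is irrelevant
    padded_cat = np.concatenate([ccat, [ccat[0], ccat[0]]])
    d = gower(padded_num, padded_cat)          # step (3)
    blocks.append(d[:-2, :-2])                 # step (4): drop the extra rows

# Step (5): the "merge" can only fill the diagonal blocks;
# cross-chunk entries remain unknown (NaN here).
n = len(num)
D = np.full((n, n), np.nan)
pos = 0
for b in blocks:
    k = b.shape[0]
    D[pos:pos + k, pos:pos + k] = b
    pos += k
```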
Do you consider this a correct and workable approach? Do you have any other suggestions for how to deal with this issue?
Is it correct to perform PCA on a dissimilarity matrix?