Some colleagues and I developed a hierarchical clustering algorithm to find the main clusters of agricultural industries around a particular city (e.g. London). We wrote the algorithm in R and it works as intended. Using the filters built into the algorithm, we generated 6 clustering scenarios for London. For example, the first scenario produced 2 clusters, the second produced 5 clusters, and so on. I would like some help choosing the most appropriate scenario. I saw that some packages help with this, such as pvclust, but I could not get it to work for my case. Below is a short executable example that shows the essence of what I want.

Any help is welcome! If you know how to do this with another package, feel free to describe it.

Best Regards.

library(rdist)
library(geosphere)
library(fpc)

df <- structure(list(Industries = c(1, 2, 3, 4, 5, 6),
                     Latitude = c(-23.8, -23.8, -23.9, -23.7, -23.7, -23.7),
                     Longitude = c(-49.5, -49.6, -49.7, -49.8, -49.6, -49.9),
                     Waste = c(526, 350, 526, 469, 534, 346)),
                class = "data.frame", row.names = c(NA, -6L))

df1 <- df

# clusters: distm() expects the columns in (longitude, latitude) order
coordinates <- df[c("Latitude", "Longitude")]
d <- as.dist(distm(coordinates[, 2:1]))
fit.average <- hclust(d, method = "average")

# first scenario: 2 clusters
clusters <- cutree(fit.average, k = 2)
df$cluster <- clusters
df
#   Industries Latitude Longitude Waste cluster
# 1          1    -23.8     -49.5   526       1
# 2          2    -23.8     -49.6   350       1
# 3          3    -23.9     -49.7   526       1
# 4          4    -23.7     -49.8   469       2
# 5          5    -23.7     -49.6   534       1
# 6          6    -23.7     -49.9   346       2

# second scenario: 5 clusters
clusters1 <- cutree(fit.average, k = 5)
df1$cluster <- clusters1
df1
#   Industries Latitude Longitude Waste cluster
# 1          1    -23.8     -49.5   526       1
# 2          2    -23.8     -49.6   350       1
# 3          3    -23.9     -49.7   526       2
# 4          4    -23.7     -49.8   469       3
# 5          5    -23.7     -49.6   534       4
# 6          6    -23.7     -49.9   346       5
Antonio

Look at the [Cluster Analysis Task View](https://cran.r-project.org/web/views/Cluster.html), particularly the section "Additional Functionality". The package `clValid` may have what you want. – dcarlson Dec 12 '20 at 22:41

1 Answer

Maybe try something like this (note that I'm not sure of this approach's mathematical rigour):

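library(tidyverse)
library(geosphere)

# Sort the sites, measure the gap between consecutive sites, and start a
# new cluster whenever that gap is larger than the median gap.
clustered_df <-
  df %>%
  arrange(Latitude, Longitude) %>%
  mutate(
    # geosphere expects (longitude, latitude); given a single matrix it
    # returns the distances between consecutive rows
    dist_diff = c(0, geosphere::distVincentyEllipsoid(cbind(Longitude, Latitude))),
    separate_clust = dist_diff > median(dist_diff[-1]),
    cluster_no = 1 + cumsum(separate_clust)
  ) %>%
  select(Industries, Longitude, Latitude, Waste, cluster_no)

library(leaflet)

# Plot the sites on a map, labelled with their cluster number
leaflet(clustered_df) %>%
  addTiles() %>%
  addAwesomeMarkers(lat = ~Latitude, lng = ~Longitude, label = ~as.character(cluster_no))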
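
If the goal is to compare the cut levels of the `hclust` tree you already have, one common option is the average silhouette width. The following is only a minimal sketch, assuming the `cluster` package is installed and reusing the question's data and distance matrix; the names `candidate_k`, `avg_sil` and `best_k` are mine, and the silhouette is just one of several possible criteria. With only 6 points the result is not very informative, so treat it as an illustration of the mechanics.

```r
library(geosphere)  # distm(), as in the question
library(cluster)    # silhouette()

# Same example data as in the question
df <- data.frame(
  Industries = 1:6,
  Latitude   = c(-23.8, -23.8, -23.9, -23.7, -23.7, -23.7),
  Longitude  = c(-49.5, -49.6, -49.7, -49.8, -49.6, -49.9),
  Waste      = c(526, 350, 526, 469, 534, 346)
)

# Geographic distance matrix; distm() expects (longitude, latitude)
d <- as.dist(distm(df[, c("Longitude", "Latitude")]))
fit.average <- hclust(d, method = "average")

# Hypothetical candidate scenarios: the silhouette needs 2 <= k <= n - 1
candidate_k <- 2:5

# Average silhouette width for each cut of the same tree
avg_sil <- sapply(candidate_k, function(k) {
  cl <- cutree(fit.average, k = k)
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k

# The cut with the largest average silhouette width
best_k <- candidate_k[which.max(avg_sil)]

print(avg_sil)
print(best_k)
```

To compare your 6 real scenarios, replace `candidate_k` with the numbers of clusters they produce. Values closer to 1 suggest compact, well-separated clusters, while values near 0 or below suggest overlapping ones.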
hello_friend