
I would like to add new coordinates, taken from another matrix, to my scatterplot. I am using the fviz_cluster function to plot the clusters. I would like to add the coordinates stored in the matrix called center_mass (the centers of mass) to the graph, since they are the best location within each cluster for installing a manure composting machine. So far I can only generate the scatterplot for the properties, as attached. The code is below:

> library(readxl)
> df <- read_excel('C:/Users/testbase.xlsx') #matrix containing waste production, latitude and longitude
> dim (df)
[1] 19  3
> d<-dist(df)
> fit.average<-hclust(d,method="average") 
> clusters<-cutree(fit.average, k=6) 
> df$cluster <- clusters # inserting column with determination of clusters
> df
    Latitude    Longitude  Waste   cluster
     <dbl>       <dbl>     <dbl>     <int>
 1    -23.8     -49.6      526.        1
 2    -23.8     -49.6      350.        2
 3    -23.9     -49.6      526.        1
 4    -23.9     -49.6      469.        3
 5    -23.9     -49.6      285.        4
 6    -23.9     -49.6      175.        5
 7    -23.9     -49.6      175.        5
 8    -23.9     -49.6      350.        2
 9    -23.9     -49.6      350.        2
10    -23.9     -49.6      175.        5
11    -23.9     -49.7      350.        2
12    -23.9     -49.7      175.        5
13    -23.9     -49.7      175.        5
14    -23.9     -49.7      364.        2
15    -23.9     -49.7      175.        5
16    -23.9     -49.6      175.        5
17    -23.9     -49.6      350.        2
18    -23.9     -49.6      45.5        6
19    -23.9     -49.6      54.6        6

> ########Generate scatterplot
> library(factoextra)
> fviz_cluster(list(data = df, cluster = clusters))
> 
> 
>  ##Center of mass, best location of each cluster for installation of manure composting machine
> center_mass<-matrix(nrow=6,ncol=2)
> for(i in 1:6){
+ center_mass[i,]<-c(weighted.mean(subset(df,cluster==i)$Latitude,subset(df,cluster==i)$Waste),
+ weighted.mean(subset(df,cluster==i)$Longitude,subset(df,cluster==i)$Waste))}
> center_mass<-cbind(center_mass,matrix(c(1:6),ncol=1)) #including the index of the clusters
> head (center_mass)
          [,1]      [,2] [,3]
[1,] -23.85075 -49.61419    1
[2,] -23.86098 -49.64558    2
[3,] -23.86075 -49.61350    3
[4,] -23.86658 -49.61991    4
[5,] -23.86757 -49.63968    5
[6,] -23.89749 -49.62372    6

[Figure: fviz_cluster() scatterplot of the properties]

[Figure: New scatterplot]

Scatterplot considering Longitude and Latitude

vars = c("Longitude", "Latitude")

gg <- fviz_cluster(list(data = df, cluster = df$cluster), choose.vars = vars)

gg


StupidWolf
  • Thanks for the edit, Roman Luštrik and Tjebo. Could you give me any ideas for my problem above? –  Mar 23 '20 at 01:44
  • It's not quite clear to me what exactly you want to achieve. Also, your problem is not reproducible. Please try to make it reproducible (see how here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example or https://www.r-bloggers.com/three-tips-for-posting-good-questions-to-r-help-and-stack-overflow/). Ideally, don't post the output of your data, but create pertinent sample data, and show the output you would expect. This will make it much more likely that you get help. – tjebo Mar 23 '20 at 11:58
  • Have a look at the reprex package. Tip: use RStudio instead of the R GUI. Install the reprex package and it will be integrated in RStudio. Then create a reprex from your code and you will get nice reproducible code. – tjebo Mar 23 '20 at 12:00

2 Answers


Since the fviz_cluster() function returns a ggplot object you should be able to add new points to the plot as you do with ggplot().

Here is an example using mock data, where I only use functions from the ggplot2 package (since I don't have the factoextra package installed).

library(ggplot2)

# Dataset with all the points (it's your df data frame)
df <- data.frame(x=1:10, y=1:10)

# Dataset with two "center" points to add to the df points (it's your center_mass matrix)
dc <- data.frame(x=c(2.5, 7.5), y=c(2.5, 7.5))

# ggplot with the initial plot of the df points (it mimics the result from fviz_cluster())
# Note that the plot is not yet shown, it's simply stored in the gg variable
gg <- ggplot() + geom_point(data=df, mapping=aes(x,y))

# Create the plot by adding the center points to the above ggplot as larger red points
gg + geom_point(data=dc, mapping=aes(x,y), color="red", size=3)

which produces:

[Figure: the df points with the two center points drawn as larger red points]

In your case you should:

  1. Replace the line:
    fviz_cluster(list(data = df, cluster = clusters))
    with:
    gg <- fviz_cluster(list(data = df, cluster = clusters))
  2. Convert the center_mass matrix to a data frame (simply with as.data.frame(center_mass)) before passing it to the geom_point() call in the last line of my example above, and assign appropriate column names with the colnames() function, so that you can refer to them in the mapping option of geom_point().
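
Step 2 could look roughly like the sketch below. It only prepares the data frame, using the center_mass values printed in the question. The column names and the name dc are my own choice, and the commented-out geom_point() call assumes that the axes of gg are the raw Longitude and Latitude, which may not hold for the plot returned by fviz_cluster().

```r
# Centers of mass as printed in the question (Latitude, Longitude, cluster index)
center_mass <- matrix(c(-23.85075, -49.61419, 1,
                        -23.86098, -49.64558, 2,
                        -23.86075, -49.61350, 3,
                        -23.86658, -49.61991, 4,
                        -23.86757, -49.63968, 5,
                        -23.89749, -49.62372, 6),
                      ncol = 3, byrow = TRUE)

# Convert to a data frame and name the columns so that aes() can refer to them
# (the names below are an assumption, pick whatever suits your data)
dc <- as.data.frame(center_mass)
colnames(dc) <- c("Latitude", "Longitude", "cluster")
print(dc)

# Assumption: gg is a ggplot whose axes are raw Longitude (x) and Latitude (y)
# gg + geom_point(data = dc, mapping = aes(x = Longitude, y = Latitude),
#                 color = "red", size = 3)
```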

Let me know if this works for you!

mastropi
  • Thank you very much for your help, Mastropi. I managed to generate the map with the centers of mass as in your example, but unfortunately not with the fviz_cluster function. Still, it was very valuable help. One more question: is it possible to add the information from the Waste variable (df database) to this graph? I would like to show the Waste value for each point. Thank you again! –  Mar 26 '20 at 03:25
  • Actually using the `fviz_cluster()` function works as well. I have just tried it. The problem is that `fviz_cluster()` plots the points on the scale of the first 2 principal components of the data. So, you cannot simply add the points in your `center_mass` matrix to that plot... you need to first scale its entries to the principal component axes. In addition, I found a mistake in your code that is giving you the wrong plot of the clusters. Will post my full answer tomorrow, as I need to clean it up. – mastropi Mar 27 '20 at 00:14
  • Thanks for answering. I am anxious to know what the error is. I have added above the graph that I managed to make following your suggestions. Sorry to ask two more questions, in case you know: (a) How can I show the Waste value (df database) for each point in the graph? (b) Can I draw a circle for each cluster, with its center of mass as the center of the circle, around the points of that cluster? Thank you again! –  Mar 27 '20 at 00:40
  • I just added a new answer with the solution to your questions on (i) adding the weighted centers to the `fviz_cluster()` plot, (ii) adding the `Waste` values as point labels. I leave your last request about the circle up to you... it might not be so easy, but you might want to look at the `ellipse` package. And bear in mind that you first need to transform any data you want to add to the plot to the coordinates defined by the first two principal components! – mastropi Mar 27 '20 at 20:34
  • Thanks! I appreciate your help. I would be very happy to help you with any other questions. However, I think I forgot to mention that for the generation of clusters I would like to consider only Longitude and Latitude, as in the last graph I sent you yesterday. You can see that x-axis is longitude and y-axis latitude. I made a new graph according to their formulas and adjusted (above for visualization). From this graph, I intend to insert the centers of mass and later the information referring to the Waste variable in each point of this graph. Would your code change a lot? –  Mar 28 '20 at 02:24
  • Hello, mastropi! Do you know how to work with Shiny? If yes, Can you check out the following question (https://stackoverflow.com/questions/61298886/issues-related-to-shiny-from-rstudio/61303161?noredirect=1#comment108451332_61303161)? Basically I want to insert a graph and a table that I made in my Rscript in my shiny code. Thank you very much! –  Apr 19 '20 at 13:59
  • Hi Jovani, I work with Shiny once in a while but not very often (although incidentally, I will start working on a new Shiny app soon). From the answers to the post you referred to here, I presume you have received a reasonable answer to your problem? – mastropi Apr 20 '20 at 08:52
  • Thanks for the quick response. It worked, I commented with you here, because I realized that you have a great experience in R. But thanks anyway! –  Apr 20 '20 at 11:44
  • Mastropi, please, could you take a look at two questions asked by my brother Jose: https://stackoverflow.com/questions/61591674/general-function-to-insert-the-colors-of-the-clusters-in-my-map-made-by-the-leaf and https://stackoverflow.com/questions/61595335/find-the-shortest-path-between-points-on-a-map-made-by-the-leaflet-package We are working together. Thank you very much friend. –  May 04 '20 at 16:18

This answer shows the solution using the fviz_cluster() function of the factoextra package, instead of the mock example included in my previous answer.

Starting off from the data frame posted by the OP that already includes the clusters found by hclust() and cutree():

df <- structure(list(Latitude = c(-23.8, -23.8, -23.9, -23.9, -23.9, 
-23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, -23.9, 
-23.9, -23.9, -23.9, -23.9, -23.9), Longitude = c(-49.6, -49.6, 
-49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.6, -49.7, 
-49.7, -49.7, -49.7, -49.7, -49.6, -49.6, -49.6, -49.6), Waste = c(526, 
350, 526, 469, 285, 175, 175, 350, 350, 175, 350, 175, 175, 364, 
175, 175, 350, 45.5, 54.6), cluster = c(1L, 2L, 1L, 3L, 4L, 5L, 
5L, 2L, 2L, 5L, 2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)), class = "data.frame",
row.names = c(NA, -19L))

we start by generating the plot of the clusters using fviz_cluster():

library(factoextra)

# Analysis variables (used when computing the clusters)
vars = c("Latitude", "Longitude", "Waste")

# Initial plot showing the clusters on the first 2 PCs
gg <- fviz_cluster(list(data = df, cluster = df$cluster), choose.vars=vars)
gg

which gives:

Plot of clusters

Note that this plot is different from the one shown by the OP. The reason is that fviz_cluster() uses all the variables in the input data frame to generate the plot, so the OP's code causes the cluster variable present in df to be included in the computation of the principal components on which the plot is based. Passing choose.vars=vars restricts the plot to the analysis variables. (This conclusion was reached by looking at the source code of fviz_cluster() and running it in debug mode.)
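
The effect of the extra column can be checked without factoextra. The sketch below is only an illustration with base R's prcomp(), not the code that fviz_cluster() runs internally: it compares the PCA on the three analysis variables with the PCA on all four columns of df (the names pca_vars, pca_all, prop_vars and prop_all are mine).

```r
# Same data as the OP's data frame, written compactly
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1L, 2L, 1L, 3L, 4L, 5L, 5L, 2L, 2L, 5L,
                2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)
)
vars <- c("Latitude", "Longitude", "Waste")

# PCA on the three analysis variables only
pca_vars <- prcomp(scale(df[, vars]), center = FALSE, scale. = FALSE)

# PCA on every column, i.e. with the cluster label treated as a variable
pca_all <- prcomp(scale(df), center = FALSE, scale. = FALSE)

# Proportion of variance explained by each component in both cases
prop_vars <- pca_vars$sdev^2 / sum(pca_vars$sdev^2)
prop_all  <- pca_all$sdev^2 / sum(pca_all$sdev^2)
print(round(prop_vars, 4))  # 3 components
print(round(prop_all, 4))   # 4 components: the cluster label adds one
```

Since the second PCA has one more variable, its axes (and therefore the positions of the points on the first two PCs) are not the same as in the first one.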

Now we compute the Waste-weighted center of each cluster as well as the per-cluster average of Waste (needed below to add the centers to the plot):
(note that the code is now generalized to any number of clusters found)

# Number of clusters found
n_clusters = length( unique(df$cluster) )

# Waste-weighted cluster centers
center_mass <- matrix(nrow=n_clusters, ncol=2, dimnames=list(NULL, c("Latitude", "Longitude")))
for(i in 1:n_clusters) {
  center_mass[i,] <- c(weighted.mean(subset(df,cluster==i)$Latitude,subset(df,cluster==i)$Waste),
                       weighted.mean(subset(df,cluster==i)$Longitude,subset(df,cluster==i)$Waste))
}

# We now compute the average Waste by cluster since,
# in order to add the centers to the fviz_cluster() plot
# we need the information for all three variables used
# in the clustering analysis and generation of the plot
center_mass_with_waste = cbind(center_mass, aggregate(Waste ~ cluster, mean, data=df))
head(center_mass_with_waste)

which gives:

   Latitude Longitude cluster    Waste
1 -23.85000 -49.60000       1 526.0000
2 -23.88344 -49.63377       2 352.3333
3 -23.90000 -49.60000       3 469.0000
4 -23.90000 -49.60000       4 285.0000
5 -23.90000 -49.64286       5 175.0000
6 -23.90000 -49.60000       6  50.0500

NOW the most interesting part starts: adding the weighted centers to the plot. Since the plot is done on the principal component axes, we need to compute the principal component coordinates for the centers.

This is achieved by running the principal component analysis (PCA) on the full data and applying the PCA axis rotation to the coordinates of the centers. There are at least two functions in the stats package of R that can be used to run PCA: prcomp() and princomp(). The preferred method is prcomp() (as it uses Singular Value Decomposition to perform the eigenanalysis and uses the usual N-1 as divisor for the variance as opposed to N which is used by princomp()). In addition prcomp() is the function used by fviz_cluster().

Therefore:

# We first scale the analysis data as we will need the center and scale information
# to properly center and scale the weighted centers for plotting
# Note that proper PCA is always done on centered and scaled data
# in order to accommodate different variable scales and make variables comparable.
# in addition, this is what is done inside fviz_cluster().
X <- scale( df[,vars] )

# We run PCA on the scaled data
summary( pca <- prcomp(X, center=FALSE, scale=FALSE) )

which gives:

Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.2263 0.9509 0.7695
Proportion of Variance 0.5012 0.3014 0.1974
Cumulative Proportion  0.5012 0.8026 1.0000

Observe that the proportions of variance explained by the first 2 PCs coincide with those shown in the initial plot of the clusters, namely 50.1% and 30.1%, respectively.
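
As a small worked check, the proportions can be recovered from the standard deviations printed in the summary above: each proportion is the squared standard deviation of the component divided by the sum of the squared standard deviations. The numbers below are copied (rounded) from that output, so the result is only accurate to about three decimals.

```r
# Standard deviations copied from the summary() output above
sdev <- c(PC1 = 1.2263, PC2 = 0.9509, PC3 = 0.7695)

# Proportion of variance = variance of each PC / total variance
prop_var <- sdev^2 / sum(sdev^2)
cum_prop <- cumsum(prop_var)

print(round(prop_var, 3))  # roughly 50.1%, 30.1% and 19.7%
print(round(cum_prop, 3))
```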

We now center and scale the weighted centers, using the same center and scaling operation performed on the full data (this is needed for plotting):

# We center and scale the weighted centers
# (based on the information stored in the attributes of X)
center_mass_with_waste_scaled = scale(center_mass_with_waste[, vars],
                                      center=attr(X, "scaled:center"),
                                      scale=attr(X, "scaled:scale"))

# We compute the PC coordinates for the centers
# (using the scaled centers, since the PCA was run on the scaled data X)
center_mass_with_waste_pcs = predict(pca, center_mass_with_waste_scaled)
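
To see why the centering and scaling step matters, here is a minimal sketch on mock data (the data frame dat, the points new_pts and all other names are made up for illustration). Because the PCA is fitted on already scaled data with center=FALSE and scale=FALSE, predict() only applies the rotation, so new points must be scaled with the same center and scale before projecting them.

```r
set.seed(1)
# Mock data with three variables on very different scales
dat <- data.frame(a = rnorm(20), b = rnorm(20, 5, 2), c = rnorm(20, 100, 30))

# Scale first, then run PCA without further centering/scaling
X <- scale(dat)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# New points on the original scale of the variables
new_pts <- data.frame(a = c(0, 1), b = c(5, 7), c = c(100, 130))

# Apply the same centering and scaling that was applied to the data
new_scaled <- scale(new_pts,
                    center = attr(X, "scaled:center"),
                    scale = attr(X, "scaled:scale"))

# Project onto the PC axes; equivalent to multiplying by the rotation matrix
pcs <- predict(pca, new_scaled)
manual <- new_scaled %*% pca$rotation

# Skipping the scaling step gives different (wrong) coordinates
unscaled <- predict(pca, as.matrix(new_pts))

# Sanity check: the vector of column means must land on the origin
center_pt <- t(colMeans(dat))
origin <- predict(pca, scale(center_pt,
                             center = attr(X, "scaled:center"),
                             scale = attr(X, "scaled:scale")))
print(pcs)
print(round(origin, 10))
```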

Finally, we add the Waste-weighted centers to the plot (as red filled points) and the Waste values as labels. Here we distinguish between the cases where the number of analyzed variables (nvars) is 2 or greater than 2, since fviz_cluster() only performs PCA when nvars > 2; when nvars = 2 it just scales the variables.

# And finally we add the points to the plot (as red filled points)
# distinguishing two cases, because fviz_cluster() does different things
# in each case (i.e. no PCA when nvars = 2, just scaling)
if (length(vars) > 2) {
  # fviz_cluster() performs PCA and plots the first 2 PCs
  # => use PC coordinates for the centers
  gg + geom_point(data=as.data.frame(center_mass_with_waste_pcs),
                  mapping=aes(x=PC1, y=PC2),
                  color="red", size=3) +
       geom_text(data=as.data.frame(pca$x),
                 mapping=aes(x=PC1, y=PC2, label=df$Waste),
                 size=2, hjust=-0.5)
} else {
  # fviz_cluster() does NOT perform PCA; it simply plots the standardized variables
  # => use standardized coordinates for the centers

  # Get the names of the analysis variables as expressions (used in aes() below)
  vars_expr = parse(text=vars)
  gg + geom_point(data=as.data.frame(center_mass_with_waste_scaled),
                  mapping=aes(x=eval(vars_expr[1]), y=eval(vars_expr[2])),
                  color="red", size=3) +
       geom_text(data=as.data.frame(X),
                 mapping=aes(x=eval(vars_expr[1]), y=eval(vars_expr[2]), label=df$Waste),
                 size=2, hjust=-0.5)
}

which gives (when nvars = 3):

Plot with centers and Waste labels

Note, however, that the red points essentially coincide with the original cluster centers computed by fviz_cluster(). This is because the Waste-weighted averages of Latitude and Longitude are almost the same as their respective unweighted averages. In fact, the only center that differs slightly between the two calculation methods is that of cluster 2, as can be seen by comparing the weighted and unweighted averages per cluster.
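
For reference, the comparison between weighted and unweighted averages could be sketched as follows. This uses split() and sapply() instead of the loop shown earlier, which is just a stylistic choice of mine; the names by_cluster, weighted, unweighted and diffs are illustrative.

```r
# Same data as the OP's data frame, written compactly
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1L, 2L, 1L, 3L, 4L, 5L, 5L, 2L, 2L, 5L,
                2L, 5L, 5L, 2L, 5L, 5L, 2L, 6L, 6L)
)

by_cluster <- split(df, df$cluster)

# Waste-weighted centers (one row per cluster)
weighted <- t(sapply(by_cluster, function(d)
  c(Latitude  = weighted.mean(d$Latitude, d$Waste),
    Longitude = weighted.mean(d$Longitude, d$Waste))))

# Plain (unweighted) centers
unweighted <- t(sapply(by_cluster, function(d)
  c(Latitude = mean(d$Latitude), Longitude = mean(d$Longitude))))

# Differences between the two methods, per cluster
diffs <- weighted - unweighted
print(round(diffs, 5))
```

The differences are tiny compared with the spread of the data, which is why the red points sit on top of the centers drawn by fviz_cluster().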

mastropi
  • Following your last comment to my previous answer, I have edited this one so that it also works for the case when the analyzed variables are just `Longitude` and `Latitude` (see the `if` block in the last part of the code where the centers are added to the `fviz_cluster()` graph). Note that I did _not_ edit the generated plot nor the definition of `vars`, which still correspond to the case of three analyzed variables. – mastropi Mar 28 '20 at 15:11
  • I hope this satisfies your needs. Also, since you are a new contributor I kindly point you to the instructions of what to do when you think a posted answer is helpful to your question: https://stackoverflow.com/help/someone-answers . I apologize if you already were aware of this procedure. Thanks! :) – mastropi Mar 28 '20 at 15:11