Group the data by column and obtain the mean of the rest of the variables in R

Question

The dataset is shown as follows, with quantitative and qualitative variables

In total I have 21k observations. In order to do a cluster analysis, I would like to group the 21K observations by the column "Neighborhood" (there are a total of 140 neighborhoods). So I would like to group by "neighborhood" and get the mean of the quantitative variables for each neighborhood (e.g. "buy price") and the mode for the qualitative variables (e.g. "energy certificate", "has parking", etc). So that the dataset is simply 140 rows (neighborhoods) and their means and modes depending on the variable concerned. I hope someone can help me. Thank you very much in advance.

Can you provide a reproducible example? Screenshots of your dataset won't help. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — william3031, May 12 '21 at 02:00

score 1 · Accepted Answer · answered May 11 '21 at 20:00

I'll emulate with mtcars and dplyr.

library(dplyr)
quant <- c("mpg", "disp", "hp")
qual <- c("vs", "am", "gear")

mtcars %>%
  group_by(cyl) %>%
  summarize(across(quant, mean), across(qual, ~ names(sort(table(.),decreasing=TRUE))[1]))
# # A tibble: 3 x 7
#     cyl   mpg  disp    hp vs    am    gear 
#   <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1     4  26.7  105.  82.6 1     1     4    
# 2     6  19.7  183. 122.  1     0     4    
# 3     8  15.1  353. 209.  0     0     3

The names(table(.))[1] is meant to be your "mode" of a qualitative variable. We can validate that it is doing what we expect with a quick table:

xtabs(~cyl+vs, data=mtcars)
#    vs
# cyl  0  1
#   4  1 10
#   6  3  4
#   8 14  0
xtabs(~cyl+am, data=mtcars)
#    am
# cyl  0  1
#   4  3  8
#   6  4  3
#   8 12  2
xtabs(~cyl+gear, data=mtcars)
#    gear
# cyl  3  4  5
#   4  1  8  2
#   6  2  4  1
#   8 12  0  2

showing that for gear 4, 6, and 8, respectively, the most common vs is 1, 1, and 0; for am: 1, 0, and 0; for gear: 4, 4, and 3. Those correspond to the values in the return above.

In your case, change cyl to neighborhood, and make sure your qual and quant have the desired variables listed.

Incredible maestro, thank you very much :)) – George May 11 '21 at 22:16 — George, May 11 '21 at 22:16

Group the data by column and obtain the mean of the rest of the variables in R

1 Answers1