In total I have 21k observations. In order to do a cluster analysis, I would like to group the 21K observations by the column "Neighborhood" (there are a total of 140 neighborhoods). So I would like to group by "neighborhood" and get the mean of the quantitative variables for each neighborhood (e.g. "buy price") and the mode for the qualitative variables (e.g. "energy certificate", "has parking", etc). So that the dataset is simply 140 rows (neighborhoods) and their means and modes depending on the variable concerned. I hope someone can help me. Thank you very much in advance.
Asked
Active
Viewed 87 times
1 Answers
1
I'll emulate with mtcars
and dplyr
.
library(dplyr)
quant <- c("mpg", "disp", "hp")
qual <- c("vs", "am", "gear")
mtcars %>%
group_by(cyl) %>%
summarize(across(quant, mean), across(qual, ~ names(sort(table(.),decreasing=TRUE))[1]))
# # A tibble: 3 x 7
# cyl mpg disp hp vs am gear
# <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
# 1 4 26.7 105. 82.6 1 1 4
# 2 6 19.7 183. 122. 1 0 4
# 3 8 15.1 353. 209. 0 0 3
The names(table(.))[1]
is meant to be your "mode" of a qualitative variable. We can validate that it is doing what we expect with a quick table:
xtabs(~cyl+vs, data=mtcars)
# vs
# cyl 0 1
# 4 1 10
# 6 3 4
# 8 14 0
xtabs(~cyl+am, data=mtcars)
# am
# cyl 0 1
# 4 3 8
# 6 4 3
# 8 12 2
xtabs(~cyl+gear, data=mtcars)
# gear
# cyl 3 4 5
# 4 1 8 2
# 6 2 4 1
# 8 12 0 2
showing that for gear 4, 6, and 8, respectively, the most common vs
is 1
, 1
, and 0
; for am
: 1
, 0
, and 0
; for gear
: 4
, 4
, and 3
. Those correspond to the values in the return above.
In your case, change cyl
to neighborhood
, and make sure your qual
and quant
have the desired variables listed.

r2evans
- 141,215
- 6
- 77
- 149