Grouping the levels of a variable based on their mean in R

Question

I want to group the levels of Neighborhood based on the mean price of each level. Is this the right way to do it?

ames.train.c <- ames.train.c %>%
  group_by(Neighborhood) %>%
  mutate(Neighborhood.Cat = ifelse(mean(price) < 140000, "A",
                            ifelse(mean(price) < 200000, "B",
                            ifelse(mean(price) < 260000, "C",
                            ifelse(mean(price) < 300000, "D",
                            ifelse(mean(price) < 340000, "E"))))))

The data can be found here: https://d3c33hcgiwev3.cloudfront.net/_fc6ea3b3b1af3f4fd9afb752e85d4299_ames_train.Rdata?Expires=1633651200&Signature=P7oxFR0IzJ2UP73GI0aJVua67DxUlvoWYhXdQwHf2CZefX2J~0KAxosAWMHtHxcKH81l87~uRBS0FqBb2MUA2UCQUWCg3ldR9mBQypVTq4ofv3wwOq3-r7d6hw1zM72FYfX2oRYgsKzTl5ucb9oQVUa~jBOW1tF3sTtL0h-ykr4_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A

Since you did not provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610), it's impossible to tell whether this works for your data, but it looks fine to me. What are you struggling with? — dario, Oct 03 '21 at 16:58
I would replace the multiple ifelse calls with a single case_when call. — deschen, Oct 03 '21 at 17:02

score 1 · Accepted Answer · answered Oct 03 '21 at 17:20

I think this approach might help you:

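library(dplyr)

# Bin edges and labels for the mean price of each neighborhood
cut_breaks <- c(0, 140000, 200000, 260000, 300000, 340000)
cut_labels <- c("A", "B", "C", "D", "E")

# right = FALSE makes the bins [lower, upper), matching the `<` tests in the question;
# a mean of 340000 or more falls outside the breaks and becomes NA
ames.train.c %>%
  group_by(Neighborhood) %>%
  mutate(Neighborhood.Cat = cut(mean(price), cut_breaks,
                                labels = cut_labels, right = FALSE))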
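
One detail worth checking with cut() is which side of each bin is closed. With the default right = TRUE the bins are (lower, upper], so a mean of exactly 140000 is labelled "A", whereas the `<` comparisons in the question would label it "B". Passing right = FALSE gives [lower, upper) bins instead. Here is a minimal sketch on made-up data; the `toy` data frame, its values, and the column names `mean_price`, `cat_default` and `cat_left` are assumptions for illustration, not part of the Ames data:

```r
library(dplyr)

cut_breaks <- c(0, 140000, 200000, 260000, 300000, 340000)
cut_labels <- c("A", "B", "C", "D", "E")

# Hypothetical toy data: neighborhood "X" has a mean price exactly on a break (140000)
toy <- data.frame(
  Neighborhood = c("X", "X", "Y", "Y"),
  price        = c(130000, 150000, 210000, 230000)
)

toy_cat <- toy %>%
  group_by(Neighborhood) %>%
  mutate(
    mean_price  = mean(price),  # one value per group, recycled to every row
    # default: bins closed on the right, (lower, upper]
    cat_default = cut(mean_price, cut_breaks, labels = cut_labels),
    # bins closed on the left, [lower, upper), like the `<` tests
    cat_left    = cut(mean_price, cut_breaks, labels = cut_labels, right = FALSE)
  ) %>%
  ungroup()

print(toy_cat)
```

In this sketch "X" is labelled "A" by the default call and "B" with right = FALSE, while "Y" (mean 220000) is "C" either way. Which behaviour is wanted depends on how boundary values should be treated.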

score 0 · Answer 2 · answered Oct 03 '21 at 17:27

You didn't give us the data so I had to prepare it myself.


library(tidyverse)

df = tibble(
  Neighborhood = rep(1:5, each=1000),
  price = c(rnorm(1000, 100000, 1000),
            rnorm(1000, 150000, 1000),
            rnorm(1000, 90000, 1000),
            rnorm(1000, 200000, 1000),
            rnorm(1000, 300000, 1000))
)

Now we will create a function for assigning categories.

f = function(data) data %>% mutate(
  Neighborhood.Cat = 
    case_when(
      mean(price) < 140000  ~ "A",
      mean(price) < 200000  ~ "B",
      mean(price) < 260000  ~ "C",
      mean(price) < 300000  ~ "D",
      mean(price) < 340000  ~ "E"
  ))

With this function, you can modify groups in the following way:

df = df %>% group_by(Neighborhood) %>% 
  group_modify(~f(.x))

Let's check the result:

df %>% group_by(Neighborhood) %>% 
  summarise(mean = mean(price),
            Cat = Neighborhood.Cat[1])

Output:

# A tibble: 5 x 3
  Neighborhood    mean Cat  
         <int>   <dbl> <chr>
1            1 100020. A    
2            2 150011. B    
3            3  89981. A    
4            4 200052. C    
5            5 299998. D
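
As a possible simplification, the same categories can probably be produced without a helper function or group_modify(), because mutate() on a grouped data frame already evaluates mean(price) once per group. The sketch below assumes a small made-up data frame (`toy`, not the Ames data) and adds an explicit fallback branch for means of 340000 or more, which the original conditions leave as NA implicitly; `mean_price` is a helper column introduced here for illustration:

```r
library(dplyr)

# Hypothetical toy data (not the Ames data): three neighborhoods, two houses each
toy <- data.frame(
  Neighborhood = rep(c("N1", "N2", "N3"), each = 2),
  price        = c(100000, 120000, 190000, 230000, 350000, 370000)
)

toy_cat <- toy %>%
  group_by(Neighborhood) %>%
  mutate(
    mean_price = mean(price),  # computed once per group, recycled to every row
    Neighborhood.Cat = case_when(
      mean_price < 140000 ~ "A",
      mean_price < 200000 ~ "B",
      mean_price < 260000 ~ "C",
      mean_price < 300000 ~ "D",
      mean_price < 340000 ~ "E",
      TRUE                ~ NA_character_  # explicit fallback: mean >= 340000
    )
  ) %>%
  ungroup()

# One row per neighborhood with its mean and category
distinct(toy_cat, Neighborhood, mean_price, Neighborhood.Cat)
```

Storing the mean in its own column avoids recomputing it in every condition and makes the result easy to inspect. In this toy data the group means are 110000, 210000 and 360000, so the categories are "A", "C" and NA.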
