I am preparing a dataset for a PCA. All my variables are numeric, so I can calculate the median of each of them.

I have two grouping variables. I need to calculate the median within each group (say the first group is CATEGORIA=6 and Dpto='A', and so on) and use that value to replace the NA cells in that group. My code is:

for (j in 10:46){
 consolidado1<-consolidado%>% 
 group_by(CATEGORIA,Dpto,.add=T)%>%
 mutate_at(vars(j),~ ifelse(is.na(.),median(consolidado[,j],na.rm=T), .))
}

However, it's not replacing anything, and when I test a single value of j, for example:

 consolidado1<-consolidado%>% 
 group_by(CATEGORIA,Dpto,.add=T)%>%
 mutate_at(vars(11),~ ifelse(is.na(.),median(consolidado[,11],na.rm=T), .))

The NAs are replaced not with the group median but with the median of the whole column.

What's the correct way of doing this? How do I properly extract the group median?

JeffCJ

1 Answer

When you subset the column from the data frame (consolidado[,11]), you get the entire column back; the grouping is ignored, hence you get the median of the whole column. Inside the formula, use . instead: it refers to the values of the current column within the current group, so median(., na.rm=TRUE) is the grouped median.

library(dplyr)
consolidado1 <- consolidado %>%
  group_by(CATEGORIA, Dpto) %>%
  mutate(across(10:46, ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))

# With `mutate_at`:
#   mutate_at(10:46, ~ ifelse(is.na(.), median(., na.rm = TRUE), .))
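To see the grouped behaviour on something small, here is a minimal sketch. The `toy` tibble and its columns `x` and `y` are made up for illustration and only stand in for `consolidado`; the columns are selected by name rather than by position to keep the example unambiguous.

```r
library(dplyr)

# Hypothetical toy data standing in for `consolidado` (not the asker's data)
toy <- tibble(
  CATEGORIA = c(6, 6, 6, 7, 7, 7),
  Dpto      = c("A", "A", "A", "B", "B", "B"),
  x         = c(1, 3, NA, 10, NA, 30),
  y         = c(NA, 2, 4, 100, 200, NA)
)

imputed <- toy %>%
  group_by(CATEGORIA, Dpto) %>%
  # `.x` is the current column restricted to the current group,
  # so median(.x, na.rm = TRUE) is the group median
  mutate(across(c(x, y), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup()

imputed
```

In this toy data the NA in `x` for the CATEGORIA=6, Dpto='A' group becomes 2 (the median of 1 and 3), whereas the whole-column median of `x` would have been 6.5. Note that a group whose values are all NA stays NA, since its median is NA as well.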
Ronak Shah
  • That's exactly what I needed, thanks! So, if I understood correctly, the dot is used to reference the values of the groups? – JeffCJ Aug 25 '20 at 18:31