This question is based on "How do I calculate a grouped z score in R using dplyr?"

Here the data are scaled to z-scores twice: once within each group and once across the whole dataset.

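    library(dplyr)   # group_by(), mutate(), ungroup(), %>%
    library(tidyr)   # gather()
    # Reshape to long format, then compute z-scores per group and overall
    dat <- iris %>%
      gather(variable, value, -Species) %>%
      group_by(Species, variable) %>%
      mutate(z_score_group = (value - mean(value)) / sd(value)) %>%
      ungroup() %>%
      mutate(z_score_ungrouped = (value - mean(value)) / sd(value))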

Scaling without grouping preserves the order of the data.

    > identical(order(dat$z_score_ungrouped), order(dat$value))
    [1] TRUE

Interestingly, however, scaling group-wise changes the order of the data.

    > identical(order(dat$z_score_group), order(dat$value))
    [1] FALSE

In my opinion, scaling should never change the order of the data, because that has a large impact on rank-based analyses (e.g. ROC curves). Does anyone know why grouping changes the order?

MrNetherlands
  • But when you scale within groups or subsets of the data, you are using a different `mean` and `sd` for each one. Think of it as a different baseline. A specific value can then be large relative to its subgroup but not relative to the whole dataset. Consider normalising the sets {1,2,5} and {20,21,22} together and then separately, and check what you get for the value 5, for example. – AntoniosK Mar 22 '18 at 16:54
  • Run this as an example: `x1 = c(1,2,5); x2 = c(20,21,22); x = c(x1,x2); (x1 - mean(x1)) / sd(x1); (x2 - mean(x2)) / sd(x2); (x - mean(x)) / sd(x)` – AntoniosK Mar 22 '18 at 16:57
  • Ah, I see, you are totally right. The order is of course only preserved within the groups, not across them. Thanks for clarifying. – MrNetherlands Mar 22 '18 at 17:13
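
The conclusion reached in the comments can be checked with a small base-R sketch. It reuses the toy sets {1,2,5} and {20,21,22} from the comment and compares the ordering after ungrouped and group-wise scaling. The names `z`, `g`, `z_ungrouped`, `z_grouped` and `within_ok` are made up for this illustration, and `ave()` is used here only as a dependency-free stand-in for the `group_by()`/`mutate()` step in the question.

```r
# Toy data from the comment: two groups with very different baselines
x1 <- c(1, 2, 5)
x2 <- c(20, 21, 22)
x  <- c(x1, x2)
g  <- rep(c("a", "b"), each = 3)   # hypothetical group labels

# z-score helper (illustrative name)
z <- function(v) (v - mean(v)) / sd(v)

z_ungrouped <- z(x)                # one mean and sd for all values
z_grouped   <- ave(x, g, FUN = z)  # separate mean and sd per group

# Ordering across the whole vector
print(identical(order(z_ungrouped), order(x)))
print(identical(order(z_grouped), order(x)))

# Ordering inside each group, checked group by group
within_ok <- all(sapply(
  split(seq_along(x), g),
  function(i) identical(order(z_grouped[i]), order(x[i]))
))
print(within_ok)
```

The value 5 is the largest in its own group, so its group-wise z-score ends up above that of 22, even though 5 is smaller than every value in the second group. That is why the global ordering differs after group-wise scaling while the ordering inside each group is unchanged.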
