
Using the iris dataset I'm trying to calculate a z score for each of the variables. I have the data in tidy format, by performing the following:

library(reshape2)
library(dplyr)
test <- melt(iris, id.vars = 'Species')

That gives me the following:

  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4

But when I try to create a z-score column for each group (e.g. the z-score for Sepal.Length will not be comparable to that of Sepal.Width) using the following:

test <- test %>% 
  group_by(Species, variable) %>% 
  mutate(z_score = (value - mean(value)) / sd(value))

The resulting z-scores have not been grouped, and are based on all of the data.

What's the best way to return the z-scores by group using dplyr?

Many thanks!

Worville11
  • What exactly are you trying to get? If you run `test %>% group_by(Species, variable) %>% summarize(mean = mean(z_score), sd = sd(z_score))` you'll see that each combination of Species/variable has a mean of 0 and standard deviation of 1, so the z-score is being calculated by group. – Daniel Anderson Sep 12 '17 at 21:54
  • You're getting non-grouped result as you're calculating `z` for each value `value - ...`. Here you ask to return each value minus mean, sd ... – pogibas Sep 12 '17 at 21:56
  • Apologies for lack of clarity - essentially for each value within the dataframe I'm trying to calculate the z-score for it, compared to the group that it belongs to. So the `z-score` for Sepal.Length will not be comparable to that of the `z-score` for Sepal.Width. Is that clearer? – Worville11 Sep 12 '17 at 22:03
  • To calculate z-scores, use base function `scale`. – Rui Barradas Sep 12 '17 at 22:12

2 Answers


I believe you were overcomplicating things by computing the z-scores manually with mean/sd. Just use the base function `scale`.

test <- test %>% 
  group_by(Species, variable) %>% 
  mutate(z_score = as.numeric(scale(value)))  # as.numeric() drops the one-column matrix that scale() returns

test
## A tibble: 600 x 4
## Groups:   Species, variable [12]
#   Species     variable value     z_score
#    <fctr>       <fctr> <dbl>       <dbl>
# 1  setosa Sepal.Length   5.1  0.26667447
# 2  setosa Sepal.Length   4.9 -0.30071802
# 3  setosa Sepal.Length   4.7 -0.86811050
# 4  setosa Sepal.Length   4.6 -1.15180675
# 5  setosa Sepal.Length   5.0 -0.01702177
# 6  setosa Sepal.Length   5.4  1.11776320
# 7  setosa Sepal.Length   4.6 -1.15180675
# 8  setosa Sepal.Length   5.0 -0.01702177
# 9  setosa Sepal.Length   4.4 -1.71919923
#10  setosa Sepal.Length   4.9 -0.30071802
## ... with 590 more rows
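To confirm the grouping worked, you can summarise each Species/variable combination; every group should come out with mean of (approximately) 0 and sd of 1, as Daniel Anderson noted in the comments. A self-contained sketch (the `.groups = "drop"` argument assumes a recent dplyr; older versions just print a regrouping message without it):

```r
library(reshape2)
library(dplyr)

test <- melt(iris, id.vars = "Species") %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value)))

# Each of the 12 Species/variable groups should have mean ~0 and sd 1
check <- test %>%
  summarise(mu = mean(z_score), sigma = sd(z_score), .groups = "drop")
check
```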

Edit.
Following a comment by the OP, I am posting some code to get the rows where Petal.Width has a positive z_score.

i1 <- which(test$variable == "Petal.Width" & test$z_score > 0)
test[i1, ]
## A tibble: 61 x 4
## Groups:   Species, variable [3]
#   Species    variable value  z_score
#    <fctr>      <fctr> <dbl>    <dbl>
# 1  setosa Petal.Width   0.4 1.461300
# 2  setosa Petal.Width   0.3 0.512404
# 3  setosa Petal.Width   0.4 1.461300
# 4  setosa Petal.Width   0.4 1.461300
# 5  setosa Petal.Width   0.3 0.512404
# 6  setosa Petal.Width   0.3 0.512404
# 7  setosa Petal.Width   0.3 0.512404
# 8  setosa Petal.Width   0.4 1.461300
# 9  setosa Petal.Width   0.5 2.410197
#10  setosa Petal.Width   0.4 1.461300
## ... with 51 more rows
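The same subset can also be pulled with `dplyr::filter` if you prefer to stay in the pipe; this is just a stylistic alternative to the base-R `which` approach above:

```r
library(reshape2)
library(dplyr)

test <- melt(iris, id.vars = "Species") %>%
  group_by(Species, variable) %>%
  mutate(z_score = as.numeric(scale(value)))

# Rows where Petal.Width has a positive z-score
pos_pw <- test %>%
  ungroup() %>%
  filter(variable == "Petal.Width", z_score > 0)
nrow(pos_pw)  # 61, matching the which() result
```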
Rui Barradas
  • If you just filter down to `Petal.Width` though on that, all of the z-scores are negative - which doesn't make sense. – Worville11 Sep 13 '17 at 18:20
  • @Worville11 Actually, no, there are 61 rows with positive `z_score` values. See my edit for a way to get them. – Rui Barradas Sep 13 '17 at 22:51
  • 1
    @Worville11 Also, `scale` is a base R function and therefore thoroughly tested. It does what `mean/sd` does, but with more flexibility. See its help page `?scale` for details. – Rui Barradas Sep 13 '17 at 23:00

Your code is already giving you z-scores by group. These grouped z-scores are comparable precisely because each group has been individually scaled to mean = 0 and sd = 1, rather than each value being scaled by the mean and sd of the full data frame. For example:

library(tidyverse)

First, set up the melted data frame:

dat = iris %>% 
  gather(variable, value, -Species) %>%
  group_by(Species, variable) %>% 
  mutate(z_score_group = (value - mean(value)) / sd(value)) %>%   # You can also use scale(value) as pointed out by @RuiBarradas
  ungroup %>% 
  mutate(z_score_ungrouped = (value - mean(value)) / sd(value)) 
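As an aside, `gather` has since been superseded in tidyr by `pivot_longer`; an equivalent reshaping with the newer API (note `pivot_longer` orders rows by original row rather than by column, but the per-group z-scores are the same):

```r
library(tidyverse)

dat2 <- iris %>%
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(Species, variable) %>%
  mutate(z_score_group = as.numeric(scale(value))) %>%
  ungroup()
```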

Now look at the first three rows and compare with direct calculation:

head(dat, 3)

#   Species     variable value z_score_group z_score_ungrouped
# 1  setosa Sepal.Length   5.1     0.2666745         0.8278959
# 2  setosa Sepal.Length   4.9    -0.3007180         0.7266552
# 3  setosa Sepal.Length   4.7    -0.8681105         0.6254145

# z-scores by group
with(dat, (value[1:3] - mean(value[Species=="setosa" & variable=="Sepal.Length"])) / sd(value[Species=="setosa" & variable=="Sepal.Length"]))

# [1]  0.2666745 -0.3007180 -0.8681105

# ungrouped z-scores
with(dat, (value[1:3] - mean(value)) / sd(value))

# [1] 0.8278959 0.7266552 0.6254145

Now visualize the z-scores. The first graph below is the raw data. The second is the ungrouped z-scores: we've just rescaled the data to an overall mean = 0 and SD = 1. The third graph is what your code produces: each group has been individually scaled to mean = 0 and SD = 1.

gridExtra::grid.arrange(
  grobs=setNames(names(dat)[c(3,5,4)], names(dat)[c(3,5,4)]) %>% 
    map(~ ggplot(dat %>% mutate(group=paste(Species,variable,sep="_")), 
                 aes_string(.x, colour="group")) + geom_density()),
  ncol=1)

[Figure: density plots of value, z_score_ungrouped, and z_score_group, coloured by Species/variable group]

eipi10
  • Thanks, nice graphs. Your first snippet of code for me is just returning the same score for the `group` and `ungrouped` versions. Any ideas as to why? – Worville11 Sep 13 '17 at 18:25
  • I'm not sure. Can you update your question with the exact code you're running? – eipi10 Sep 22 '17 at 05:34
  • Hello @eipi10, all my graphs look the same from your example (they look like z_score_ungrouped) :( – CaseebRamos Sep 21 '20 at 17:31
  • There is a conflict between the `group_by` function in `plyr` and `dplyr`. You should detach `plyr` to use `group_by` from `dplyr`, or call it explicitly as `dplyr::group_by`. For more information check: https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with – Mohieddin Jafari Nov 05 '20 at 11:13