Summarize function for dplyr doesn't output correct results by row for multiple columns

Question

I have a dataset with five numeric columns, rachis1 to rachis5, and 100 rows, each labelled with a name stored as a factor. I want summary statistics for each row across all five columns.

head(rl)
  name rachis1 rachis2 rachis3 rachis4 rachis5
1 R04-001     2.4     2.6     2.7     3.0     2.4
2 R04-002     7.0     7.4     7.7     6.8     7.4
3 R04-003     3.5     3.7     3.9     4.1     3.8
4 R04-004     9.5     9.1     7.8     8.8     8.2
5 R04-005     3.0     3.3     3.4     3.8     3.3
6 R04-006     9.2     9.8     9.5     9.4    10.1

My code for this is:

library(dplyr)
####Rachis
RL<- rl %>%
  group_by(name) %>% 
  summarize(RL= mean(rachis1:rachis5), RLMAX = max(rachis1:rachis5),RLMIN = 
  min(rachis1:rachis5), RLSTD=sd(rachis1:rachis5),na.rm=T)
head(RL)
tail(RL)

My resulting analysis comes out as...

 head(RL)
 # A tibble: 6 x 6
  name    RL RLMAX RLMIN     RLSTD na.rm
<fctr> <dbl> <dbl> <dbl>     <dbl> <lgl>

1  R04-001   2.4   2.4   2.4        NA  TRUE
2  R04-002   7.0   7.0   7.0        NA  TRUE
3  R04-003   3.5   3.5   3.5        NA  TRUE
4  R04-004   9.0   9.5   8.5 0.7071068  TRUE
5  R04-005   3.0   3.0   3.0        NA  TRUE
6  R04-006   9.2   9.2   9.2        NA  TRUE

I was wondering why there is NA in RLSTD (the standard deviation) and why the min and max are not the min and max of the row. Is there another way to get my descriptive statistics?

Can you show what your data looks like at the start? My guess is your problem is your use of `rachis1:rachis5`, which will be a sequence in steps of 1 starting at the `rachis1` value and running toward the `rachis5` value. So if `rachis1` is 4 and `rachis5` is 6, then `rachis1:rachis5` will be `4, 5, 6`: the mean is 5, the min is 4 and the max is 6. Probably you should put your data in long format first... hard to know without seeing your data. [See here for tips on making reproducible examples](https://stackoverflow.com/q/5963269/903061) - using `dput()` to share data is very nice because it is copy/pasteable. – Gregor Thomas, Aug 01 '17 at 23:04
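To make the comment above concrete, here is a small base-R sketch using values from the `head(rl)` output in the question. It only illustrates how `:` behaves on non-integer endpoints; the variable names `row1` and `row4` are made up for the example.

```r
# `from:to` builds a sequence in steps of 1 starting at `from`;
# it does not select the columns between two column names.
row1 <- 2.4:2.4   # rachis1 and rachis5 for R04-001
row4 <- 9.5:8.2   # rachis1 and rachis5 for R04-004

print(row1)       # 2.4
print(row4)       # 9.5 8.5 (counts down by 1 and stops before passing 8.2)

sd(row1)          # NA: a standard deviation needs at least two values
mean(row4)        # 9: the mean of the generated values, not of the five columns
sd(row4)          # about 0.7071, the only non-NA RLSTD in the question's output
```

This would account for every row of the output in the question: five of the six rows produce a single value (so `sd()` is NA), and only R04-004 produces two.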

Justin · Answer 1 · 2017-08-02T00:05:00.807

0

I can't tell if you have duplicate row names among the 100 rows. If you do, and as you already have the data in this format and are using the tidyverse, perhaps this may work. Notice that I have placed the `na.rm` argument inside each statistic function call rather than in `summarise()` itself (passing it to `summarize()` directly is what created the extra `na.rm` column in your output).

RL <- rl %>%
  group_by(name) %>%
  summarise(RL    = mean(rachis1 + rachis2 + rachis3 + rachis4 + rachis5, na.rm = TRUE),
            RLMAX = max(rachis1 + rachis2 + rachis3 + rachis4 + rachis5, na.rm = TRUE),
            RLMIN = min(rachis1 + rachis2 + rachis3 + rachis4 + rachis5, na.rm = TRUE),
            RLSTD = sd(rachis1 + rachis2 + rachis3 + rachis4 + rachis5, na.rm = TRUE))
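Note that `+` adds the five columns together, so with one row per name each statistic in the call above is computed on a single sum. A minimal sketch of the same `summarise()` call using `c()` instead, which pools the five values for each name. The data frame is retyped from the `head(rl)` output in the question, and `rachis_stats` is a name chosen for the example:

```r
library(dplyr)

# First six rows, retyped from head(rl) in the question
rl <- data.frame(
  name    = factor(c("R04-001", "R04-002", "R04-003",
                     "R04-004", "R04-005", "R04-006")),
  rachis1 = c(2.4, 7.0, 3.5, 9.5, 3.0, 9.2),
  rachis2 = c(2.6, 7.4, 3.7, 9.1, 3.3, 9.8),
  rachis3 = c(2.7, 7.7, 3.9, 7.8, 3.4, 9.5),
  rachis4 = c(3.0, 6.8, 4.1, 8.8, 3.8, 9.4),
  rachis5 = c(2.4, 7.4, 3.8, 8.2, 3.3, 10.1)
)

# c() concatenates the five columns within each group, so every
# statistic sees all five measurements for that name.
rachis_stats <- rl %>%
  group_by(name) %>%
  summarise(
    RL    = mean(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
    RLMAX = max(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
    RLMIN = min(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE),
    RLSTD = sd(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = TRUE)
  )

print(rachis_stats)
```

On these six rows this should give a mean of 2.62, a max of 3.0 and a min of 2.4 for R04-001, which agrees with the results posted further down the thread.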

edited Aug 02 '17 at 00:05

answered Aug 01 '17 at 23:54

Justin


`RL <- rl %>% group_by(name) %>% mutate(RL = mean(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = T)) %>% mutate(RLMAX = max(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = T)) %>% mutate(RLMIN = min(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = T)) %>% mutate(RLSTD = sd(c(rachis1, rachis2, rachis3, rachis4, rachis5), na.rm = T)) %>% select(RL:RLSTD) %>% distinct(); head(RL)` – Jacob Aug 02 '17 at 00:09
Great! Sample a set of identical rows and test it to make sure the results match up. On second thought, I updated my original answer for something a little more efficient. But if the original works, it works. – Justin Aug 02 '17 at 00:12
Skeptical that the OP actually wants to add all these columns... `c()` seems better, but going to long format first would greatly simplify things. – Gregor Thomas Aug 02 '17 at 00:21
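A rough sketch of the long-format route suggested in the comment above, assuming a current tidyr where `pivot_longer()` is available (in 2017 the equivalent was `gather()`). Only two rows from the question are retyped here, and `rl_long`, `measure`, `value` and `long_stats` are names picked for the example:

```r
library(dplyr)
library(tidyr)

# Two rows retyped from head(rl) in the question
rl <- data.frame(
  name    = c("R04-001", "R04-004"),
  rachis1 = c(2.4, 9.5),
  rachis2 = c(2.6, 9.1),
  rachis3 = c(2.7, 7.8),
  rachis4 = c(3.0, 8.8),
  rachis5 = c(2.4, 8.2)
)

# One row per (name, measurement). names_to is set explicitly because
# its default, "name", would clash with the existing `name` column.
rl_long <- rl %>%
  pivot_longer(cols = starts_with("rachis"),
               names_to = "measure",
               values_to = "value")

# With a single value column, each statistic is one plain call
long_stats <- rl_long %>%
  group_by(name) %>%
  summarise(
    RL    = mean(value, na.rm = TRUE),
    RLMAX = max(value, na.rm = TRUE),
    RLMIN = min(value, na.rm = TRUE),
    RLSTD = sd(value, na.rm = TRUE)
  )

print(long_stats)
```

The appeal of this layout is that adding a sixth rachis column would need no change to the `summarise()` call.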
Agreed @Gregor. Some functions in dplyr, such as `mutate` can use the addition operator `+` to combine columns while other functions in the package like `select` use `c()`. Admittedly, I don't use the tidyverse much these days and can never remember... – Justin Aug 02 '17 at 00:40
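For what it's worth, the difference is not specific to dplyr verbs: in R generally, `+` adds vectors element by element while `c()` concatenates them. A tiny base-R illustration with two made-up vectors `a` and `b` (the names `sums` and `pooled` are also just for the example):

```r
a <- c(2.4, 7.0)
b <- c(2.6, 7.4)

sums   <- a + b     # element-wise sum: still length 2
pooled <- c(a, b)   # concatenation: length 4

length(sums)        # 2
length(pooled)      # 4

mean(sums)          # mean of the two sums
mean(pooled)        # mean of all four original values
```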

score -1 · Answer 2 · answered Aug 02 '17 at 02:00

Here are the results for the summarise code with dplyr. It works great now.

name    RL RLMAX RLMIN     RLSTD
 <fctr> <dbl> <dbl> <dbl>     <dbl>
 1  R04-001  2.62   3.0   2.4 0.2489980
 2  R04-002  7.26   7.7   6.8 0.3577709
 3  R04-003  3.80   4.1   3.5 0.2236068
 4  R04-004  8.68   9.5   7.8 0.6833740
 5  R04-005  3.36   3.8   3.0 0.2880972
 6  R04-006  9.60  10.1   9.2 0.3535534
