Calculating means from tall data or wide data in R

Question

I'm a beginner-intermediate R user that started learning R for laboratory research a few months ago. Thanks for your patience---especially if this ends up being a really stupid simple problem.

Problem

The tables as a reproducible example

The following code generates tables similar to my set, first as tall data, second as wide data.

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.4
library(tidyr)
#> Warning: package 'tidyr' was built under R version 3.4.4

tall <- tibble(X=c(3999.387, 3999.387, 3999.387,
                     3999.066, 3999.066, 3999.066,
                     3998.745, 3998.745, 3998.745,
                     3998.423, 3998.423, 3998.423,
                     3998.102, 3998.102, 3998.102), 
       Y=rnorm(15, mean=2, sd=1), 
       S=c("s1","s2","s3","s1","s2","s3","s1","s2","s3","s1","s2","s3","s1","s2","s3"))
head(tall)
#> # A tibble: 6 x 3
#>       X     Y S    
#>   <dbl> <dbl> <chr>
#> 1 3999. 3.07  s1   
#> 2 3999. 1.81  s2   
#> 3 3999. 4.02  s3   
#> 4 3999. 1.21  s1   
#> 5 3999. 0.771 s2   
#> 6 3999. 2.39  s3

wide <- spread(tall,X,Y)
head(wide)
#> # A tibble: 3 x 6
#>   S     `3998.102` `3998.423` `3998.745` `3999.066` `3999.387`
#>   <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 s1         0.454      1.50       1.84       1.21        3.07
#> 2 s2         2.04       0.392      1.50       0.771       1.81
#> 3 s3         1.38       0.992      0.790      2.39        4.02

^{Created on 2018-11-08 by the reprex package (v0.2.1)}

In the tall version, each unique value of X gets repeated for however many unique values of S there are. There are 5 unique X and 3 unique S. This is much more apparent in the wide data. In my real set I have 8010 unique X and 312 unique S. The tall data is nice because I can easily plot X vs Y and get one plotted line for each S.

The Question

What if I want to average all of the Ys at each unique value of X? It would look like this:

> # A tibble: 5 x 2
>       X     Y
>   <dbl>     <dbl>
> 1 3998.102  2.29
> 2 3998.423  1.63
> 3 3999.745  1.36
> 4 3999.066  1.66
> 5 3999.387  1.33

In this case I used the wide table, calculated the mean of each X column, and then manually constructed a new table.

Can I do this with map() functions from purrr? The documentation was confusing, probably because I have never used lapply() functions before.

Thanks for reading. I have a feeling this is really simple for most experienced users.

Ok, the question is well posed, but maybe way too long. See if this gets you what you want: `library(dplyr); tall %>% group_by(X) %>% summarise(mean_y = mean(Y))` — RLave, Nov 08 '18 at 16:26
If this is correct I'll add it as an answer and explain it in detail, I just wanted to make sure that's what you need. — RLave, Nov 08 '18 at 16:27
Try to remove everything in eccess in the question, it'll help others too. — RLave, Nov 08 '18 at 16:28
Thanks, I'll adapt it your solution to my large set and see if it works. It seems like it might. I've also shorted the post. thanks, it was definitely too long. — M. L., Nov 08 '18 at 16:31
Just tried this on my large data set and it works! Thanks very much for your response! — M. L., Nov 08 '18 at 18:09

score 1 · Accepted Answer · answered Nov 08 '18 at 16:28

What you're looking for is the dplyr package, which is at the core of the tidyverse. I'll show you how to achieve what you're trying to do with it, but there are tons of tutorials for it on-line, and it's really quite straightforward once you understand how to use it.

require(dplyr)
group_by(tall,X) %>%
  summarize(meanY=mean(Y))

First, you can tell dplyr to do everything you want as if your data is broken into seperate data.frames based on the grouping column, in this case X. Also, note that with dplyr you can "pipe" commands using %>%, which means the result of one command will be transferred to next one as its first argument, so you don't need to assign it every time or nest all your commands.

The second line creates a new table, where for every group (based on its X), it calculates the mean of all the Ys. The result is this:

# A tibble: 5 x 2
      X meanY
  <dbl> <dbl>
1 3998. 0.781
2 3998. 1.81 
3 3999. 1.37 
4 3999. 2.01 
5 3999. 2.02

And that's it. You're done. It's really powerful, simple and easy to learn. Another package you can use is data.table, but I find that it's powerfulness and conciseness comes at the expense of being a lot harder to learn (for me, anyway). It may require more lines to do things with dplyr, but it's easier for me to puzzle through the steps I need to take to achieve anything.

Good luck!

Wow. This was way easier than how I was trying to do it. You and @RLave had the same solution. I have used `dplyr` before for `filter()` and `select()`, but did not know about `group_by() %>% summarise()`. Thank you so much for your response. — M. L., Nov 08 '18 at 18:13
Great. Happy I could help. Look more at `summarise`, as well as `mutate` (to create new variables through calculation), and the rest of dplyr's tools. They're very useful and quite simple to understand. Don't forget to upvote and and accept! — iod, Nov 08 '18 at 18:15

Calculating means from tall data or wide data in R

Problem

The tables as a reproducible example

The Question

1 Answers1