
I have a data frame df that has an ID-column (called SNP) in which some IDs occur more than once.

     CHR   SNP    A1 A2    MAF NCHROBS
1:    1    1:197  C  T 0.3148     314
2:    1    1:205  G  C 0.2058     314
3:    1    1:206  A  C 0.0000     314
4:    1    1:219  C  G 0.8472     314
5:    1    1:223  A  C 0.7265     314
6:    1    1:224  G  T 0.3295     314
7:    1    1:197  C  T 0.3148     314
8:    1    1:205  G  C 0.0000     314
9:    1    1:206  A  C 0.0000     314
10:   1    1:219  C  G 0.0000     314
11:   1    1:223  A  C 0.0000     314
12:   1    1:224  G  T 0.0000     314
13:   1    1:197  C  T 0.4753     314
14:   1    1:205  G  C 0.1964     314
15:   1    1:206  A  C 0.0000     314
16:   1    1:219  C  G 0.6594     314
17:   1    1:223  A  C 0.8946     314
18:   1    1:224  G  T 0.2437     314

I would like to calculate the mean and standard deviation (SD) from the values in the MAF-column that share the same ID.

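library(data.table)  # fread()
library(purrr)       # map_df() and %>%

df <-
    list.files(pattern = "\\.csv$") %>%  # pattern is a regular expression, not a glob
    map_df(~fread(.))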

colMeans(df, rows=df$SNP == "1:197", cols=df$MAF)

Why is it not possible to specify values based on conditions with colMeans?

NelsonGon
PolII

4 Answers


Since you have a data.table,

df[, .(mu = mean(MAF), sigma = sd(MAF)), by = .(SNP) ]
#      SNP        mu      sigma
# 1: 1:197 0.3683000 0.09266472
# 2: 1:205 0.1340667 0.11620023
# 3: 1:206 0.0000000 0.00000000
# 4: 1:219 0.5022000 0.44493914
# 5: 1:223 0.5403667 0.47545926
# 6: 1:224 0.1910667 0.17093936

If you prefer base (despite using data.table), then

aggregate(df$MAF, list(df$SNP), function(a) c(mu = mean(a), sigma = sd(a)))
#   Group.1       x.mu    x.sigma
# 1   1:197 0.36830000 0.09266472
# 2   1:205 0.13406667 0.11620023
# 3   1:206 0.00000000 0.00000000
# 4   1:219 0.50220000 0.44493914
# 5   1:223 0.54036667 0.47545926
# 6   1:224 0.19106667 0.17093936
r2evans
    Thanks for your answer! It is great that the 'data.table' solution is a one-liner. – PolII Jun 29 '20 at 14:46
    Technically, the way `data.table` can chain operations with repeating `[` blocks, one can do a lot on one line with `dat[...][...][...][...]` ... though the line *width* might be daunting :-) (Yes, `data.table` does this *well*.) – r2evans Jun 29 '20 at 14:47

Using dplyr

library(dplyr)

df %>% 
  group_by(SNP) %>% 
  summarise(mean = mean(MAF),
            sd = sd(MAF))

Gives us:

 SNP    mean     sd
  <chr> <dbl>  <dbl>
1 1:197 0.368 0.0927
2 1:205 0.134 0.116 
3 1:206 0     0     
4 1:219 0.502 0.445 
5 1:223 0.540 0.475 
6 1:224 0.191 0.171 
Matt

To answer your question as to why colMeans is not working:

  1. If you look at the documentation of colMeans using ?colMeans, you will see that you are passing wrongly named arguments. The docs give the following usage: colMeans(x, na.rm = FALSE, dims = 1). It does not take any arguments named rows or cols, so when you run your code you get an "unused arguments" error.
  2. As to whether it is possible to pass conditions to colMeans: you have to apply them to df first, i.e. pass a subset of df as follows:
colMeans(df[df$SNP == "1:197", "MAF", drop=FALSE], na.rm=FALSE, dims=1)
  3. Note that it is important to pass drop=FALSE here, because you are subsetting a single column. When you select a single column of a data frame, the [ operator simplifies the result and converts it to a numeric vector; with drop=FALSE, it keeps the result as a data frame.
  4. If a numeric vector is passed to colMeans you get an error, because colMeans requires x to have at least 2 dimensions.
  5. As to the other question of how to calculate the mean per ID, others have shown good approaches in this thread; any of them works, you just have to choose one.
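As a small illustration of the subsetting approach, here is a minimal sketch on a toy data frame. The name toy and its three rows are made up for illustration (the values are borrowed from the question), and the sketch assumes a plain data.frame; a data.table handles [ differently.

```r
# Toy data: a hypothetical three-row excerpt, not the asker's full data
toy <- data.frame(
  SNP = c("1:197", "1:205", "1:197"),
  MAF = c(0.3148, 0.2058, 0.4753),
  stringsAsFactors = FALSE
)

# Without drop = FALSE the single column collapses to a numeric vector
collapsed <- toy[toy$SNP == "1:197", "MAF"]
is.null(dim(collapsed))   # TRUE: no dimensions left, so colMeans() would fail

# With drop = FALSE the result stays a one-column data frame
kept <- toy[toy$SNP == "1:197", "MAF", drop = FALSE]
res <- colMeans(kept)     # named numeric vector with one element, "MAF"
```

Since only one column is involved, mean(toy$MAF[toy$SNP == "1:197"]) gives the same number with less ceremony; colMeans becomes useful once several numeric columns are selected at once.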
monte

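You could use the tapply() function, with MAF as the values and SNP as the grouping variable (tapply() coerces the grouping vector to a factor if it is not one already):

  mean.MAF=tapply(df$MAF,df$SNP,mean)

  sd.MAF=tapply(df$MAF,df$SNP,sd)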
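
For reference, a self-contained sketch of how tapply() groups values, using made-up vectors (maf and snp are illustrative names, not the asker's data). The first argument holds the values to summarise and the second holds the grouping variable.

```r
# Hypothetical toy vectors
maf <- c(0.2, 0.4, 0.1, 0.3)
snp <- c("a", "a", "b", "b")

mean_by_snp <- tapply(maf, snp, mean)  # one mean per distinct value of snp
sd_by_snp   <- tapply(maf, snp, sd)    # one SD per distinct value of snp
```

Each result is a named one-dimensional array, so a single group can be looked up by name, e.g. mean_by_snp["a"].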