Use tapply to generate variance for subsets of data

Question

I have a list of genes with 1-3 probes for each gene, and an intensity value for each probe. An example is as follows:

GENE_ID             Probes                  Intensity 
GENE:JGI_V11_100009 GENE:JGI_V11_1000090102 253.479375
GENE:JGI_V11_100009 GENE:JGI_V11_1000090202 712.235625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360103 449.065625
GENE:JGI_V11_100036 GENE:JGI_V11_1000360203 641.341875
GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.07125
GENE:JGI_V11_100044 GENE:JGI_V11_1000440101 456.133125
GENE:JGI_V11_100045 GENE:JGI_V11_1000450101 369.790625
GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.97375
GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.55125

I want to determine the variance between the probes for each individual gene, so that every gene has one variance value.

I am aware that I should use the tapply() function, but I don't know how to complete the call beyond:

tapply( , , var)

You could try: `tapply(df$Intensity, df$GENE_ID, FUN = var)`. In general, it looks like you are trying to do by group operations and this has been covered in a number of different Stack Overflow answers (one is https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group - the `tapply` solution is in the accepted answer). — Mike H., Mar 20 '18 at 14:14
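A minimal sketch of the `tapply` call suggested in the comment above. It assumes the table from the question has been read into a data frame named `df` (the name is an assumption, and only three of the genes are retyped here for brevity):

```r
# Rebuild part of the example data from the question; `df` is an assumed name.
df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044"),
  Intensity = c(253.479375, 712.235625,
                449.065625, 641.341875, 1237.07125,
                456.133125)
)

# tapply(values, grouping, function): one variance per distinct GENE_ID.
gene_var <- tapply(df$Intensity, df$GENE_ID, FUN = var)
gene_var

# A gene with a single probe yields NA, because the sample variance
# is undefined for one observation.
is.na(gene_var[["GENE:JGI_V11_100044"]])
```

The result is a named array with one entry per gene, so an individual gene can be looked up by its ID.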

score 0 · Accepted Answer · answered Mar 20 '18 at 13:59

You can use data.table or dplyr to accomplish this. This is a classic group_by case:

library(dplyr)
df %>% 
    group_by(GENE_ID) %>% 
    mutate(new_var = var(Intensity))


library(data.table)
setDT(df)
df[, new_var := var(Intensity), .(GENE_ID)]

The output in both cases is (shown here as printed by data.table):

               GENE_ID                  Probes Intensity   new_var
1: GENE:JGI_V11_100009 GENE:JGI_V11_1000090102  253.4794  105228.6
2: GENE:JGI_V11_100009 GENE:JGI_V11_1000090202  712.2356  105228.6
3: GENE:JGI_V11_100036 GENE:JGI_V11_1000360103  449.0656  168802.8
4: GENE:JGI_V11_100036 GENE:JGI_V11_1000360203  641.3419  168802.8
5: GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.0712  168802.8
6: GENE:JGI_V11_100044 GENE:JGI_V11_1000440101  456.1331        NA
7: GENE:JGI_V11_100045 GENE:JGI_V11_1000450101  369.7906        NA
8: GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.9738 6282014.8
9: GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.5513 6282014.8
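If one row per gene is wanted instead of the variance repeated on every probe row, `summarise()` can take the place of `mutate()`. This is only a sketch: it assumes `df` holds the question's data (a subset is retyped here), and the column name `new_var` is carried over from the answer above.

```r
library(dplyr)

# Subset of the example data from the question; `df` is an assumed name.
df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044"),
  Intensity = c(253.479375, 712.235625,
                449.065625, 641.341875, 1237.07125,
                456.133125)
)

# summarise() collapses each group to a single row,
# whereas mutate() keeps every original row.
gene_var <- df %>%
  group_by(GENE_ID) %>%
  summarise(new_var = var(Intensity))
gene_var

# A comparable data.table form, after setDT(df), would be:
#   df[, .(new_var = var(Intensity)), by = GENE_ID]
```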

score 0 · Answer 2 · answered Mar 20 '18 at 14:22

This is a classic use case for ave from base R. tapply returns one value per unique level of the grouping factor(s), whereas ave returns a vector of the same length as its input, repeating each group's aggregate (the mean by default, or whatever FUN computes) across that group's rows:

gene_df$Probes_var <- ave(gene_df$Intensity, gene_df$GENE_ID, FUN=var)
gene_df

#               GENE_ID                  Probes Intensity Probes_var
# 1 GENE:JGI_V11_100009 GENE:JGI_V11_1000090102  253.4794   105228.6
# 2 GENE:JGI_V11_100009 GENE:JGI_V11_1000090202  712.2356   105228.6
# 3 GENE:JGI_V11_100036 GENE:JGI_V11_1000360103  449.0656   168802.8
# 4 GENE:JGI_V11_100036 GENE:JGI_V11_1000360203  641.3419   168802.8
# 5 GENE:JGI_V11_100036 GENE:JGI_V11_1000360303 1237.0712   168802.8
# 6 GENE:JGI_V11_100044 GENE:JGI_V11_1000440101  456.1331         NA
# 7 GENE:JGI_V11_100045 GENE:JGI_V11_1000450101  369.7906         NA
# 8 GENE:JGI_V11_100062 GENE:JGI_V11_1000620102 2839.9738  6282014.8
# 9 GENE:JGI_V11_100062 GENE:JGI_V11_1000620202 6384.5513  6282014.8
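To make the difference between the two functions concrete, here is a small sketch that runs both on the same data and compares the lengths of the results. It assumes a data frame named `gene_df` as in this answer, rebuilt from a subset of the question's rows; `per_gene` and `per_row` are names chosen for illustration.

```r
# Subset of the example data from the question.
gene_df <- data.frame(
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100044"),
  Intensity = c(253.479375, 712.235625,
                449.065625, 641.341875, 1237.07125,
                456.133125)
)

# tapply: one variance per distinct gene.
per_gene <- tapply(gene_df$Intensity, gene_df$GENE_ID, FUN = var)

# ave: one variance per row, repeated within each gene.
per_row <- ave(gene_df$Intensity, gene_df$GENE_ID, FUN = var)

length(per_gene)  # number of distinct genes
length(per_row)   # number of rows in gene_df

# Looking up each row's gene in the tapply result reproduces the ave output.
expanded <- as.vector(per_gene[as.character(gene_df$GENE_ID)])
isTRUE(all.equal(expanded, per_row))
```

Which one to use depends on the shape needed afterwards: `tapply` suits a per-gene summary, while `ave` suits adding a column to the original data frame.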
