How do I get an average for each replicate for each gene?

Question

I am new to R and coding and so this might be a very obvious answer!

I have a data set of log2 values from four Daphnia replicates for thousands of gene probes, each corresponding to a gene (a sample is shown below). For each replicate, I want the average expression for each gene. Is there a way I can do this?

Here's the top of my data frame:

     s_MC13_B1_Cd.Ni    s_MC13_B2_Cd.Ni    s_MC13_B3_Cd.Ni    s_MC13_B4_Cd.Ni
[1,] "3.32737034165695" "3.30082063716602" "3.35288781669471" "3.28130201442409"
[2,] "2.99677521546021" "2.97525202994054" "3.01357652548303" "2.98091704146676"
[3,] "3.22057255739705" "3.24001410852619" "3.19806113996704" "3.17850023932788"
[4,] "3.17934205285383" "3.22237873890637" "3.20299332433795" "3.19533925098426"
[5,] "3.20285957796094" "3.22659173854477" "3.22878128735342" "3.21307289097597"
[6,] "3.16945922109561" "3.1672329312015"  "3.17366131274743" "3.18792397254863"

[1,] "GENE:JGI_V11_100009"
[2,] "GENE:JGI_V11_100009"
[3,] "GENE:JGI_V11_100036"
[4,] "GENE:JGI_V11_100036"
[5,] "GENE:JGI_V11_100036"
[6,] "GENE:JGI_V11_100044"

Basically, I want the average of each column for each gene (column 5). For example, I want the average of the first two rows (GENE:JGI_V11_100009) in each column, and the same for every other gene in column 5.

Not sure what you're doing here, but you need to provide a [reproducible example](https://stackoverflow.com/help/mcve) in order for people to help figure out your problem. In addition to a reproducible example, you'll need to provide your expected output. We don't know what you mean by replicate and log2 values, so example data is key. Try using `dput(head(df, 30))` or something similar to reproduce it. — Matt W., Jan 18 '18 at 20:19
Please do not post *images* of data, just the data itself. Good refs exist for making MWEs, such as https://stackoverflow.com/questions/5963269/ and https://stackoverflow.com/help/mcve. (Quickie: `dput(head(x))` (as suggested by @MattW.) is a good start.) — r2evans, Jan 18 '18 at 20:38
Your use of `aggregate` is wrong. From [`?aggregate`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html), `by=` is *"a list of grouping elements"*, but you are providing simply a character string (`"GENE_ID"`), try `data$GENE_ID` there instead. — r2evans, Jan 18 '18 at 20:41
Your data looks like strings, not numbers. Have you performed any sort of data cleaning/manipulation before trying to aggregate? — r2evans, Jan 18 '18 at 20:42
Back to the head of your data. Just a pro tip: post the actual output of `dput(head(data))`. It will have structure and other code in it that we can just plug into R to recreate the data frame as you see it. Please edit your question with that! — Matt W., Jan 18 '18 at 21:19
`aggregate(.~GENE_ID,data,mean)` or `by(data[-5],data$GENE_ID,colMeans)`. To use `tapply` you need to be creative: `tapply(unlist(data[-5]),list(rep(data$GENE_ID,ncol(data[-5])),col(data[-5])),mean)` and whatever you are trying to do is `aggregate(data[-5],data[5],mean)` — Onyambu, Jan 19 '18 at 00:21

score 0 · Answer 1 · answered Jan 18 '18 at 21:32

I think I understand what you're trying to do, but with the correct data, I would be more certain.

Using the dplyr package:

We can rename the V5 column to be Gene to clean up the data slightly.

Then we want to change all the columns starting with "s_MC13" to be numeric. It looks like they're currently character strings.

Lastly, we `group_by` the gene and use `summarise_at`, which applies `mean` to each of those columns, so you get one mean per column for each gene.

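library(dplyr)

data_averages <- data %>%
    # name the gene-ID column (assumed to be called V5)
    rename(Gene = V5) %>%
    # convert the replicate columns from character to numeric
    mutate_at(vars(starts_with("s_MC13")), as.numeric) %>%
    group_by(Gene) %>%
    # one mean per replicate column for each gene
    summarise_at(vars(starts_with("s_MC13")), mean)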
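If you would rather not depend on dplyr, the same three steps can be sketched in base R. This is only a rough example built from the six rows in the question: the gene column name `V5` and the helper name `rep_cols` are assumptions for illustration, so adjust them to match your real data.

```r
# Toy data: the six rows shown in the question, values kept as character
# strings. The gene column name V5 is an assumption.
data <- data.frame(
  s_MC13_B1_Cd.Ni = c("3.32737034165695", "2.99677521546021", "3.22057255739705",
                      "3.17934205285383", "3.20285957796094", "3.16945922109561"),
  s_MC13_B2_Cd.Ni = c("3.30082063716602", "2.97525202994054", "3.24001410852619",
                      "3.22237873890637", "3.22659173854477", "3.1672329312015"),
  s_MC13_B3_Cd.Ni = c("3.35288781669471", "3.01357652548303", "3.19806113996704",
                      "3.20299332433795", "3.22878128735342", "3.17366131274743"),
  s_MC13_B4_Cd.Ni = c("3.28130201442409", "2.98091704146676", "3.17850023932788",
                      "3.19533925098426", "3.21307289097597", "3.18792397254863"),
  V5 = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009", "GENE:JGI_V11_100036",
         "GENE:JGI_V11_100036", "GENE:JGI_V11_100036", "GENE:JGI_V11_100044"),
  stringsAsFactors = FALSE
)

# Replicate columns, picked out by the prefix used in the question
rep_cols <- grep("^s_MC13", names(data), value = TRUE)

# Convert the strings to numbers. Going through as.character() means a
# factor column would be converted by its labels, not its integer codes.
data[rep_cols] <- lapply(data[rep_cols], function(x) as.numeric(as.character(x)))

# Mean of every replicate column within each gene
data_averages <- aggregate(data[rep_cols], by = list(Gene = data$V5), FUN = mean)
print(data_averages)
```

The result has one row per gene and one column of means per replicate. Note that `by` is given a named list here, which is what `aggregate` expects for its grouping argument.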
