
I am new to R and coding, so this might have a very obvious answer!

I have a data set with log2 values for four Daphnia replicates across thousands of gene probes, each corresponding to a gene (as shown in the image). For each replicate, I want to get the average expression for each gene. Is there a way I can do this?

RStudio Console Screenshot

Here's the top of my data frame:

     s_MC13_B1_Cd.Ni    s_MC13_B2_Cd.Ni    s_MC13_B3_Cd.Ni    s_MC13_B4_Cd.Ni
[1,] "3.32737034165695" "3.30082063716602" "3.35288781669471" "3.28130201442409"
[2,] "2.99677521546021" "2.97525202994054" "3.01357652548303" "2.98091704146676"
[3,] "3.22057255739705" "3.24001410852619" "3.19806113996704" "3.17850023932788"
[4,] "3.17934205285383" "3.22237873890637" "3.20299332433795" "3.19533925098426"
[5,] "3.20285957796094" "3.22659173854477" "3.22878128735342" "3.21307289097597"
[6,] "3.16945922109561" "3.1672329312015"  "3.17366131274743" "3.18792397254863"

[1,] "GENE:JGI_V11_100009"
[2,] "GENE:JGI_V11_100009"
[3,] "GENE:JGI_V11_100036"
[4,] "GENE:JGI_V11_100036"
[5,] "GENE:JGI_V11_100036"
[6,] "GENE:JGI_V11_100044"

Basically, I want to get an average of each column for each gene (column 5). For example, I want to average the first 2 rows (GENE:JGI_V11_100009) within each column, and do the same for every gene in column 5.

Parfait
A.Carter
  • Not sure what you're doing here, but you need to provide a [reproducible example](https://stackoverflow.com/help/mcve) so that people can help figure out your problem. In addition to a reproducible example, you'll need to provide your expected output. We don't know what you mean by replicate and log2 values, so example data is key. Try using `dput(head(df, 30))` or something similar to reproduce it. – Matt W. Jan 18 '18 at 20:19
  • Here's the top of my data frame: – A.Carter Jan 18 '18 at 20:35
  • Please do not post *images* of data, just the data itself. Good refs exist for making MWEs, such as https://stackoverflow.com/questions/5963269/ and https://stackoverflow.com/help/mcve. (Quickie: `dput(head(x))` (as suggested by @MattW.) is a good start.) – r2evans Jan 18 '18 at 20:38
  • Your use of `aggregate` is wrong. From [`?aggregate`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html), `by=` is *"a list of grouping elements"*, but you are providing simply a character string (`"GENE_ID"`), try `data$GENE_ID` there instead. – r2evans Jan 18 '18 at 20:41
  • Your data looks like strings, not numbers. Have you performed any sort of data cleaning/manipulation before trying to aggregate? – r2evans Jan 18 '18 at 20:42
  • Back to the head of your data. just a pro tip - post the actual output of using `dput(head(data))`. - it will have structure and other code in it that we can just plug into R and recreate the dataframe as you see it. please edit your question with that! – Matt W. Jan 18 '18 at 21:19
  • `aggregate(.~GENE_ID,data,mean)` – Onyambu Jan 19 '18 at 00:14
  • `aggregate(.~GENE_ID,data,mean)` or `by(data[-5],data$GENE_ID,colMeans)`. To use `tapply` you need to be creative: `tapply(unlist(data[-5]),list(rep(data$GENE_ID,ncol(data[-5])),col(data[-5])),mean)` and whatever you are trying to do is `aggregate(data[-5],data[5],mean)` – Onyambu Jan 19 '18 at 00:21
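The `aggregate` suggestions in the comments above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame is rebuilt by hand from the six rows shown in the question, the fifth column is assumed to be named `GENE_ID` (as the comments do), and `gene_means` and `gene_means_formula` are hypothetical names.

```r
# Hypothetical stand-in for the question's data: the six rows shown above,
# stored as character strings, with the gene column assumed to be GENE_ID.
data <- data.frame(
  s_MC13_B1_Cd.Ni = c("3.32737034165695", "2.99677521546021", "3.22057255739705",
                      "3.17934205285383", "3.20285957796094", "3.16945922109561"),
  s_MC13_B2_Cd.Ni = c("3.30082063716602", "2.97525202994054", "3.24001410852619",
                      "3.22237873890637", "3.22659173854477", "3.1672329312015"),
  s_MC13_B3_Cd.Ni = c("3.35288781669471", "3.01357652548303", "3.19806113996704",
                      "3.20299332433795", "3.22878128735342", "3.17366131274743"),
  s_MC13_B4_Cd.Ni = c("3.28130201442409", "2.98091704146676", "3.17850023932788",
                      "3.19533925098426", "3.21307289097597", "3.18792397254863"),
  GENE_ID = c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009", "GENE:JGI_V11_100036",
              "GENE:JGI_V11_100036", "GENE:JGI_V11_100036", "GENE:JGI_V11_100044"),
  stringsAsFactors = FALSE
)

# The replicate columns are strings, so convert them to numeric before averaging.
data[1:4] <- lapply(data[1:4], as.numeric)

# `by` must be a list of grouping elements; data[5] is a one-column data frame,
# which is a list, so it is accepted here (a bare string like "GENE_ID" is not).
gene_means <- aggregate(data[-5], by = data[5], FUN = mean)

# The formula interface expresses the same grouping.
gene_means_formula <- aggregate(. ~ GENE_ID, data = data, FUN = mean)

print(gene_means)
```

Both calls should return one row per gene with a mean for each replicate column. The conversion step matters because `mean` does not average character strings.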

1 Answer


I think I understand what you're trying to do, but I would be more certain with a reproducible copy of the data.

Using the dplyr package:

First, we rename the V5 column to Gene to clean up the data slightly.

Then we convert all the columns starting with "s_MC13" to numeric, since they currently look like character strings.

Lastly, we group_by the gene and use summarise_at, which applies the mean function to each of those columns, so you get one mean per gene for each column.

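library(dplyr)

# Assumes `data` is a data frame whose fifth column is named V5
data_averages <- data %>%
    rename(Gene = V5) %>%
    # pass the function directly; funs() is deprecated as of dplyr 0.8.0
    mutate_at(vars(starts_with("s_MC13")), as.numeric) %>%
    group_by(Gene) %>%
    # one mean per gene for each replicate column
    summarise_at(vars(starts_with("s_MC13")), mean)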
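To try the pipeline without the full data set, here is a minimal sketch built from the six rows shown in the question. It assumes the data is a character matrix whose fifth column has no name (which is what the printed output suggests), so it converts the matrix to a data frame and names that column `Gene` directly. The object names `df` and `data_averages` are only illustrative.

```r
library(dplyr)

# Hypothetical reconstruction of the question's first six rows as a
# character matrix; the fifth (gene) column is left unnamed.
data <- cbind(
  s_MC13_B1_Cd.Ni = c("3.32737034165695", "2.99677521546021", "3.22057255739705",
                      "3.17934205285383", "3.20285957796094", "3.16945922109561"),
  s_MC13_B2_Cd.Ni = c("3.30082063716602", "2.97525202994054", "3.24001410852619",
                      "3.22237873890637", "3.22659173854477", "3.1672329312015"),
  s_MC13_B3_Cd.Ni = c("3.35288781669471", "3.01357652548303", "3.19806113996704",
                      "3.20299332433795", "3.22878128735342", "3.17366131274743"),
  s_MC13_B4_Cd.Ni = c("3.28130201442409", "2.98091704146676", "3.17850023932788",
                      "3.19533925098426", "3.21307289097597", "3.18792397254863"),
  c("GENE:JGI_V11_100009", "GENE:JGI_V11_100009", "GENE:JGI_V11_100036",
    "GENE:JGI_V11_100036", "GENE:JGI_V11_100036", "GENE:JGI_V11_100044")
)

# dplyr verbs need a data frame, so convert the matrix and name the gene column.
df <- as.data.frame(data, stringsAsFactors = FALSE)
names(df)[5] <- "Gene"

data_averages <- df %>%
  # character -> numeric for every replicate column
  mutate_at(vars(starts_with("s_MC13")), as.numeric) %>%
  group_by(Gene) %>%
  # mean of each replicate column within each gene
  summarise_at(vars(starts_with("s_MC13")), mean)

print(data_averages)
```

The result should have one row per gene (three here) and one averaged column per replicate. For GENE:JGI_V11_100044, which has a single probe, each mean is simply that probe's value.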
Matt W.