0

I have a csv file with a data set of experimental values of many samples, and sometimes replicates of the same sample. For the replicates I only take into account the mean value of the replicates belonging to the same sample. The problem is, the number of replicates varies, it can be 2, 3, 4 etc...

My code isn't right, because it should be only working if replicates number is 2 (since I am using a loop to compare one sampleID to the previous sampleID in the loop). Plus, my code doesn't work, it adds the same average value to all my samples, which is not right. I think there is a problem at the start of the loop too. Because when x=1, x-1=0 which doesn't correspond to any value, so that may cause the code to not work? I am a beginner in R, I never had any courses or training I am training to learn it by myself, so thank you in advance for your help.

My dataset:

enter image description here

Expected output:

enter image description here

PS: in this example the replicates number is 2. However, it can be different depending on samples, sometimes its 2, sometimes 3, 4 etc...

for (x in length(dat$Sample)){
  if (dat$Sample[x]==dat$Sample[x-1]){
    dat$Average.OD[x-1] <- mean(dat$OD[x], dat$OD[x-1])
    dat$Average.OD[x] <- NA
  }
}
  • Hi Amelie, can you attach a sample data-set you are working on and the final output you are expecting? – ealbsho93 Jul 19 '19 at 18:35
  • `for (x in length(dat$Sample))` only loops once, when `x` is equal to `length(dat$Sample)`, it should be `2:length(dat$Sample)`. Also: see if [this](https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame) is what you want. – Rui Barradas Jul 19 '19 at 18:38
  • Try `ave(dat$Sample, dat$Sample, FUN = mean, na.rm = TRUE)` or `tapply(dat$Sample, dat$Sample, FUN = mean, na.rm = TRUE)` – Rui Barradas Jul 19 '19 at 18:41

2 Answers2

1

You can do this without any looping using aggregate and merge. Since you do not provide any data, I illustrate with a simple example.

## Example data
set.seed(123)
Sample = round(runif(10), 1)
OD = sample(4, 10, replace=T)
dat = data.frame(OD, Sample)

Means = aggregate(dat$Sample, list(dat$OD), mean, na.rm=T)
names(Means) = c("OD", "mean")
Means
  OD      mean
1  1 0.9000000
2  2 0.7000000
3  3 0.3666667
4  4 0.4000000

merge(dat, Means, "OD")
   OD Sample      mean
1   1    0.9 0.9000000
2   1    0.9 0.9000000
3   2    0.8 0.7000000
4   2    0.9 0.7000000
5   2    0.4 0.7000000
6   3    0.0 0.3666667
7   3    0.6 0.3666667
8   3    0.5 0.3666667
9   4    0.3 0.4000000
10  4    0.5 0.4000000
G5W
  • 36,531
  • 10
  • 47
  • 80
  • Thank you for your reply. I had to do some adjustments but your code works. `Means <- aggregate(dat$OD, list(dat$Sample), mean, na.rm = T) names(Means) = c("Sample", "Mean_OD") merge(dat, Means, by = "Sample")` As for merge, it only prints out the values on the R console space. I would like the values to be added to a column that already exists in my data frame. Is that possible? If yes which function should I use? – Armelle Quinn Jul 22 '19 at 13:51
1

Let me show you the possible solution by data.table.

#Data
data <- data.frame('Sample'=c('Blank','Blank','STD1','STD1'), 
                             'OD'=c(0.07,0.08,0.09,0.10))

#Code
#Converting our data to data.table.
setDT(data)

#Finding the average of OD by Sample Column. Here Sample Column is the key.If you want it by both Sample and Replicates, pass both of them in by and so on.
data[, AverageOD := mean(OD), by = c("Sample")]

#Turning all the duplicate AverageOD values to NA.
data[duplicated(data, by = c("Sample")), AverageOD := NA] 

#Turning column name of AverageOD to  Average OD
names(data)[which(names(data) == "AverageOD")] = 'Average OD'

Let me know if you have any questions.

ealbsho93
  • 141
  • 4
  • This works perfectly, thank you! However, I already have a column in my csv file for the Average_OD. I tried to use your code by changing "AverageOD" by "Average OD" which is the name of the column I have in my file and this error message pops out : `Error in `[.data.table`(data, , `:=`(Average OD, mean(OD)), by = c("Sample")) : Type of RHS ('double') must match LHS ('logical'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)` – Armelle Quinn Jul 23 '19 at 09:15
  • I managed to avoid the error above by just leaving AverageOD, which adds me a new column, then I replace the column I have in my file using `data$Average OD <- data$AverageOD` , then I delete the new column using `data$AverageOD <- NULL`. Is this a "correct" way to handle this kind of situation? – Armelle Quinn Jul 23 '19 at 09:23
  • Also, could you please explain why did you use ":=" and not = for example? I can't find much info about its use. Thank you. – Armelle Quinn Jul 23 '19 at 14:08
  • Hi Amelle, I have attached the updated code to assign Average OD as the column name. := is a specific operator which we can use to create a new column when aggregating values in data.table by group. Please look at data.table cheat sheet for more uses. Meanwhile if this solution works for you, please acknowledge the answer. Thanks. – ealbsho93 Jul 23 '19 at 17:14
  • Hello @ealbsho93 , Is it possible instead of introducing in new column (AverageOD) to just add the calculated means to a one that already exists in the dataset? Thank you for your help. – Armelle Quinn Jul 24 '19 at 11:36
  • Hi Armelle,, It might take more time to lookup this than the original solution.:) let me know if its absolutely necessary. – ealbsho93 Jul 24 '19 at 20:05
  • Actually it is necessary, because I work on a database that is being updated, so the previous sample already have the mean calculated and listed in a column. Please let me know if you think of something. Thank you – Armelle Quinn Jul 26 '19 at 15:58