I have this table:

3702    GO:0009611  0.682
3711    GO:0009611  35.418
4081    GO:0009611  18.072
3702    GO:0033554  0.400
3702    GO:0006812  0.378
3702    GO:0006412  0.373
3702    GO:0009058  0.346
3702    GO:0051641  0.312
29760   GO:0009611  28.697

I don't care about the first column. Column 2 has some repeated values. What I'd like to get is a data.frame whose first column holds each distinct value from column 2 of my initial table, and whose second column holds the mean of the corresponding column 3 values.

Something like:

GO:0051179  1.7398
GO:0016311  2.1595
GO:0010467  1.45633
GO:0044093  15.483
GO:0006811  2.4175
GO:0044238  0.927667
GO:0006812  3.0138
GO:0006807  1.048

In fact, I've got this output using awk:

awk '{print $2"\t"$3}' BP.txt | awk '{hash1[$1]+=$2} ; {hash2[$1]+=1} END {for (x in hash1) {print x"\t"hash1[x]/hash2[x]}}'

but I have no idea how to do this in R.

Brian Tompsett - 汤莱恩
user2979409

4 Answers

Just use tapply. So if you had a data frame dd, with three columns V1, V2 and V3, then

tapply(dd$V3, dd$V2, mean)

would give you what you want.
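
As a minimal sketch of the above, assuming the table is loaded with read.table (which names the columns V1, V2 and V3 by default), the sample rows from the question could be aggregated as follows. Note that tapply returns a named vector rather than a data.frame, so the last step converts it; the column names GO and mean_value are only illustrative and do not come from the question.

```r
# Sample rows from the question; read.table names the columns V1, V2, V3.
dd <- read.table(text = "
3702    GO:0009611  0.682
3711    GO:0009611  35.418
4081    GO:0009611  18.072
3702    GO:0033554  0.400
3702    GO:0006812  0.378
3702    GO:0006412  0.373
3702    GO:0009058  0.346
3702    GO:0051641  0.312
29760   GO:0009611  28.697
", stringsAsFactors = FALSE)

# Mean of V3 for each distinct value of V2 (a named numeric vector).
means <- tapply(dd$V3, dd$V2, mean)

# Convert to a two-column data.frame; these column names are illustrative.
result <- data.frame(GO = names(means), mean_value = as.vector(means),
                     stringsAsFactors = FALSE)
print(result)
```

For a real file, the text = argument would be replaced by the file name (BP.txt in the question's awk command).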

csgillespie

You could use data.table. If df is your data.frame, do the following:

library(data.table) ## 1.9.2+
dt <- as.data.table(df)
dt <- dt[, list(col = mean(col3)), by = col2]
Manoj G

An alternative to the tapply approach from @csgillespie is the by function:

by(dd$V3, dd$V2, mean)
Jaap

Or just use the good old aggregate (assuming temp is your data set):

aggregate(V3 ~ V2, temp, mean)
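
Unlike tapply and by, aggregate returns a data.frame directly, which is the shape the question asks for. A small sketch, assuming a data set named temp with columns V1, V2 and V3 as in the answer, built inline here from a few of the question's rows:

```r
# A few rows from the question; temp, V2 and V3 follow the answer's naming.
temp <- data.frame(
  V1 = c(3702, 3711, 4081, 3702),
  V2 = c("GO:0009611", "GO:0009611", "GO:0009611", "GO:0033554"),
  V3 = c(0.682, 35.418, 18.072, 0.400)
)

# One row per distinct V2, with the mean of V3 in the second column.
res <- aggregate(V3 ~ V2, data = temp, FUN = mean)
print(res)
```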
David Arenburg