Averaging Duplicate Values in an R data frame

Question

I have a data frame named ColorMap in which I want to average all numeric values that correspond to the same feature (further explanation below). Here is the data frame:

> ColorMap
    KEGGnumber Colors
1   c("C00489"  0.162
2     "C06104"  0.162
3    "C02656")  0.162
4       C00163 -0.173
5   c("C02656" -0.140
6     "C00036" -0.140
7     "C00232" -0.140
8     "C01571" -0.140
9    "C00422") -0.140
10  c("C00402"  0.147
11    "C06664"  0.147
12    "C06687"  0.147
13   "C02059")  0.147
14  c("C00246"  0.069
15   "C00902")  0.069
**16      C00033  0.011
...
25      C00033 -0.073**
26      C00048  0.259
**27  c("C00803"  0.063
...
37      C00803 -0.200
38      C00803 -0.170**
39  c("C00164" -0.020
40    "C01712" -0.020
...
165 c("C00246"  0.076
166  "C00902")  0.076
**167     C00163 -0.063
...
169     C00163  0.046**
170 c("C00058" -0.208
171  "C00036") -0.208
172     C00121 -0.178
173     C00033 -0.193
174     C00163 -0.085

I would like the final result to look something like this:

> ColorMap
    KEGGnumber Colors
1      C00489   0.162
2      C06104   0.162
3      C02656   0.162
4      C00163  -0.173
5      C02656  -0.140
6      C00036  -0.140
7      C00232  -0.140
8      C01571  -0.140
9      C00422  -0.140
10     C00402   0.147
11     C06664   0.147
12     C06687   0.147
13     C02059   0.147
14     C00246   0.069
15     C00902   0.069
**16   C00033   0.031**
26     C00048   0.259
**27   C00803  -0.100**
39     C00164  -0.020
40     C01712  -0.020
...
165    C00246   0.076
166    C00902   0.076
**167  C00163   0.0085**
170    C00058  -0.208
171    C00036  -0.208
172    C00121  -0.178
173    C00033  -0.193
174    C00163  -0.085

The duplicates do not need to be next to each other; I simply chose those rows for easy visualization. I would like the mean of all Colors values for each KEGGnumber, so that each KEGGnumber appears only once, with no duplicates.

You should be concerned about the first column of the original data. Looks like it didn't parse correctly when you read it. — Rich Scriven, Aug 12 '16 at 23:08
Yes I know, however that can easily be corrected via bash regex and I am not as concerned with that. If you have a solution in r, however, I would love to hear it. — Zach Eisner, Aug 12 '16 at 23:11
Well, if you can take care of cleaning up the first column, the rest is a dupe [of this r-faq](http://stackoverflow.com/q/11562656/903061) — Gregor Thomas, Aug 12 '16 at 23:14
See `base::gsub` or `stringr::str_extract` depending on if you want to replace the bad with `''` or extract the good. — Gregor Thomas, Aug 12 '16 at 23:28

shayaa · Accepted Answer · 2016-08-12T23:42:53.023

You can clean that column using

library(stringr)
ColorMap$KEGGnumber <- str_extract(ColorMap$KEGGnumber, "[C][0-9]+")

The pattern argument takes a regular expression. This one is simple: it matches the capital letter C followed by one or more digits, so the surrounding c( ), quotes, and parentheses are dropped.
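To see what the pattern keeps, here is a minimal sketch in base R that applies the same idea to a few raw strings copied from the question. It uses regmatches and regexpr instead of str_extract, so it runs without stringr; the vector name raw is only illustrative. One difference to keep in mind: regmatches silently drops elements with no match, whereas str_extract returns NA for them.

```r
# A few raw KEGGnumber strings as they appear in the question's printout
raw <- c('c("C00489"', '"C06104"', '"C02656")', 'C00163')

# Locate the first run of "C" followed by one or more digits in each string,
# then pull out exactly that matched text
cleaned <- regmatches(raw, regexpr("C[0-9]+", raw))

print(cleaned)
```

Because non-matching elements are dropped, assigning the result back into a data frame column is only safe when every row contains an identifier.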

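Afterwards, group by KEGGnumber with dplyr and take the mean of Colors within each group. Naming the summary column keeps it as Colors instead of `mean(Colors)`:

library(dplyr)
ColorMap %>% group_by(KEGGnumber) %>% summarize(Colors = mean(Colors))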
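For comparison, the same grouping step can be sketched in base R with aggregate. This is an alternative to the dplyr call, not the answer's method, and the data frame toy below is a hypothetical stand-in built from a handful of values visible in the question, not the full ColorMap.

```r
# Hypothetical toy data: a few cleaned rows, some KEGG numbers repeated
toy <- data.frame(
  KEGGnumber = c("C00033", "C00033", "C00048", "C00163", "C00163"),
  Colors     = c(0.011, -0.073, 0.259, -0.063, 0.046),
  stringsAsFactors = FALSE
)

# One row per KEGGnumber, with Colors replaced by the group mean
averaged <- aggregate(Colors ~ KEGGnumber, data = toy, FUN = mean)

print(averaged)
```

The result has one row per distinct KEGGnumber, sorted by identifier, which matches the "no duplicates" requirement in the question.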
