
I have a rather small dataset of three columns (id, date, and distance) in which some dates are duplicated (the rest are unique) because a second distance value is associated with that date.

For those duplicated dates, how do I average the distances and then replace the original distances with the averages?

Let's use this dataset as the model:

z <- data.frame(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z
#   id var
# 1  1   2
# 2  1   4
# 3  2   1
# 4  2   3
# 5  3   5
# 6  4   2

The mean for id 1 is 3 and for id 2 is 2; these means would then replace each of the original var values.

I've looked through several related questions and discussions, and here is what I have so far:

# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be 
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
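
Applied to the model dataset z above, that same pattern looks like this (a sketch; note that ave's FUN argument defaults to mean):

z$var <- ave(z$var, z$id)  # averages var within each id (FUN defaults to mean)
z <- unique(z)             # drop the rows made identical by the averaging
z
#   id var
# 1  1   3
# 3  2   2
# 5  3   5
# 6  4   2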

Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.

  • You should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). `ave(Rdataset$Distance, Rdataset$Epoch)` should be OK. What's wrong with `ave`? To find duplicates, use the `duplicated` function. – agstudy Jul 03 '13 at 00:51
  • Where is the question? – flodel Jul 03 '13 at 00:55
  • @flodel: The question is: how do I calculate the mean of distances for duplicated dates within a dataset? – remarkableearth Jul 03 '13 at 01:00
  • I have added a reproducible example and clarified some points in the original post. – remarkableearth Jul 03 '13 at 01:15

4 Answers


Here's a solution that doesn't bother to check whether the ids are duplicated; you don't need to, since for a non-duplicated id the mean of its single var value is just that value:

duplicated_ids <- unique(z$id[duplicated(z$id)])  # not needed below; just identifies the repeated ids

library(plyr)
z_deduped <- ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)

Output:

> z_deduped
  id var
1  1   3
2  2   2
3  3   5
4  4   2
Marius
  • Or in base R and much simpler: `aggregate(var ~ id, data=z, FUN=mean)`. Alternatively, to not throw out the duplicate `id` rows, `z$var <- ave(z$var,z$id)` (see the sketch after these comments). – thelatemail Jul 03 '13 at 02:37
  • @thelatemail Since no one else has claimed it, you should probably make that its own answer. – Thomas Jul 03 '13 at 14:43
  • @thelatemail Your script is the best and simplest. As Thomas says, you should make it its own answer so I can check it. Thanks to all - I tried all your methods. – remarkableearth Jul 03 '13 at 16:21
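
For reference, the second one-liner from the comment above keeps all six rows, overwriting var with the per-id means (a quick sketch on the model data):

z$var <- ave(z$var, z$id)  # FUN defaults to mean
z
#   id var
# 1  1   3
# 2  1   3
# 3  2   2
# 4  2   2
# 5  3   5
# 6  4   2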

Unless I misunderstand:

library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1  1    3
# 2  2    2
# 3  3    5
# 4  4    2
Jack Ryan

Here is another answer in data.table style:

library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))

z[, mean(var), by = id]

    id V1
1:  1  3
2:  2  2
3:  3  5
4:  4  2

There is no need to treat unique ids differently from duplicated ones, since the mean of a single value is that value.
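
And if the goal is to replace var in place (as in the question) rather than collapse the table directly, a sketch using data.table's := update by reference:

z[, var := mean(var), by = id]  # overwrite var with the per-id mean, by reference
unique(z)                       # collapse the now-identical duplicate rows
#    id var
# 1:  1  3
# 2:  2  2
# 3:  3  5
# 4:  4  2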

Uwe
Or simply with base R's aggregate:

zt <- aggregate(var ~ id, data = z, mean)
zt
  id var
1  1   3
2  2   2
3  3   5
4  4   2