
I have a rather small dataset of three columns (id, date, and distance) in which some dates are duplicated (the rest are unique) because a second distance value is associated with that date.

For those duplicated dates, how do I average the distances and then replace the original distances with the averages?

Let's use this dataset as the model:

z <- data.frame(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z
#   id var
# 1  1   2
# 2  1   4
# 3  2   1
# 4  2   3
# 5  3   5
# 6  4   2

The mean for id 1 is 3 and for id 2 is 2; these means would then replace each of the original var values.

I've looked through several related questions and discussions, and here is what I have so far:

# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be 
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
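
Applied to the model dataset z above, that same pattern looks like this (a sketch; note that ave's FUN argument defaults to mean):

z$var <- ave(z$var, z$id)  # averages var within each id (FUN defaults to mean)
z <- unique(z)             # drop the rows made identical by the averaging
z
#   id var
# 1  1   3
# 3  2   2
# 5  3   5
# 6  4   2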

Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.

  • You should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). `ave(Rdataset$Distance, Rdataset$Epoch)` should be OK. What's wrong with `ave`? To find duplicates, use the `duplicated` function. – agstudy Jul 03 '13 at 00:51
  • Where is the question? – flodel Jul 03 '13 at 00:55
  • @flodel: The question is: how do I calculate the mean of distances for duplicated dates within a dataset? – remarkableearth Jul 03 '13 at 01:00
  • I have added a reproducible example and clarified some points in the original post. – remarkableearth Jul 03 '13 at 01:15

4 Answers


Here's a solution that doesn't bother to check whether the ids are duplicated; you don't need to, since for a non-duplicated id the mean of its single var value is just that value:

duplicated_ids <- unique(z$id[duplicated(z$id)])  # not needed below; just identifies the repeated ids

library(plyr)
z_deduped <- ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)

Output:

> z_deduped
  id var
1  1   3
2  2   2
3  3   5
4  4   2
Marius
  • Or in base R and much simpler: `aggregate(var ~ id, data=z, FUN=mean)`. Alternatively, to not throw out the duplicate `id` rows, `z$var <- ave(z$var,z$id)` (see the sketch after these comments). – thelatemail Jul 03 '13 at 02:37
  • @thelatemail Since no one else has claimed it, you should probably make that its own answer. – Thomas Jul 03 '13 at 14:43
  • @thelatemail Your script is the best and simplest. As Thomas says, you should make it its own answer so I can check it. Thanks to all - I tried all your methods. – remarkableearth Jul 03 '13 at 16:21
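
For reference, the second one-liner from the comment above keeps all six rows, overwriting var with the per-id means (a quick sketch on the model data):

z$var <- ave(z$var, z$id)  # FUN defaults to mean
z
#   id var
# 1  1   3
# 2  1   3
# 3  2   2
# 4  2   2
# 5  3   5
# 6  4   2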

Unless I misunderstand:

library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1  1    3
# 2  2    2
# 3  3    5
# 4  4    2
Jack Ryan

Here is another answer in data.table style:

library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))

z[, mean(var), by = id]

    id V1
1:  1  3
2:  2  2
3:  3  5
4:  4  2

There is no need to treat unique ids differently from duplicated ones, since the mean of a single value is that value.
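
And if the goal is to replace var in place (as in the question) rather than collapse the table directly, a sketch using data.table's := update by reference:

z[, var := mean(var), by = id]  # overwrite var with the per-id mean, by reference
unique(z)                       # collapse the now-identical duplicate rows
#    id var
# 1:  1  3
# 2:  2  2
# 3:  3  5
# 4:  4  2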

Uwe
Or simply with base R's aggregate:

zt <- aggregate(var ~ id, data = z, mean)
zt
  id var
1  1   3
2  2   2
3  3   5
4  4   2