
I am trying to construct a subset based on unique IDs (a vector of IDs in which each ID appears only once) whose time differences are at least two standard deviations above or below the average time difference. The subset must contain all rows for these IDs and all columns, and it must be ordered by ID.

This is the code I have attempted already:

data <- myDataset[unique(myDataset$ID) & myDataset$timeDiff >= mean_time_diff_rounded + 2 * sd_time_diff_rounded |
                       myDataset$timeDiff <= mean_time_diff_rounded - 2 * sd_time_diff_rounded,]

Usually the unique() function removes repeated values, but it is not working here for some reason, and I am not sure where to go from here. To order the final subset, I assume I would have to use the order() function, but I am not sure how to use it correctly in this context.

Any help would be appreciated!

athena45
    Please read about [how to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and update your question accordingly. Include a sample of your data by pasting the output of `dput()` into your post or `dput(head())` if you have a large data frame. Also include code you have tried, any relevant errors, and expected output. If you cannot post your data, then please post code for creating representative data. – LMc Apr 28 '23 at 19:09

1 Answer

set.seed(42)
myDataset <- data.frame(ID=sample(LETTERS, size=50, replace=TRUE), time=rnorm(50))
mu <- mean(myDataset$time)
sigma <- sd(myDataset$time)
subset(myDataset, time >= (mu+2*sigma) | time <= (mu-2*sigma))
#    ID      time
# 11  X -2.414208
# 31  C -2.993090

Hrmph, that's not many ... I'll use 1*sigma for demonstration here.

subset(myDataset, time >= (mu+sigma) | time <= (mu-sigma))
#    ID       time
# 5   J  1.0351035
# 8   Z -1.7170087
# 11  X -2.4142076
# 17  T -1.3682810
# 20  O  1.4441013
# 25  E  1.5757275
# 31  C -2.9930901
# 36  K  1.3997368
# 38  V  1.3025426
# 40  H  1.0385061
# 43  V -1.0431189
# 46  E -0.9535234
unique(subset(myDataset, time >= (mu+sigma) | time <= (mu-sigma), select=ID))
#    ID
# 5   J
# 8   Z
# 11  X
# 17  T
# 20  O
# 25  E
# 31  C
# 36  K
# 38  V
# 40  H
unique(subset(myDataset, time >= (mu+sigma) | time <= (mu-sigma))$ID)
#  [1] "J" "Z" "X" "T" "O" "E" "C" "K" "V" "H"
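
The question also asks for every row belonging to those IDs, ordered by ID. A minimal sketch of that last step, assuming the same simulated myDataset as above; the names k, outlier, flagged_ids and result are illustrative and do not appear in the question:

```r
set.seed(42)
myDataset <- data.frame(ID = sample(LETTERS, size = 50, replace = TRUE), time = rnorm(50))
mu <- mean(myDataset$time)
sigma <- sd(myDataset$time)

# number of standard deviations; 1 matches the demonstration above,
# set k <- 2 for the threshold asked about in the question
k <- 1

# logical vector: TRUE for rows at least k standard deviations from the mean
outlier <- myDataset$time >= (mu + k * sigma) | myDataset$time <= (mu - k * sigma)

# each ID that has at least one such row, listed once
flagged_ids <- unique(myDataset$ID[outlier])

# keep ALL rows (and all columns) for those IDs, then sort by ID
result <- myDataset[myDataset$ID %in% flagged_ids, ]
result <- result[order(result$ID), ]
```

Here %in% produces a logical vector with one entry per row, which is what the row index of [ , ] expects. unique(myDataset$ID) returns the distinct values instead, so it cannot be combined with & as a row filter. Note also that & binds more tightly than |, so mixed conditions are safer with explicit parentheses.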
r2evans