1

I have a dataframe which is like the followng:

dat <- data.frame(participant = c(rep("01", times = 3), rep("02", times = 3)),
                  target = rep(c("1", "2", "3"), times = 2),
                  eucDist = c(0.06, 0.16, 0.89, 0.10, 0.11, 0.75),
                  eucDist2 = c(0.09, 0.04, 0.03, 0.05, 1.45, 0.09))

   participant target eucDist eucDist2
1          01      1    0.06     0.09
2          01      2    0.16     0.04
3          01      3    0.89     0.03
4          02      1    0.10     0.05
5          02      2    0.11     1.45
6          02      3    0.75     0.09

I have run some code to identify outliers in the eucDist and eucDist2 columns, which I have saved in separate dataframes. Examples of these can be seen here:

outliers1 <- data.frame(participant = c("01", "02"),
              target = c("1", "3"),
              eucDist = c(0.06, 0.75),
              eucDist2 = c(0.09, 0.09))

   participant target eucDist eucDist2
1          01      1    0.06     0.09
2          02      3    0.75     0.09

outliers2 <- data.frame(participant = "01",
                        target = "1",
                        eucDist = 0.06,
                        eucDist2 = 0.09)

  participant target eucDist eucDist2
1          01      1    0.06     0.09

The rows shown in Outliers1 indicate outliers in the eucDist column in dat, and the row shown in Outliers2 indicates an outlier in the eucDist2 column.

I would like to replace the outlier values in the eucDist and eucDist2 columns of datwith 'NA'. I do not want to remove whole rows because in many cases either the eucDist or eucDist2 values are usable, and removing the entire row would remove both variables.

Here is what I would like:

    participant target eucDist eucDist2
1          01      1    NA       NA
2          01      2    0.16     0.04
3          01      3    0.89     0.03
4          02      1    0.10     0.05
5          02      2    0.11     1.45
6          02      3    NA       0.09

I have been attempting this using conditional %in% statements, but can't quite get the phrasing correct and would really appreciate some help. Here is my non-working code:

library(naniar)
dat1 <- if (dat$eucDist %in% Outliers1$eucDist) {
  replace_with_na_all(dat$eucDist)
}

3 Answers3

1

As you have the data frames in this format, you can set the values of the required data to NA in the new data frames, and then update the rows of the original data frames with these values using dplyr::rows_update(). This assumes you have at least dplyr v1.0.0.

library(dplyr)

outliers1$eucDist  <- NA
outliers2$eucDist2  <- NA
dat  |>
    rows_update(outliers1, by = c("participant", "target"))  |>
    rows_update(select(outliers2, -eucDist), by = c("participant", "target")) 


#   participant target eucDist eucDist2
# 1          01      1      NA       NA
# 2          01      2    0.16     0.04
# 3          01      3    0.89     0.03
# 4          02      1    0.10     0.05
# 5          02      2    0.11     1.45
# 6          02      3      NA     0.09
SamR
  • 8,826
  • 3
  • 11
  • 33
  • thanks a lot for this, I love the code. However, this is causing me an issue in my actual dataset that was not relevant to the reproducible example I gave! So, I think what is happening is in my actual dataset, I have some rows in the outliers2 dataframe that are outliers for eucDist2 AND eucDist. Your last line of code replaces the NA values in eucDist with the actual value, essentially reversing the previous line of code for these entries. I hope that makes sense. I will update the example accordingly to help explain. – milsandhills Jul 11 '22 at 15:38
  • 1
    Update: resolved this issue by removing the 'eucDist' column from outliers2 before running your dat |> code onwards and this then worked! Thank you – milsandhills Jul 11 '22 at 15:57
1

I would recommend using case_when function in the dplyr package.

dat_final <- dat %>% 
  mutate(eucDist = case_when(
    eucDist %in% outliers1$eucDist ~ as.numeric(NA),
    T ~ eucDist
  ),
  eucDist2 = case_when(
    eucDist2 %in% outliers2$eucDist2 ~ as.numeric(NA),
    T ~ eucDist2
  ))

   > str(dat_final)
'data.frame':   6 obs. of  4 variables:
 $ participant: chr  "01" "01" "01" "02" ...
 $ target     : chr  "1" "2" "3" "1" ...
 $ eucDist    : num  NA 0.16 0.89 0.1 0.11 NA
 $ eucDist2   : num  0.09 0.04 0.03 0.05 NA 0.09
  • Nice idea but have you put `"NA"` in quotes because `NA` or `NA_real_` generates an error? The problem as that you are coercing to a character vector so you can't do numerical operations (at least without converting back which can cause errors). – SamR Jul 11 '22 at 14:18
  • @SamR A nice explanation from https://stackoverflow.com/questions/44893933/avoiding-type-conflicts-with-dplyrcase-when. I updated the codes so the two new columns are numeric values. – Chemist learns to code Jul 11 '22 at 14:30
  • This is definitely better. I still prefer my answer though! I have a feeling that comparing the numerical values of the outliers, rather than of participant and target could lead to false negatives because of floating point errors, depending how the original data frames were calculated. – SamR Jul 11 '22 at 14:33
0

rbind the two outlier frames while adding column number of respective eucDist. Then run a short and sweet for loop.

outliers <- rbind(cbind(outliers1[1:2], j=3), cbind(outliers2[1:2], j=4))

for (i in seq_len(nrow(outliers))) {
  dat[dat$participant == outliers$participant[i] & dat$target == outliers$target[i], outliers$j[i]] <- NA
}
#   participant target eucDist eucDist2
# 1          01      1      NA     0.09
# 2          01      2    0.16     0.04
# 3          01      3    0.89     0.03
# 4          02      1    0.10     0.05
# 5          02      2    0.11       NA
# 6          02      3      NA     0.09
jay.sf
  • 60,139
  • 8
  • 53
  • 110