I will start by saying that I am fully aware that similar questions have been answered before, but after hours of reading and troubleshooting, I believe I have a unique issue. Apologies if I have missed something. The answer given in the much up-voted similar question points to NAs in the data, but as explained in my question, I do not seem to have any nor do I know where they may be popping up.

I am running a for-loop in R 4.1.2 using the lubridate, readr, and dplyr packages that seeks to mark as invalid any data taken by individuals before they have passed a reliability test. Tests are unique to specific groups, so an individual may be reliable for one group, many, all, or none. The function I've written is meant to take a dataframe "x" and, for each individual observer, check that each data point is valid against a dataframe "key" that has a column of observers (observer), test pass date (begin_valid), and the group they are now valid for (group_valid). The key may have multiple rows per observer if they have passed multiple tests. I've used tools from the lubridate package to create POSIXct values for the dates that can be arithmetically manipulated and compared to each other. The user can set y = "remove" if they want to remove invalid data, or leave it out if they want to label and keep invalid data. Here is the code:

invalidata <- function(x, y){
  library(lubridate)
  library(readr)
  library(dplyr)
  x$valid <- rep(1, length(rownames(x)))
  alts <- 0
  key <- read_csv("updatable csv file")
  key$begin_valid <- parse_date_time(key$begin_valid, c("mdy", "dmy", "ydm", "mdy"), tz= "Africa/Lubumbashi")
  for(i in unique(x$observer)){
    subkey <- subset(key, key$observer == i)
    subx <- subset(x, x$observer == i)
    if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){ #if reliable for nothing, remove
      x[x$observer == i]$valid <- 0
      print("removed completely unreliable")
    }else{
      for(j in rownames(subx)){
        if(subx$group[j] %in% subkey$group_valid == FALSE && "All" %in% subkey$group_valid == FALSE){ #if not reliable for specific group or all groups, remove
          x$valid[j] <- 0
          print("removed unreliable for group")
        } 
        if(subx$group[j] %in% subkey$group_valid){ #remove if before reliability date for group
          if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
            x$valid[j] <- 0
            print("removed pre-reliability")
          }
        } else{ #remove if not reliable for specific group, and before reliability date for all
          if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid){
            x$valid[j] <- 0
            print("removed pre-reliability")
          }
        }
      }
    }
  }
  if(y == "remove"){ #remove all invalid data and validity column
    x <- subset(x, x$valid == 1)
    x <- select(x, -valid)
  }
  return(x)}

My issue is with the line

if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid)

which returns the error:

Error in if (subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid) { : missing value where TRUE/FALSE needed

However, when I run the code inside the parentheses

subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid

outside of the context of the loop, I receive either a TRUE or FALSE value as relevant. I've checked all dates for any NULL or NA values, as well as addressed any data with NAs in a previous step of the code:

if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){}
else{ #code at issue }

I am not having issues with this very similar line:

if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){

My best guess is that something may be going wrong with the date formatting? I know that this error is usually a symptom of NULLs or NAs floating in the data, but for the life of me I cannot figure out where they could be coming from. Dates in "x" have already been parsed and contain no NAs or NULLs. I have not included the data as it is proprietary, but I can come up with mock data if people are interested/think it would be necessary. Thank you in advance for reading through and for any thoughts/troubleshooting suggestions!

MRE:

dput output for x:

structure(list(date = structure(c(1486764000, 1486764000, 1486850400, 
1486936800, 1487023200, 1487109600, 1487109600, 1487196000, 1487196000, 
1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 
1487455200, 1487455200, 1487455200, 1487541600, 1487887200), class = c("POSIXct", 
"POSIXt"), tzone = "Africa/Lubumbashi"), time = structure(c(23734, 
53419, 41352, 33034, 24220, 34812, 35624, 27949, 27950, 49192, 
49286, 49392, 49401, 62719, 62725, 26046, 26047, 27246, 46611, 
61228), class = c("hms", "difftime"), units = "secs"), observer = c("MA", 
"LE", "VI", "VI", "MI", "MA", "MA", "ME", "VI", "BA", "MA", "BA", 
"MA", "ME", "MI", "MA", "BA", "MI", "BA", "MA"), group = c("EKK", 
"EKK", "KKL", "EKK", "KKL", "KKL", "KKL", "EKK", "EKK", "EKK", 
"EKK", "EKK", "EKK", "KKL", "KKL", "EKK", "EKK", "KKL", "EKK", 
"KKL")), row.names = c(NA, -20L), spec = structure(list(cols = list(
    date = structure(list(), class = c("collector_character", 
    "collector")), time = structure(list(format = ""), class = c("collector_time", 
    "collector")), observer = structure(list(), class = c("collector_character", 
    "collector")), group = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001f6f2f7af70>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

for the key:

structure(list(observer = c("BA", "MI", "VI", "ME", "DA", "OK", 
"FR", "MA", "LA", "DE", "JD", "JD", "JD", "BR", "DA", "DA", "PA", 
"PA", "JA", "JE", "DI", "JP", "LE", "MR", "NG", "TR", "TE"), 
    begin_valid = c("8/12/2016", "12/21/2019", "8/11/2016", "8/11/2016", 
    "12/11/2019", "12/17/2019", "12/11/2019", "11/2/2016", "1/11/2020", 
    "12/12/2019", "12/16/2019", "12/16/2019", "11/22/2020", "6/19/2021", 
    "11/26/2020", "11/26/2020", "7/25/2021", "7/25/2021", NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), group_valid = c("All", "All", 
    "All", "All", "All", "All", "FKK", "All", "FKK", "FKK", "EKK", 
    "KKL", "All", "EKK", "EKK", "KKL", "EKK", "KKL", NA, NA, 
    NA, NA, NA, NA, NA, NA, NA), subgroup = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, "S", NA, NA, NA, "S", NA, "N", 
    NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -27L
), spec = structure(list(cols = list(observer = structure(list(), class = c("collector_character", 
"collector")), begin_valid = structure(list(), class = c("collector_character", 
"collector")), group_valid = structure(list(), class = c("collector_character", 
"collector")), subgroup = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))
  • It's difficult to help by guessing; you should give us some toy data to play with, [MRE](https://stackoverflow.com/a/5963610/6574038), you know. – jay.sf Jan 04 '22 at 19:15
  • (1) When using `subset(x, ...)`, don't use `x$`, so `subset(key, key$observer == i)` becomes `subset(key, observer == i)`. It's how subset is supposed to work (and is easier to read, imho). (2) I don't think `subx$group[j]` is going to do what you want, perhaps it always returns `NA`. For instance, `mtcars$disp[ rownames(mtcars)[1] ]` is `NA`, whereas `mtcars$disp[1]` (the intuitive equivalent) is `160`. Having said that, `mtcars[rownames(mtcars)[1],"disp"]` *does* work, is that what you were expecting? Where this breaks your code is in `subx$date[j] < ...`, since `if (NA < 1)` always fails. – r2evans Jan 04 '22 at 19:17
  • (BTW, what makes you think this is associated with `lubridate`?) – r2evans Jan 04 '22 at 19:21
  • (If you fix that and the error persists, @ping me and I can reopen the question.) – r2evans Jan 04 '22 at 19:24
  • Hi @r2evans, thanks for your many comments. (1) Okay, thanks for the comment! I learned it the other way but you're right, that makes more sense. (2) `subx$group[j]` seems to be returning the right values throughout. (3) My association with `lubridate` is because I thought the NAs could be coming from the `parse_date_time` command. (4) I will go through and look with `browser()`, thanks. And try to make an MRE. – Becca Supple Jan 04 '22 at 19:33
  • Odd. For me, `for (j in rownames(mtcars)) if (!is.na(mtcars$disp[j])) print(j)` produces nothing, indicating it is always `NA`. I also tried it with a `tbl_df` (tibble) and a `data.table`, and they all fail. I'm curious what your `subx` looks like. – r2evans Jan 04 '22 at 19:44
  • (FYI, I can reopen when you edit your question.) – r2evans Jan 04 '22 at 19:50
  • @r2evans Looked with `browser()` and couldn't find any NAs. I've added example data in an edit that hopefully can clarify some things. Thanks for your help! – Becca Supple Jan 04 '22 at 20:15
  • How are you calling `invalidata`? What is `y`? `x$date` is not being parsed so it's still `character`, is that true in your data as well? – r2evans Jan 04 '22 at 20:17
  • @jay.sf thanks for your comment. Added MRE. – Becca Supple Jan 04 '22 at 20:17
  • I call `invalidata(x)`, with no value for `y` for now because I don't want it to remove the invalid values right now. – Becca Supple Jan 04 '22 at 20:19
  • Okay. When I run your code and it errs, `i` is `"MA"` and `j` is `"1"`. When I evaluate `subx$group[j]` it is `NA`. If this is not the same for you, then perhaps there is a version difference (though I doubt it, I think this behavior has been in place for a while). – r2evans Jan 04 '22 at 20:22
  • If I fix that problem with `for(j in seq_len(nrow(subx)))`, it then fails because `x[x$observer == i]$valid` is incorrect notation, I think it should be `x$valid[x$observer == i] <- 0`. When I fix *that* error, then it runs without error (and four times prints `"removed pre-reliability"`). – r2evans Jan 04 '22 at 20:26
  • I'm not sure how it's working for me, but when I run `subx$group[j]` with `i <- "MA"` and `j <- 1`, I get the output `[1] "EKK"`. When changing the format to `subx[rownames(subx)[1],"group"]`, I get `# A tibble: 1 x 1 group 1 EKK `. I'm using 4.1.2. – Becca Supple Jan 04 '22 at 20:26
  • ***NO***. `j` is not `1`, it is `"1"`. That's the problem. `rownames(.)` is returning strings. Recommendation: never rely on row names. – r2evans Jan 04 '22 at 20:27
  • Okay, got it. I'm almost entirely self-taught here, so little things come up from time to time that I didn't know, thank you for the kind tip. Making the changes suggested in your previous comment helped! Thanks again for spending some of your day to help me out, it's much appreciated! – Becca Supple Jan 04 '22 at 20:34

1 Answer

Two errors in this code:

  • Because `rownames(.)` returns strings, you cannot use `subx$group[j]`. Two options:

    1. Preferred. Use `for (j in seq_len(nrow(subx)))`, and all of the references work without change.
    2. Keep `for(j in rownames(subx))`, but change all `subx$` references to be akin to `subx[j,"group"]`.
  • `x[x$observer == i]$valid` is wrong code, change to `x$valid[x$observer == i]`.

After those two changes, your code runs without error, and in this example prints `"removed pre-reliability"` four times on the console.
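The two fixes can be sketched in isolation on a small mock data frame. The data below and the `groups_seen` helper are made up for illustration and are not part of the original function; this is only a minimal sketch of the indexing pattern, not a rewrite of `invalidata`.

```r
# Mock data (hypothetical, for illustration only)
x <- data.frame(
  observer = c("MA", "LE", "MA"),
  group    = c("EKK", "EKK", "KKL"),
  valid    = 1,
  stringsAsFactors = FALSE
)

# Fix 2: pick the column first, then the rows.
# x[x$observer == "LE"] with a single index would select columns, not rows.
x$valid[x$observer == "LE"] <- 0

# Fix 1: loop over integer positions instead of row names.
subx <- subset(x, observer == "MA")   # row names here are "1" and "3"
groups_seen <- character(0)           # hypothetical collector for the demo
for (j in seq_len(nrow(subx))) {
  groups_seen <- c(groups_seen, subx$group[j])
}

print(x$valid)
print(groups_seen)
```

Note that in a plain `data.frame` the row names of `subx` are carried over from `x`, so they do not even correspond to positions within `subx`; that is one more reason to loop over `seq_len(nrow(subx))`.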

When troubleshooting, you cannot intermingle `subx$group[1]` and `subx$group["1"]`, they are very different, and the latter (as expected) will produce `NA`.
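A small self-contained demonstration of that difference, using a made-up data frame `df` (not the data from the question): indexing a column by the string `"1"` is a lookup by name on an unnamed vector, so it yields `NA`, and an `NA` condition inside `if` is what raises the error from the question.

```r
# Toy data frame; its default row names are the strings "1", "2", "3"
df <- data.frame(group = c("EKK", "KKL", "EKK"), stringsAsFactors = FALSE)

rn <- rownames(df)          # character vector, not integer
by_pos  <- df$group[1]      # positional index
by_name <- df$group["1"]    # name lookup on an unnamed vector gives NA

print(class(rn))
print(by_pos)
print(by_name)

# Comparing NA gives NA, and if (NA) stops with an error;
# tryCatch captures the message instead of halting the script.
res <- tryCatch(
  if (by_name == "EKK") "match" else "no match",
  error = function(e) conditionMessage(e)
)
print(res)
```

This also explains why the comparison appeared to work when run by hand with `j <- 1`: the manual test used a number, while the loop supplied a string.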

r2evans