I will start by saying that I am fully aware that similar questions have been answered before, but after hours of reading and troubleshooting, I believe I have a unique issue. Apologies if I have missed something. The answer given in the much up-voted similar question points to NAs in the data, but as explained in my question, I do not seem to have any nor do I know where they may be popping up.
I am running a for-loop in R 4.1.2 using the lubridate, readr, and dplyr packages that seeks to mark as invalid data taken by individuals before they have passed a reliability test. Tests are unique to specific groups, so an individual may be reliable for one group, many, all, or none. The function I've written is meant to take a dataframe "x" and for each individual observer, check that the data point is valid against a dataframe "key" that has a column of observers (observer), test pass date (begin_valid), and the group they are now valid for (group_valid). The key may have multiple rows per observer if they have passed multiple tests. I've used tools from the Lubridate package to create POSIXct values for the dates that can be arithmetically manipulated and compared to each other. The user can set y = "remove" if they want to remove invalid data, or leave if they want to label and keep invalid data. Here is the code:
invalidata <- function(x, y){
library(lubridate)
library(readr)
library(dplyr)
x$valid <- rep(1, length(rownames(x)))
alts <- 0
key <- read_csv("updatable csv file")
key$begin_valid <- parse_date_time(key$begin_valid, c("mdy", "dmy", "ydm", "mdy"), tz= "Africa/Lubumbashi")
for(i in unique(x$observer)){
subkey <- subset(key, key$observer == i)
subx <- subset(x, x$observer == i)
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){ #if reliable for nothing, remove
x[x$observer == i]$valid <- 0
print("removed completely unreliable")
}else{
for(j in rownames(subx)){
if(subx$group[j] %in% subkey$group_valid == FALSE && "All" %in% subkey$group_valid == FALSE){ #if not reliable for specific group or all groups, remove
x$valid[j] <- 0
print("removed unreliable for group")
}
if(subx$group[j] %in% subkey$group_valid){ #remove if before reliability date for group
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
} else{ #remove if not reliable for specific group, and before reliability date for all
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid){
x$valid[j] <- 0
print("removed pre-reliability")
}
}
}
}
}
if(y == "remove"){ #remove all invalid data and validity column
x <- subset(x, x$valid == 1)
x <- select(x, -valid)
}
return(x)}
My issue is with the line
if(subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid)
which returns the error:
Error in if (subx$date[j] < subset(subkey, subkey$group_valid == >"All")$begin_valid) { : missing value where TRUE/FALSE needed
However, when I run the code inside the parentheses
subx$date[j] < subset(subkey, subkey$group_valid == "All")$begin_valid
outside of the context of the loop, I receive either a TRUE or FALSE value as relevant. I've checked all dates for any NULL or NA values, as well as addressed any data with NAs in a previous step of the code:
if(is.na(subkey$begin_valid) == TRUE || is.na(subkey$group_valid) == TRUE){}
else{ #code at issue }
I am not having issues with this very similar line:
if(subx$date[j] < subset(subkey, subkey$group_valid == subx$group[j])$begin_valid){
My best guess is that something may be going wrong with the date formatting? I know that this error is usually a symptom of NULLs or NAs floating in the data, but for the life of me I cannot figure out where they could be coming from. Dates in "x" have already been parsed and contain no NAs or NULLs. I have not included the data as it is proprietary, but I can come up with mock data if people are interested/think it would be necessary. Thank you in advance for reading through and for any thoughts/troubleshooting suggestions!
MRE:
dput output for x:
structure(list(date = structure(c(1486764000, 1486764000, 1486850400,
1486936800, 1487023200, 1487109600, 1487109600, 1487196000, 1487196000,
1487368800, 1487368800, 1487368800, 1487368800, 1487368800, 1487368800,
1487455200, 1487455200, 1487455200, 1487541600, 1487887200), class = c("POSIXct",
"POSIXt"), tzone = "Africa/Lubumbashi"), time = structure(c(23734,
53419, 41352, 33034, 24220, 34812, 35624, 27949, 27950, 49192,
49286, 49392, 49401, 62719, 62725, 26046, 26047, 27246, 46611,
61228), class = c("hms", "difftime"), units = "secs"), observer = c("MA",
"LE", "VI", "VI", "MI", "MA", "MA", "ME", "VI", "BA", "MA", "BA",
"MA", "ME", "MI", "MA", "BA", "MI", "BA", "MA"), group = c("EKK",
"EKK", "KKL", "EKK", "KKL", "KKL", "KKL", "EKK", "EKK", "EKK",
"EKK", "EKK", "EKK", "KKL", "KKL", "EKK", "EKK", "KKL", "EKK",
"KKL")), row.names = c(NA, -20L), spec = structure(list(cols = list(
date = structure(list(), class = c("collector_character",
"collector")), time = structure(list(format = ""), class = c("collector_time",
"collector")), observer = structure(list(), class = c("collector_character",
"collector")), group = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001f6f2f7af70>, class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))
for the key:
structure(list(observer = c("BA", "MI", "VI", "ME", "DA", "OK",
"FR", "MA", "LA", "DE", "JD", "JD", "JD", "BR", "DA", "DA", "PA",
"PA", "JA", "JE", "DI", "JP", "LE", "MR", "NG", "TR", "TE"),
begin_valid = c("8/12/2016", "12/21/2019", "8/11/2016", "8/11/2016",
"12/11/2019", "12/17/2019", "12/11/2019", "11/2/2016", "1/11/2020",
"12/12/2019", "12/16/2019", "12/16/2019", "11/22/2020", "6/19/2021",
"11/26/2020", "11/26/2020", "7/25/2021", "7/25/2021", NA,
NA, NA, NA, NA, NA, NA, NA, NA), group_valid = c("All", "All",
"All", "All", "All", "All", "FKK", "All", "FKK", "FKK", "EKK",
"KKL", "All", "EKK", "EKK", "KKL", "EKK", "KKL", NA, NA,
NA, NA, NA, NA, NA, NA, NA), subgroup = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, "S", NA, NA, NA, "S", NA, "N",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, -27L
), spec = structure(list(cols = list(observer = structure(list(), class = c("collector_character",
"collector")), begin_valid = structure(list(), class = c("collector_character",
"collector")), group_valid = structure(list(), class = c("collector_character",
"collector")), subgroup = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"))