I have various .csv files. Each file has multiple columns. I need to do a quality check using a formula (i.e., median-5IQR and median+5IQR) (where IQR is interquartile range). As per the formula, for a particular column, rows which have values below 'median-5IQR' and beyond 'median+5IQR', should be labelled as 'invalid' and those within range should be labelled as 'valid'.
I am able to find IQR and limits for a csv file using the given code in R:
setwd("D:/Test")
file <- read.csv("D:/Test/Data1.csv")
summary(file)
Q <- IQR(file$Press, na.rm = TRUE) #Press is name of a column
M <- median(file$Press, na.rm = TRUE)
IQR_low <- M-5*Q
IQR_high <- M+5*Q
I do not know how to proceed further to set these limits (IQR_low and IQR_high) on my data to achieve the desired goal. Also, I want to do it for all the columns and for all the csv files together.
Here is another code that I am using successfully on all csv files for all columns for another quality check (to find number of valid and null values). I am not able to adapt the following code according to my present goal. Could anyone please help me in modifying the existing code given below to achieve the objective defined above.
library(readr)
setwd("D:/Test/")
for(files in list.files(pattern=".*csv$")) {
file <- read_csv(files)
cat("\n",files,"\n")
sapply(colnames(file), function(x) {
cat(
paste0(x, ":"),
sum(!is.na(file[, x])),
"valid,",
sum(is.na(file[, x])),
"null\n")
})
}
My sample data:
D_T Temp (°C) Press (Pa) Vel (m/s) ...
2021-03-01 00:00:00+00 28 1018 110 ...
2021-03-02 00:00:00+00 27 1017 111 ...
2021-03-03 00:00:00+00 28 1019 110 ...
..
..
Edit: I tried to read multiple csv files into a single dataframe using following code:
set.seed(1234)
install.packages("data.table")
library(data.table)
df <- list.files(path = "D:/Test/", pattern = '.csv') %>%
map_df(~fread(.))
df
dt <- data.frame(df)
res_dt1 <- data.frame(matrix(ncol = ncol(dt1), nrow = nrow(dt1)))
colnames(res_dt1) <- paste0(colnames(dt1), "_result")
for(files in dt1) {
for (i in 1:ncol(dt1)) {
Q <- IQR(dt1[[i]], na.rm = TRUE)
M <- median(dt1[[i]], na.rm = TRUE)
result <- ifelse(
test = dt1[[i]] < M - 3 * Q | dt1[[i]] > M + 3 * Q, # the test condition
yes = "Valid", # the value of result if test is true
no = "Invalid" # value of result if test is false
)
res_dt1[[i]] <- result
}
}
table(res_dt1$df_result)
But it is showing < table of extent 0 > I am naive in R. Could anyone please help me in progressing with nested loops as suggested by Reed.