1

I have various .csv files. Each file has multiple columns. I need to do a quality check using a formula (i.e., median-5IQR and median+5IQR) (where IQR is interquartile range). As per the formula, for a particular column, rows which have values below 'median-5IQR' and beyond 'median+5IQR', should be labelled as 'invalid' and those within range should be labelled as 'valid'.

I am able to find IQR and limits for a csv file using the given code in R:

setwd("D:/Test")
file <- read.csv("D:/Test/Data1.csv")
summary(file)
Q <- IQR(file$Press, na.rm = TRUE)     #Press is name of a column
M <- median(file$Press, na.rm = TRUE)
IQR_low <- M-5*Q
IQR_high <- M+5*Q

I do not know how to proceed further to set these limits (IQR_low and IQR_high) on my data to achieve the desired goal. Also, I want to do it for all the columns and for all the csv files together.

Here is another code that I am using successfully on all csv files for all columns for another quality check (to find number of valid and null values). I am not able to adapt the following code according to my present goal. Could anyone please help me in modifying the existing code given below to achieve the objective defined above.

library(readr)    

setwd("D:/Test/")

for(files in list.files(pattern=".*csv$")) {
  file <- read_csv(files)
  cat("\n",files,"\n")
  sapply(colnames(file), function(x) {
    cat(
      paste0(x, ":"),
      sum(!is.na(file[, x])),
      "valid,",
      sum(is.na(file[, x])),
      "null\n")
  })
}

My sample data:

D_T  Temp (°C)  Press (Pa)  Vel (m/s)  ...
2021-03-01 00:00:00+00  28  1018  110  ...
2021-03-02 00:00:00+00  27  1017  111  ...
2021-03-03 00:00:00+00  28  1019  110  ...
..
.. 

Edit: I tried to read multiple csv files into a single dataframe using following code:

set.seed(1234)

install.packages("data.table")
library(data.table)

df <- list.files(path = "D:/Test/", pattern = '.csv') %>%
  map_df(~fread(.))
df

dt <- data.frame(df)

res_dt1 <- data.frame(matrix(ncol = ncol(dt1),  nrow = nrow(dt1)))

colnames(res_dt1) <- paste0(colnames(dt1), "_result")

for(files in dt1) {
for (i in 1:ncol(dt1)) {

Q <- IQR(dt1[[i]], na.rm = TRUE)
M <- median(dt1[[i]], na.rm = TRUE)

 result <- ifelse(
    test = dt1[[i]] < M - 3 * Q | dt1[[i]] > M + 3 * Q, # the test condition
    yes = "Valid", # the value of result if test is true
    no = "Invalid" # value of result if test is false
  )
  res_dt1[[i]] <- result
}

}

table(res_dt1$df_result)

But it is showing < table of extent 0 > I am naive in R. Could anyone please help me in progressing with nested loops as suggested by Reed.

  • Read my [previous comment](https://stackoverflow.com/questions/76632850/how-to-process-multiple-csv-files-for-identifying-null-values-in-r#comment135111646_76632850) to your now three-related-question streak. Once there, learn how to do it for _one_ file and then iterate that process using `lapply` or similar. The premise of 'list of tables' is good to understand and finesse. – r2evans Jul 09 '23 at 18:18
  • @r2evans Thanks. I tried lapply but no success till now. – Michael_Brun Jul 09 '23 at 22:02

1 Answers1

2

Just for the purpose of exposition, I'll first create some toy data and save it as CSV files.

set.seed(1234)
x1 <- rnorm(n = 200, mean = -300, sd = 46)
y1 <- rnorm(n = 200, mean = 4, sd = 3.2)

x2 <- rnorm(n = 140, mean = 0, sd = 4.5)
y2 <- rnorm(n = 140, mean = 54, sd = 3)
z2 <- rnorm(n = 140, mean = 10, sd = 12)

dt1 <- data.frame(x1, y1)
dt2 <- data.frame(x2, y2, z2)

readr::write_csv(dt1, "files/dt1.csv")
readr::write_csv(dt2, "files/dt2.csv")

# For completeness, clear your environment of existing objects so that reading in the data from the CSVs is meaningful.
rm(list = ls())

Now, we read in the data as a list of data frames.

# get a list of file names and their relative paths from the directory they are saved in. These paths are relative to the path of your working directory.
    # I would recommend setting to your working directory to one level above the directory of the CSV files.
    # Note, if your directory of CSV files has sub-directories that contain files you want, then add "recursive = TRUE" to the below command.
files_list <- list.files(path = "files", pattern = "*.csv", full.names = TRUE)

# read in the data
dt_list <- lapply(files_list, readr::read_csv)

# initialize a list of data frames to store the results.  
    # This generates a set of data frames with matching dimensions to the ones in dt_list.
res_list <- list()
for (i in 1:length(dt_list)) {
    res_list[[i]] <- assign(
        x = paste0("res_df", i),
        value = data.frame(
            matrix(
                ncol = ncol(dt_list[[i]]),
                nrow = nrow(dt_list[[i]])
            )
        )
    )
}

Finally, we use two nested for-loops to complete your test and assign the results to the results data frames in res_list.

# first, loop over the list of data frames
for (i in 1:length(dt_list)) {

    # next, loop over the columns of the ith data frame, performing the test on each
    for (j in 1:ncol(dt_list[[i]])) {

        # get the bounds of your test
        Q <- IQR(dt_list[[i]][[j]], na.rm = TRUE) # this is indexing the jth column of the ith data frame. 
        M <- median(dt_list[[i]][[j]], na.rm = TRUE)

        # set an ifelse statement to categorize the values of each columns according to the test 
        result <- ifelse(
            test = dt_list[[i]][[j]] < M - 5 / Q | dt_list[[i]][[j]] > M + 5 / Q, # the test condition
            yes = "Valid", # The value of the result if the above test is "TRUE"
            no = "Invalid" # The value if the test is "FALSE"
        )
        res_list[[i]][[j]] <- result
    }
}

# View the result, of, for example, the second column of the first CSV file in files_list.
res_list[[1]][[2]]

This should produce the desired output. Keep in mind that for large amounts of data this could take a while to run. I used for loops here because I think they are more intuitive than the apply family of functions when doing more complex indexing like this, which makes them a good place to get started with programming a solution. That being said, one downside is that on very large amounts of data you'll probably notice a big boost in processing speed if you implemented something similar to this using purrr or the apply families of functions because they are vectorized.

Reed Merrill
  • 117
  • 1
  • 10
  • Thanks Reed for your response. Could you please modify your code to show me how to do it for multiple files. I am naive in R and not proficient while dealing with multiple files and for loops. – Michael_Brun Jul 09 '23 at 13:44
  • No problem. I just expanded the answer to work with a directory of multiple CSV files. – Reed Merrill Jul 10 '23 at 02:56
  • Thank you very much Reed. Could you please tell me how I can export the results of all the csv files to one log file. – Michael_Brun Jul 10 '23 at 16:24
  • You're welcome. Exporting these results would depend on the format you're looking for, and the format of the original data. Do they all have the same columns? If so, this question should help: https://stackoverflow.com/questions/26967727/how-to-write-multiple-dataframe-to-a-single-csv-file-in-a-loop-in-r – Reed Merrill Jul 10 '23 at 21:34
  • The input files are in .csv but I want the results to be stored in a log file. The number of columns vary in input files. – Michael_Brun Jul 10 '23 at 22:31
  • Sorry, but I'm no sure what you mean. Putting multiple tables into a single file would require some sort of structure that isn't common in most file types, and there are many types of log files. I'd try using an excel package for R that puts each table in a separate sheet, but I think that's beyond the scope of this question. – Reed Merrill Jul 11 '23 at 01:08
  • 1
    Thanks, but I do not need excel sheet, so please do not invest your time in it. I understand it is beyond the scope of this question. I need a log file containing all the results, e.g. Table 1 containing results of all columns of File 1, then Table 2 containing results of File 1.. and so on. I will look for it. If I need your help, I will contact you here. Thank you very much. – Michael_Brun Jul 11 '23 at 19:25