
I have various .csv files, each with multiple columns. I am using the code below in R as a quality check: for a particular column, it counts how many rows have valid values and how many are null. The code works well for a single CSV file, but I want to run it on all the CSV files and get output for each one. Additionally, I want a log file. Could anyone please help me modify the code so it can process multiple CSV files?

install.packages("readr") 
library(readr)

# Count valid (non-NA) and null (NA) entries in one column of a data frame
check_column <- function(df, column) {
  valid_values <- !is.na(df[[column]])
  num_valid <- sum(valid_values)
  num_null <- nrow(df) - num_valid
  return(c(num_valid, num_null))
}

# Read a single CSV file
df <- read_csv("data.csv")

# Report valid/null counts for every column
for (column in names(df)) {
  results <- check_column(df, column)
  print(paste0(column, ": ", results[1], " valid, ", results[2], " null"))
}

Sample data (not all files have the same number of columns):

Csv1.csv

D_T  Temp (°C)  Press (Pa)  ...
2021-03-01 00:00:00+00  28  1018  ...
2021-03-02 00:00:00+00  27  1017  ...
2021-03-03 00:00:00+00  28  1019  ...
..
.. 

Csv2.csv

D_T  Temp (°C)  Vel (m/s)  Press (Pa)  ...
2022-03-01 00:00:00+00  28  118  1018  ...
2022-03-02 00:00:00+00  27  117  1019  ...
2022-03-03 00:00:00+00  28  119  1018  ...
..
.. 
  • See https://stackoverflow.com/a/24376207/3358227 for many discussions of how to do things on a list of tables, including how to make that list to begin with. It tends to use `lapply` and friends, but the premise remains solid. – r2evans Jul 07 '23 at 01:55
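
For reference, a minimal sketch of the lapply pattern that comment points to, assuming the CSV files sit in the current working directory (the final colSums() summary is only an example of a per-file operation):

library(readr)

# Collect the file names once, then read every file into a named list
csv_files <- list.files(pattern = "\\.csv$")
tables <- lapply(csv_files, read_csv)
names(tables) <- csv_files

# Any per-file operation can now be applied over the whole list,
# e.g. the number of NA values in each column of each file
lapply(tables, function(df) colSums(is.na(df)))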

1 Answer


How about something like this? It does not store the results in a variable; it writes one log file per CSV. Let me know if you need help with it.

library(readr)

# Loop over every CSV file in the working directory
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    # One log file per CSV, e.g. "Csv1.csv.log"
    out <- file(paste0(f, ".log"), open = "w")
    sapply(colnames(df), function(x) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    })
    close(out)
}

To write everything into a single log file instead:

library(readr)

# Single log file for all CSVs
out <- file("output.log", open = "w")
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    # File name as a header line, then the per-column counts
    cat(f, "\n", file = out)
    sapply(colnames(df), function(x) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    })
}
close(out)
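
If you also want the counts available in R rather than only in the log file, a rough sketch along these lines collects them into one summary data frame (the helper name summarise_file is just illustrative):

library(readr)

# Per-file summary: one row per column with its valid/null counts
summarise_file <- function(path) {
  df <- read_csv(path)
  data.frame(
    file   = path,
    column = names(df),
    valid  = colSums(!is.na(df)),
    null   = colSums(is.na(df)),
    row.names = NULL
  )
}

# Stack the summaries of all CSV files into a single data frame
all_counts <- do.call(rbind, lapply(list.files(pattern = "\\.csv$"), summarise_file))
all_counts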
Ludwig
  • Thank you for your response. But I have more than 50 csv files. (I gave an example of only 2 csv files). Could you please modify your code accordingly. – Michael_Brun Jul 06 '23 at 23:26
  • Have you tried it? It works for any number of files. Set your working directory to where the files are with ```setwd()```. ```list.files()``` lists every file in your current working directory. Inside the loop every file gets read and its contents displayed. I just created two example files to show that it works. – Ludwig Jul 06 '23 at 23:34
  • Okay, I set my working directory setwd("F:/Test") and list files as list.files(path="F:/Test/",pattern='.*csv',full.names=TRUE) and followed your code from for loop. It gives me error: Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input). Kindly help. – Michael_Brun Jul 06 '23 at 23:45
  • Try the readr version ```read_csv()``` instead of ```read.csv()``` if it worked for you before. I also posted an updated version that will write the output into a log file instead of into the console. – Ludwig Jul 07 '23 at 00:10
  • `duplicate 'row.names'` means something about that file makes reading it problematic. That has _nothing_ to do with the fact that it was called in a `for` loop; it seems more likely to me that either (a) not all files are CSV files, or (b) not all of them are well-structured CSV. Check the file that erred and check it _manually_. Can you read it by itself? Are there more `,`-delimited tokens on the first line than on the second? (A small check for this is sketched after these comments.) – r2evans Jul 07 '23 at 01:54
  • @Ludwig Instead of 1 log file for each csv input file, how I can create 1 log file for all the csv files? – Michael_Brun Jul 10 '23 at 16:37
  • @Michael_Brun added it to the answer above – Ludwig Jul 10 '23 at 23:03
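
Regarding the `duplicate 'row.names'` error discussed above, a quick way to run the token-count check r2evans suggests is base R's count.fields() (the file name here is a placeholder for whichever file fails to read):

# Number of comma-delimited fields on each line of the problem file;
# if the counts differ between lines, the file is not well-formed CSV
field_counts <- count.fields("problem_file.csv", sep = ",")
table(field_counts)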