
I have a question that is a combination of two different posts.

Combine (rbind) data frames and create column with name of original data frames

Compare the column names of multiple CSV files in R

I am trying to rbind / rbindlist multiple files together from a folder. Some of my variables are named differently in one file compared to another. I want to identify the specific files where the variables are mislabeled; from there, I can adjust the individual files manually. I could set use.names to FALSE, but the columns are not in the same order across files. As it stands, when I use rbindlist I usually get an extra series of variables at the end, holding the different spellings used in some files.

names(data1)
id, var1, var2, var3
names(data2)
id, variable1, variable2, variable3

names(finaldata)
id, var1, var2, var3, variable1, variable2, variable3
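
For illustration, here is a minimal data.table sketch of that behaviour (the toy tables are my own, not the real files):

library(data.table)

d1 <- data.table(id = 1, var1 = 1, var2 = 2, var3 = 3)
d2 <- data.table(id = 2, variable1 = 1, variable2 = 2, variable3 = 3)

# use.names = TRUE with fill = TRUE pads mismatched columns with NA,
# producing the extra trailing variables described above
rbindlist(list(d1, d2), use.names = TRUE, fill = TRUE)
##    id var1 var2 var3 variable1 variable2 variable3
## 1:  1    1    2    3        NA        NA        NA
## 2:  2   NA   NA   NA         1         2         3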

I think the best way to handle this would be to create a new data frame / file containing all the different files and their corresponding variable names. I can then scan this new file and see which of the original files need editing.

First, I created a function that would read an individual file and then give me a listing of the file name and the variables in the file.

library(readxl)

# read one file and return its name followed by its column names
varnames <- function(file) {
  temp <- read_excel(file)
  x <- colnames(temp)
  y <- file
  z <- c(y, x)
  as.data.frame(z)
}

This would give me the following output.

varnames("data1.xlsx")
    data1.xlsx
    var1
    var2
    var3
    ...

Next, I can generate a listing of all files as if I were using rbindlist.

file.list <- list.files(path = "file_path", pattern='*.xlsx', full.names = FALSE)

From here, I can use cbind to bring all the elements together. Then I can use any method to identify errors (scan the new data file, create a dummy variable and subset, etc.).

    errors <- do.call(cbind, lapply(file.list, varnames))
    write.csv(errors, "variable_names_across_files.csv")
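
As a variant (a sketch only, not the code I am running; header_row is a hypothetical helper), the per-file name vectors could instead be bound row-wise with rbindlist(fill = TRUE), giving one row per file, closer to the ideal output shown below:

library(data.table)
library(readxl)

# one row per file: V1 = first column name, V2 = second, ...
header_row <- function(file) {
  as.data.table(as.list(colnames(read_excel(file))))
}

headers <- rbindlist(lapply(file.list, header_row), fill = TRUE)
setnames(headers, paste0("var_", seq_along(headers), "_name"))
headers <- cbind(filename = basename(file.list), headers)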

Here is the ideal output I am looking to generate.

filename  var_1_name  var_2_name  var_3_name
file1     variable1   var_2       var3
file2     variable1   variable2   variable3
file3     var1        var2        var3
file4     var1        var2        var3
file5     var1        variable2   var3
 

The solution I have come up with works, but it is very inefficient and takes a long time to run. For context, I am working with about 2,000 files of roughly 120 cases each. Given the final size of the dataset this may be unavoidable, but I wanted to see if there is another way to look at and approach this problem.

2 Answers


First convert each file to CSV, using Excel, the fast xlsx2csv utility (written in Rust), or another method. The code below assumes that all the files are in the current directory and that every .xlsx file in that directory is wanted:

# convert one .xlsx file to a .csv with the same base name
xlsx2csv <- \(x) system(sprintf("xlsx2csv %s %s", x, sub("xlsx$", "csv", x)))
lapply(Sys.glob("*.xlsx"), xlsx2csv)

We assume each file has the same number of columns. First run this to produce a matrix in which each row shows the headings of one file; the file names become the row names of that matrix. Using that, manually fix the column names to make them consistent. They do not have to be in the same order.

Files <- Sys.glob("*.csv")
t(mapply(read.table, Files, sep = ",", nrows = 1))

Then use this to read them all into a single data frame.

library(dplyr)
bind_rows(Map(read.csv, as.list(Files)))

Reproducible Test

library(dplyr)

# write test files
write.csv(BOD[1:3,], "BOD.csv", quote = FALSE, row.names = FALSE)
write.csv(BOD[1:2, 2:1], "BOD2.csv", quote = FALSE, row.names = FALSE)
 
Files <- Sys.glob("BOD*.csv")
t(mapply(read.table, Files, sep = ",", nrows = 1))
##          V1       V2      
## BOD.csv  "Time"   "demand"
## BOD2.csv "demand" "Time"  

# looks ok so proceed - if not we would edit headings first
bind_rows(Map(read.csv, as.list(Files)))
##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    1    8.3
## 5    2   10.3
G. Grothendieck

Some of the problem might be that Excel files are slower to read than CSV files. One small tweak is to add the n_max = 0 argument so that only the column names are read. This is illustrated using a varnames_short version of your function:

dir.create("temp_excels")

# write three test files, each 100,000 rows by 16 columns
for (i in 1:3) {
  sapply(sample(letters, 15), \(x) runif(n = 100000), simplify = FALSE) |> 
    {\(x) append(list(group = sample(LETTERS, 100000, replace = TRUE)), x)}() |> 
    as.data.frame() |> 
    openxlsx::write.xlsx(sprintf("temp_excels/testfile%d.xlsx", i))
}


library(readxl)

# original: reads the whole file just to get its column names
varnames <- function(file) {
  temp <- read_excel(file)
  x <- colnames(temp)
  y <- file
  z <- c(y, x)
  as.data.frame(z)
}

# tweaked: n_max = 0 reads only the header row
varnames_short <- function(file) {
  temp <- read_excel(file, n_max = 0)
  x <- colnames(temp)
  y <- file
  z <- c(y, x)
  as.data.frame(z)
}

file.list <- list.files(path = "temp_excels", pattern='*.xlsx', full.names = TRUE)

microbenchmark::microbenchmark(
  cbind(lapply(file.list, varnames)),
  cbind(lapply(file.list, varnames_short)),
  times = 3
)
#> Unit: seconds
#>                                      expr      min       lq     mean   median
#>        cbind(lapply(file.list, varnames)) 5.261145 5.281157 5.319038 5.301169
#>  cbind(lapply(file.list, varnames_short)) 2.566850 2.662395 2.701599 2.757940
#>        uq      max neval cld
#>  5.347984 5.394799     3  a 
#>  2.768974 2.780008     3   b

This might roughly halve your run time when the datasets are long, though it may not do everything you need. You can also parallelise the file reading.
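
For instance, a minimal sketch using the base parallel package (the worker count is illustrative; adjust it to your machine):

library(parallel)

cl <- makeCluster(4)                  # 4 workers, illustrative
clusterEvalQ(cl, library(readxl))     # load readxl on each worker
clusterExport(cl, "varnames_short")   # ship the function to the workers
res <- parLapply(cl, file.list, varnames_short)
stopCluster(cl)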

Andy Baxter