0

how can I import csv files according to their column names instead of their file names? Let's say i have 2 different csv files in my working directory. One of the files names is "interesting.csv" with column names as follows: "interesting1" "interesting2" "interesting3".

I am looking for something that scans this folder, this working directory, and check the files by their column names so i can read in the file "interesting.csv" with read_csv2. I would like to know how it's done, because sometimes the file name would change. Let's say "interesting.csv" becomes "interesting_cool.csv", the column names wouldn't change, though. In this case read_csv2("interesting.csv") wouldn't work, because there wouldn't be such a file.

Is there a function, that "scans" all the files in the folder for their headers and compare it to the names i supply to the Rscript?

Something like this: read_csv2(find_file_with_headers("interesting1", "interesting2", "interesting3"))

I am sorry if this is a duplicate, i couldn't find what i need, though.

Regards.

Update to Ronan's approach:

file_list1 <- list.files(getwd(), full.names = TRUE, pattern = "\\.csv$")
file_list2 <- list.files(getwd(), full.names = TRUE, pattern = "\\.CSV$")
(file_list <- c(file_list1, file_list2)); rm(file_list1, file_list2)

col_names = c("interesting1" "interesting2" "interesting3")

file_index <- which(sapply(file_list, function(x) 
  all(col_names %in% names(read.csv2(x, nrows = 0)))))[1]

return(read.csv2(file_list[file_index]))

If i split it up like this, file_index works fine while "file_index" will result in one NA. If the headers fit, this shouldn't happen, right? Therefor return wouldn't work either and give out the error: Error in file(file, "rt") : invalid 'description' argument

Dutschke
  • 277
  • 2
  • 15
  • The only way I see would be to read the first line of all the csv files get their headers then if the header match some criteria import the file. You might want to reconsider this strategy though. It is much more tractable to keep the file name of the table fixed. A golden mean could be to increment the filename by a prefix say, interesting_6.csv, nteresting_7.csv and importing the latest one. – DJJ Sep 02 '20 at 08:56
  • Can you print what do you get in `head(file_list)` for me? – Ronak Shah Sep 03 '20 at 09:15
  • [1] "C:/Users/Dutschke/file1.csv" "C:/Users/Dutschke//file2.csv" [3] "C:/Users/Dutschke/file3.csv" "C:/Users/Dutschke/file4.csv" [5] "C:/Users/Dutschke/file5.CSV" and one of these files has the columns "interesting1" "interesting2" "interesting3" – Dutschke Sep 03 '20 at 10:01
  • `file_index` should have a file number which has all the columns in `col_names`. So I think none of the file have all the columns that is why it returns `NA`. Make sure there is no typo, leading/lagging whitespace in column name. They need to match exactly. What does `names(read.csv2(file_list[1], nrow = 0))` return? – Ronak Shah Sep 03 '20 at 10:03
  • I see now what the problem is. R does something weird. The first column is "IK" and when i do ```names(read.csv2(file_list[1], nrow = 0))``` "ï..IK". Also for every "-" R shows me "." Changed it and your code works. Jesus... one last question. Why did you change the read_csv2 into read.csv2? – Dutschke Sep 03 '20 at 10:18
  • That is because I couldn't find `nrows` or similar argument in `read_csv2` which would basically read only the header information and not the entire file from all the files. – Ronak Shah Sep 03 '20 at 10:23
  • read.csv2 and read_csv2 give me the same results, though + read_csv2 gives me parsing information. – Dutschke Sep 03 '20 at 10:26
  • 1
    Yes, results would be same but if your files are very big `read_csv2` would be slower since `read_csv2` is reading all the lines of the file vs `read.csv2` reading only 1 (header) line. – Ronak Shah Sep 03 '20 at 10:27

1 Answers1

2

Not sure if there is an out-of-the-box solution in R.

Here is one way where you read the column names of all the files in the folder and return the complete file where all the column names passed are matched.

return_correct_file <- function(path, col_names) {

  file_list <- list.files(path, full.names = TRUE)
  file_index <- which(sapply(file_list, function(x) 
                     all(col_names %in% names(read.csv2(x, nrows = 0)))))[1]
  return(read.csv2(file_list[file_index]))
}

You can call this function as :

data <- return_correct_file(path = 'path/to/csv/files', 
         col_names = c("interesting1", "interesting2", "interesting3"))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Error in file(file, "rt") : invalid 'description' argument – Dutschke Sep 02 '20 at 12:27
  • @Dutschke It means that your file path is not correct. https://stackoverflow.com/questions/14420936/error-in-filefile-rt-invalid-description-argument-in-complete-cases-pro Did you change `'path/to/csv/files'` according to your system path? Try to read one file first `file_list <- list.files('path/to/csv/files', full.names = TRUE)` and `read.csv2(file_list[1])`. What does it return? – Ronak Shah Sep 02 '20 at 13:30
  • If I do it like this, R imports the file, without giving Error or Warning. Trying it again with all files now gives me: Error in file(file, "rt") : cannot open the connection In addition: Warning message: In file(file, "rt") : Error in file(file, "rt") : cannot open the connection...... And yes, i already checked this https://stackoverflow.com/questions/27466317/error-in-filefile-rt-cannot-open-the-connection out but nothing there helped. – Dutschke Sep 02 '20 at 13:45
  • That's weird. I cannot reproduce this on my system. It works fine for me with or without function. I hope you have `full.names = TRUE` in your `list.files`. What if you remove the `file_list <- list.files(path, full.names = TRUE)` from the function and pass `file_list` as an argument to function instead? Does that change anything? – Ronak Shah Sep 02 '20 at 13:51
  • So the second Error was because in the folder i also had other files than csv. I filtered it with pattern = "\\.csv$" but the following Error still remains: Error in file(file, "rt") : invalid 'description' argument... also if i remove what you wrote. – Dutschke Sep 02 '20 at 14:08
  • https://stackoverflow.com/questions/14420936/error-in-filefile-rt-invalid-description-argument-in-complete-cases-pro i don't see how this is helping me. – Dutschke Sep 02 '20 at 14:16
  • The simple meaning of the error message is that the file path is not correct. Break down into even smaller steps. Remove everything from the function and keep only one line hardcoding path of one random file `read.csv2('/path/anyfile.csv')` and run the function. Does it work? – Ronak Shah Sep 02 '20 at 14:20
  • Updated the question and broke it down. The problem is, that i don't really understand, what the result of file_index should look like. – Dutschke Sep 03 '20 at 08:37