I have a question that is a combination of two different posts.
Combine (rbind) data frames and create column with name of original data frames
Compare the column names of multiple CSV files in R
I am trying to rbind / rbindlist multiple files together in a folder. Some of my variables are named differently in one file compared to another. I want to identify the specific files where the variables are mislabeled. From here, I can adjust the individual files manually. I could switch use.names to FALSE, but the columns are not in the proper order. As such, when I use rbindlist, I will usually get an extra series of varibles at the end with the different spellings used in some files.
names (data1)
id, var1, var2, var3
names (data2)
id, variable1, variable2, variable3
names (finaldata)
id, var1, var2, var3, variable1, variable2, variable3
I think the best way to handle this would be to create a new dataframe / file containing all the different files and the corresponding variable names. I can then scan this new file and see which of the original files may need editing.
First, I created a function that would read an individual file and then give me a listing of the file name and the variables in the file.
varnames <- function (file) {
temp <- read_excel (file)
colnames (temp)
x <- colnames (temp)
y <- file
z <- c (y, x)
as.data.frame (z)
}
This would give me the following output.
varname (data1)
data1
var1
var2
var3
...
Next, I can generate a listing of all files as if I were using rbindlist.
file.list <- list.files(path = "file_path", pattern='*.xlsx', full.names = FALSE)
From here, I can use cbind to bring all the elements together. Now I can use any method to identify errors (scan the new data file, create a dummy variable & subset, etc.,).
cbind(lapply(file.list, varnames))
errors <- cbind(lapply(file.list, varnames))
write.csv (cbind(lapply(file.list, varnames)), "variable_names_across_files.csv")
Here is the ideal output I am looking to generate.
filename. var_1_name var_2_name var_3_name
file1 variable1 var_2 var3
file2 variable1 variable2 variable3
file3 var1 var2 var3
file4 var1 var2 var3
file5 var1 variable2 var3
The solution I have come up with does work but is very inefficient. This code will take a long time to run. For context, I am working with about 2000 files each with about 120 cases each. Due to the final size of the dataset, this may be unavoidable but I wanted to see if there was another way to look at and approach this problem.