I would like to create one dataframe from several .csv files without losing any columns (i.e. for any .csv that doesn't have a particular column, that space would be filled with NA). I would like this process to align the files by column name, because the order of the columns across the .csv's does not always match.

I've created a list of .csv files from a folder that contains only those files:

files <- dir("C:/...")

I would like to read in these .csv files into one dataframe. What I've got so far...

table_all <- do.call(rbind.fill(ldply(files, read.csv, 
stringsAsFactors= TRUE, header= T, sep= ",")))

I assume the solution involves do.call and some combination of rbind, bind_rows or rbind.fill. I've read a bit about rbindlist being computationally lighter, but it only matches by position, and as my .csv's have columns out of order, I need something to match by name.

LukeP
  • Do you know which files have different columns? Do you have a set of predetermined columns? The solution might involve comparing the columns you have with the ones you are missing. Without a reproducible example, it's hard to know how to write an `if` statement that catches them. – Matias Andina Oct 04 '19 at 18:58
  • Tidyverse's `bind_rows` matches by name. See the answer by tchakravarty here: https://stackoverflow.com/questions/37502991/reading-many-csv-files-at-the-same-time-in-r-and-combining-all-into-one-datafram – GenesRus Oct 04 '19 at 22:12
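
To illustrate the name matching mentioned in the comment above, here is a minimal sketch. The data frames `a` and `b` are hypothetical stand-ins for two .csv files whose columns differ and appear in a different order; it assumes the dplyr package is installed.

```r
library(dplyr)

# Two hypothetical tables: different columns, different order
a <- data.frame(id = 1:2, name = c("x", "y"))
b <- data.frame(score = c(0.5, 0.9), id = 3:4)

# bind_rows() aligns columns by name and fills the gaps with NA
combined <- bind_rows(a, b)

names(combined)  # union of the column names: id, name, score
combined
```

Rows that come from `a` get NA in `score`, and rows that come from `b` get NA in `name`, which is the behaviour the question asks for.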

1 Answer

The general way of solving this problem takes a few steps. See the sketch below (until we get a better handle on your particular example):

library(purrr)
library(dplyr)

# step 1 -- list files and prepare columns
file_list <- list.files(path = "your_path",
                        pattern = "your_pattern",
                        full.names = TRUE)
all_columns <- c("list", "your", "columns", "here")
# ideally all_columns will come from names(df),
# with df being your most complete df

# step 2 -- read and match columns before binding
li <- purrr::map(file_list, function(file) {
  df <- read.csv(file)
  current_names <- names(df)

  # find which names are missing and add each one as an NA column
  missing_names <- setdiff(all_columns, current_names)
  for (nm in missing_names) df[[nm]] <- NA

  df
})

# step 3 -- bind (bind_rows matches columns by name)
output <- dplyr::bind_rows(li)
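
If no single file is known to be the most complete, `all_columns` could instead be built from the files themselves. The following is a minimal base-R sketch of that idea, not a definitive implementation: the folder `csv_demo` and the files `first.csv` and `second.csv` are hypothetical and are written to a temporary directory only so the example is self-contained.

```r
# Hypothetical demo files, written to a temporary folder
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, name = c("a", "b")),
          file.path(demo_dir, "first.csv"), row.names = FALSE)
write.csv(data.frame(score = c(9.5, 7), id = 3:4),
          file.path(demo_dir, "second.csv"), row.names = FALSE)

# Read every .csv in the folder into a list of data frames
file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(file_list, read.csv, stringsAsFactors = FALSE)

# Union of every column name seen in any file
all_columns <- unique(unlist(lapply(tables, names)))

# Add each missing column as NA, then put the columns in one common order
aligned <- lapply(tables, function(df) {
  for (nm in setdiff(all_columns, names(df))) df[[nm]] <- NA
  df[all_columns]
})

# Every table now has the same columns, so rbind() can stack them
output <- do.call(rbind, aligned)
output
```

Because every table has the same set of columns after the alignment step, plain `rbind` is enough here and no extra package is required.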
Matias Andina