
I am currently using the function below to read in and combine several (7) CSVs in R.

library(data.table)

csv_append <- function(file_path = filePath){
  # Find all files in the folder matching the naming pattern
  files <- grep(list.files(path = file_path, full.names = TRUE),
                pattern = "final_data_dummied_", value = TRUE)
  # Load all files into a list of data.tables
  df_list <- lapply(files, fread, nThread = 4)
  # Stack them, filling columns missing from some files with NA
  DT <- rbindlist(df_list, fill = TRUE)
  # Convert the data.table to a data.frame (setDF works by reference)
  df_seg <- setDF(DT)
  rm(list = c("DT", "df_list"))
  # Replace missing values with 0
  df_seg[is.na(df_seg)] <- 0
  return(df_seg)
}

However, the original files are large (~0.5 million rows and ~3500 columns each). The number of columns varies from 3400 to 3700, and when I combine these files R gives a memory error: `cannot allocate vector of size 85Gb`. I am thinking that if I take the intersection of the columns across all the CSVs and read in only those columns from each file, it might solve the problem. But I am not sure how to do that while reading in the files.

Can someone please help me with this?

    First, read in the first line of every CSV to get all the column names (use `fread` with `nrow = 1`). Then use `Reduce` with `intersect` to find the common columns. Then use one [of the standard methods for reading only some columns](https://stackoverflow.com/a/33201353/903061) - you seem to be using `data.table` so the `fread` method is probably best. See the question I linked or `?fread` for details. – Gregor Thomas Jun 07 '18 at 14:15
    And, while generally I am a strong advocate for using lists of data frames, if you are bumping up against memory limits you may do better to use a for loop and just go one file at a time rather than all at once. – Gregor Thomas Jun 07 '18 at 14:19
  • thanks a lot @Gregor. I am going to try it out!! – Shuvayan Das Jun 07 '18 at 14:24
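
Following the approach suggested in the comments, a minimal sketch of what that could look like (untested; it assumes `data.table`'s `fread` with its `select` argument, and that the header of each file can be read cheaply with `nrows = 0`):

library(data.table)

csv_append <- function(file_path = filePath){
  files <- grep(list.files(path = file_path, full.names = TRUE),
                pattern = "final_data_dummied_", value = TRUE)
  # Read only the headers (zero data rows) to get each file's column names
  header_list <- lapply(files, function(f) names(fread(f, nrows = 0)))
  # Keep only the columns present in every file
  common_cols <- Reduce(intersect, header_list)
  # Read just those columns from each file, then stack
  df_list <- lapply(files, fread, select = common_cols, nThread = 4)
  df_seg <- setDF(rbindlist(df_list, fill = TRUE))
  rm(df_list)
  df_seg[is.na(df_seg)] <- 0
  df_seg
}

Since only the common columns are read, `fill = TRUE` is no longer strictly needed, but it is harmless to keep.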

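And a rough sketch of the one-file-at-a-time variant from the second comment, replacing the list of data frames with an incremental `rbindlist` so only one new file is held alongside the accumulated result at any time (again untested; `csv_append_loop` is just an illustrative name):

library(data.table)

csv_append_loop <- function(file_path = filePath){
  files <- grep(list.files(path = file_path, full.names = TRUE),
                pattern = "final_data_dummied_", value = TRUE)
  # Common columns, found from the headers as above
  common_cols <- Reduce(intersect,
                        lapply(files, function(f) names(fread(f, nrows = 0))))
  DT <- NULL
  for (f in files) {
    chunk <- fread(f, select = common_cols, nThread = 4)
    # rbindlist() skips NULL elements, so the first iteration just keeps chunk
    DT <- rbindlist(list(DT, chunk), fill = TRUE)
    rm(chunk); gc()
  }
  df_seg <- setDF(DT)
  df_seg[is.na(df_seg)] <- 0
  df_seg
}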