
Big picture: I have a large number of CSV files (all in the same format) that I am processing with R. My practice is to use `file.append` to concatenate them into a small number of larger files and process from there. Now I have a new issue: the CSV files arrive with headers, so if I just append them, the headers of every file except the first will mix with the data. I'm looking for an efficient solution; how should I tackle this while optimizing processing time and memory?

in particular:

[1] What is the most efficient way to remove the header (first line) of a file, assuming the header is very small compared to the overall size of the file?

[2] Are the resources required proportional to the file size, or is there just a fixed overhead?

[3] Is there a different approach that lets me append the files without their headers, without actually reading the whole files into memory?

Update and clarification: if I could load all the files into memory, the problem would be easy, but I cannot load all the data at once. I am pre-processing the files in order to understand which portions I require. Since there are so many files, the number of files itself becomes a bottleneck: loading each file, gathering the info I need from it, and then combining that info is where the time goes. Therefore, as a middle ground, concatenating/appending the files into larger chunks has worked well until now: `file.append` works efficiently without actually reading the files. Now that the CSV files contain headers, I would like to find a way to append them without reading all of their content. I am able to read and append them in chunks, but once again that would slow my process, adding another expensive full read of the content.
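Since `file.append` delegates the copy to the operating system anyway, one option is to delegate the header-skipping append the same way. A minimal sketch using standard Unix tools, under the assumptions that every file has exactly one header line and all headers are identical (the file names here are hypothetical stand-ins for the real CSVs):

```shell
# Hypothetical sample files standing in for the real CSVs;
# each has one header line, and the headers are identical.
printf 'id,val\n1,a\n' > part1.csv
printf 'id,val\n2,b\n' > part2.csv

head -n 1 part1.csv > big.csv      # write the header once, from the first file
for f in part1.csv part2.csv; do
  tail -n +2 "$f" >> big.csv       # append everything after line 1
done
cat big.csv
```

This streams each file rather than loading it: per-file I/O cost is proportional to file size, but memory use stays constant.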

kamashay
  • Read the files all in to a `list` and `rbind`? – A5C1D2H2I1M1N2O1R2T1 Mar 22 '17 at 17:38
  • What's your actual objective here: to get all of the CSVs into a single object in R, or to combine them into a single file on disk for some other purpose? If you want to get them into memory, there are a few different approaches; [start with this question](http://stackoverflow.com/questions/11433432/importing-multiple-csv-files-into-r). If you're trying to combine the files without reading them into memory, R probably isn't the right tool, since it inherently requires the data to be in memory (with some exceptions). – Matt Parker Mar 22 '17 at 17:52
  • the data.table package has two very efficient functions for reading and writing csv files: fread and fwrite. fread can treat the header as column names. fwrite has an append option and can write without column names. – takje Mar 22 '17 at 17:52
  • Possible duplicate of [Importing multiple .csv files into R](http://stackoverflow.com/questions/11433432/importing-multiple-csv-files-into-r) – Matt Parker Mar 22 '17 at 17:53
  • Matt - I do not want to get all the CSVs into a single object in R - I do not have the memory. Could you clarify your argument - are you saying that one cannot remove the first line of a file in R without reading the file as a whole? Does `file.append` read the files it appends? Is there no workaround to append without the first line? If R really is not the tool, can you recommend a tool on Windows? – kamashay Mar 22 '17 at 19:45
  • kamashay - exactly, for R to do that work, you'd have to read the files. `file.append` is really just a thin wrapper to your operating system's file-appending mechanism, so R is pushing that work out to the OS. You could do that directly and use the Windows command line to do this work (although I don't know which commands you'd use). I think it would be relatively straightforward in bash, too. – Matt Parker Mar 22 '17 at 19:53
  • Another alternative that lets you use your R skills would be to use the RSQLite package to set up a (free, lightweight) SQLite DB, create a table, then read one CSV file into memory at a time, append it to that table, read the next table, etc. SQLite would also let you do the heavy lifting of filtering and aggregation without needing to read the data into memory. – Matt Parker Mar 22 '17 at 19:53
  • Matt, thanks for your advice. Can you comment on how you would implement this (cat without headers) in bash? Perhaps I will consider moving the data to Linux. – kamashay Mar 22 '17 at 20:13
  • Sure - I think the `awk` approach [in this answer](http://stackoverflow.com/a/16890695/143319) looks great. There are a few other approaches in [the answers to this question](http://unix.stackexchange.com/questions/60577/concatenate-multiple-files-with-same-header), too. – Matt Parker Mar 22 '17 at 20:22
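The `awk` approach pointed to in the last comment can be sketched as follows. `FNR` is awk's per-file line counter and `NR` its global one, so `FNR == 1 && NR != 1` matches the header line of every input file except the first (sample file names and contents are hypothetical):

```shell
# Hypothetical sample inputs with identical headers.
printf 'a,b\n1,2\n' > part1.csv
printf 'a,b\n3,4\n' > part2.csv

# Keep the header from the first file only, then pass through all data rows.
awk 'FNR == 1 && NR != 1 { next } { print }' part1.csv part2.csv > combined.csv
cat combined.csv
```

Unlike the simpler `FNR > 1` form from the linked answer, this variant preserves one copy of the header, which keeps the combined file readable by `read.csv`/`fread` with column names intact.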

0 Answers