5

I have a directory containing close to 2,000 .csv files.

Each file has the following structure (showing 4 out of 500 rows):

                       Date;QOF
1    2004-01-04 - 2004-01-10;9
2    2004-01-11 - 2004-01-17;11
3    2004-01-18 - 2004-01-24;13
4    2004-01-25 - 2004-01-31;13

The column "QOF" is also the name of the .csv file and each file has a unique name (e.g. "MSTF", "XQS" etc.) I would like this column from each .csv file to be merged on to the first .csv file being read which also contains the date variable. In other words I want to keep all columns from the first file and merge only the second column from all other .csv files on to this file. End result should be something like:

                    Date;QOF;MSTF;XQS
1    2004-01-04 - 2004-01-10;9;10;8
2    2004-01-11 - 2004-01-17;11;11;5
3    2004-01-18 - 2004-01-24;13;31;2
4    2004-01-25 - 2004-01-31;13;45;23

So far I have tried this:

filenames <- list.files()

do.call("cbind", lapply(filenames, read.csv, header = TRUE))
Sunv
  • 239
  • 1
  • 4
  • 11
  • Use merge not cbind. Make sure to pass in appropriate arguments. – stanekam Dec 18 '13 at 18:52
  • @iShouldUseAName : How exactly would you use merge here? And what do you mean by "Make sure to pass in appropriate arguments"? – Sunv Dec 18 '13 at 18:58
  • 3
    If the first column `Date` is exactly the same in all files, then I think `cbind` is a fine approach. Try `do.call(cbind, lapply(filenames, read.table, header = TRUE, row.names = 1, sep = ';'))` – flodel Dec 18 '13 at 19:05
  • @flodel : thanks, it worked perfectly! Say there is a file in the directory where the `Date` column has fewer rows - how could I modify the program to not include that file? – Sunv Dec 18 '13 at 20:05
  • Choose a different data structure. – IRTFM Dec 18 '13 at 21:05
  • @IShouldBuyABoat : It's not possible to choose a different data structure. Do you have a suggestion for how to include a condition that only includes files with e.g. `nrow=500`, using `do.call(cbind, lapply(filenames, read.table, header = TRUE, row.names = 1, sep = ';'))`? – Sunv Jan 09 '14 at 13:51

1 Answers1

4
mybig <- do.call( rbind, lapply( listfiles, function(nam){ 
                       cbind(name=nam, read.file(paste0(nam,".csv"), header=TRUE) )
                                                }
        )              )

Untested. And notice that I intentionally did not follow the structure you suggested. I cannot thnk of a more confusing data structure to work with down the line. You might be thinking of using that format for output and would first need to build a dataframe and then write it to a file with semi-colon delimiter.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Do you have a suggestion for how to include a condition that only includes files with e.g. `nrow=500`, using `do.call(cbind, lapply(filenames, read.table, header = TRUE, row.names = 1, sep = ';'))`? – Sunv Jan 09 '14 at 13:55
  • 1
    If you are asking to include files with 500 or more lines then you would first need to import and then test whether the file was sufficient. There is no line count attribute that the system could use to pre-qualify files. I suppose you could come up with a size surrogate. Take a look at: `names( file.info(dir()) )` – IRTFM Jan 09 '14 at 20:18
  • 1
    The other thought would be to use a system utility like 'awk' to pre-qualify the files. I'm not a regular user of 'awk' but I feel confident that you could get worked examples of appropriate doce to deliver line counts and then loop through `dir()` or `list.files()`. – IRTFM Jan 11 '14 at 17:52
  • If I wanted to merge these files using the date variable as the "by" variable instead of `cbind`, is there a convenient way to do this when dealing with 2,000 .csv files? – Sunv Jan 12 '14 at 17:38
  • The answer will depend on features of the problem that you have not described and more importantly have not illustrated with a reproducible example. It would be critical to know whether the same dates will be appearing in all of the files or whether some of the data will be missing. It appears that you want to first buils an R data structure and then output it as a semi-colon-separated file. You should either edit this question or post a new one to include two or three test fails with a degree of complexity that answers the outstanding questions. – IRTFM Jan 12 '14 at 18:27