1

I have 1000 CSV files from dfr.jstor.org with two columns, KEYWORDS and WEIGHT. The length of each column varies from file to file. Here's a snippet of one CSV file:

KEYTERMS  WEIGHT
canoe     1
archaic   0.273
pinus     0.191
florida   0.164

I want to use R to get the KEYTERMS column from each CSV file and merge it into a single data frame like this:

KEYTERMS_CSVFILENAME1 KEYTERMS_CSVFILENAME2 KEYTERMS_CSVFILENAME3
thwart                newsom                period 
dugout                site                  cypress 
sigma                 date                  hartmann 
precontact            NA                    florida 
orange                NA                    NA

Where CSVFILENAME1 is the name of the CSV file where those keywords came from and NA is an empty cell.

I think my problem is very simliar to this one with the difference that I have varying column lengths. This may also be relevant to a solution, and this looks right on topic, but I need a bit of hand-holding to make it suit my situation. Thanks in advance!

Community
  • 1
  • 1
Ben
  • 41,615
  • 18
  • 132
  • 227
  • This is fairly open ended. How about you get started using the information from the questions you linked to and then come back when you have a specific problem? – joran Nov 09 '11 at 23:34
  • Well I'm not exactly sure where to start! Any tips or clues would be much appreciated. – Ben Nov 09 '11 at 23:42
  • Try reading only two csv's into R and see if you can merge them. – joran Nov 09 '11 at 23:47
  • Right, I can get two csv file in and merge them using cbind and the function [here](http://stackoverflow.com/questions/6754474/cbind-two-data-frames-with-different-rownames-and-numbers-of-rows). Where to next? – Ben Nov 10 '11 at 00:06

3 Answers3

3

If you aren't constrained by memory space at all, something like:

datlist <- lapply(csvnames,read.csv)
maxlen <- max(sapply(datlist,nrow))
pad.NA <- function(x,len) {
   c(x,rep(NA_character_,len-length(x)))
}
keylist <- lapply(datlist,function(x) { pad.na(x[["KEYTERMS"]],maxlen) })
names(keylist) <- paste(KEYTERMS,csvnames,sep="_")
do.call(cbind,keylist)

might work (edit: added missing parenthesis, NA_character_)

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Thanks, I can see what you've done with lapply and sapply there, that's helpful. But I get this message `Error: unexpected '}' in: "c(x,rep(NA,len-length(x)) }"` at the end of the block... – Ben Nov 10 '11 at 00:10
  • .@Ben - add an end parenthesis then : `c(x,rep(NA,len-length(x)))` – Tommy Nov 10 '11 at 00:13
3

To save a LITTLE memory/time you could modify the solution from @Ben Bolker like this:

datlist <- lapply(csvnames,read.csv, colClasses=c("character", "NULL"))
rowseq <- seq_len( max(vapply(datlist,nrow, integer(1))) )
keylist <- lapply(datlist,function(x) { x[[1]][rowseq] ) })
names(keylist) <- paste(KEYTERMS,csvnames,sep="_")
#do.call(cbind,keylist)
do.call(data.frame,keylist)

...I just changed so that only the first column is read, and simplified the NA padding by observing that selecting a sequence that extends outside a character vector pads with NA automatically...

If you kept the old way of padding, you should at least pad with NA_character_ instead of NA to avoid unnecessary coercion.

I also index the KEYTERMS column by number instead of name (since there should be only one). I also changed sapply to vapply because I like it better :) - it actually is faster too.

Finally you said you wanted a data.frame. The last line produces that instead of a matrix.

Tommy
  • 39,997
  • 12
  • 90
  • 85
  • This is very helpful, thanks very much! I just took a `)` out of line two after `[rowseq]`and changed `KEYTERMS` to `"KEYTERMS"` and it's done the job perfectly. Very instructive, thanks again. – Ben Nov 10 '11 at 05:13
2

Here is a slightly simpler solution using ldply from plyr and reshape from base

# READ CSV FILES INTO LIST (i am using a dummy datlist for illustration)
# datlist <- lapply(csvnames,read.csv, colClasses=c("character", "NULL"))
datlist <- list(
  file1 = data.frame(KEYWORDS = c('thwart', 'dugout', 'sigma', 'precontact')),
  file2 = data.frame(KEYWORDS = c('newsom', 'site', 'date'))
)

# BIND THEM INTO A DATAFRAME AND RESHAPE TO DESIRED FORM
datdf <- plyr::ldply(datlist, function(x) data.frame(x, id = 1:NROW(x)))
reshape(datdf, timevar = '.id', direction = 'wide', sep = "_") 

   id KEYWORDS_file1 KEYWORDS_file2
1  1         thwart         newsom
2  2         dugout           site
3  3          sigma           date
4  4     precontact           <NA>
Ramnath
  • 54,439
  • 16
  • 125
  • 152