3

I have a relatively large csv file (1.2 GB... large relative to the 2 GB of RAM on one of my computers). To every entry in one column I would like to append "1C" so that I can join/merge with another dataframe/db table.

If the file weren't so large, it would be easy to use `read.csv` to import the data, then use `data$symbol <- paste(data$symbol, "1C", sep="")`. But now I get the "cannot allocate vector of size x" error.

Is a manual solution, like `scan()`, my only option? (I'm a bit afraid of corrupting my data.) Thanks!

Richard Herron

4 Answers


This is kinda ugly, but you could read the file line by line and write it out to another file, line by line. That way you never read the whole thing into memory. This is not unlike Joshua's awk approach, but using R (and considerably slower). The following example is one I cobbled together to remove dollar signs and percent signs from a csv; you can alter the logic in the middle to do what you need.

con  <- file(inputFile, open = "r")
con2 <- file(temporaryFile, open = "w")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    oneLine <- gsub("%", "", oneLine)     #strip out the percent signs
    oneLine <- gsub("\\$", "", oneLine)   #strip out the dollar signs
    cat(oneLine, file = con2, sep = "\n") #spit the line back out
}

close(con)
close(con2)
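The same streaming idea, applied to the asker's actual task of appending "1C" to one column, can also be sketched outside R. Here is a minimal Python version; the file names and the column position (second column, index 1) are assumptions for illustration:

```python
import csv

def append_1c(input_path, output_path, col=1):
    """Stream a CSV, appending "1C" to one column, one row at a time.

    Only a single row is held in memory, so file size is not limited
    by available RAM.
    """
    with open(input_path, newline="") as src, \
         open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            row[col] = row[col] + "1C"  # append the merge suffix
            writer.writerow(row)
```

Like the R loop above, this trades speed for a constant memory footprint.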
JD Long

Using `scan` isn't going to help if you can already get your data into R.

Make sure `data` only has the columns you need for the merge, and run `gc()` before you try your `paste` command (`gc` helps if you're near the margin of your memory limit).

If that fails, look at some of the solutions in this thread.


UPDATE:
And if you happen to be using a flavor of *nix, or if you have Rtools installed on Windows, you could do this with gawk. If your data are in foo.csv and you want to add "1C" to the second column, this will create a new file, bar.csv, with "1C" appended to the second column.

compy: /home/josh
> cat foo.csv 
1,one,2,two
3,three,4,four
5,five,6,six

compy: /home/josh
> gawk -F "," '{OFS=","; $2=($2 "1C"); print}' < foo.csv > bar.csv

compy: /home/josh
> cat bar.csv 
1,one1C,2,two
3,three1C,4,four
5,five1C,6,six

This will likely be faster than R and will consume a negligible amount of memory.

Joshua Ulrich
  • @Joshua -- Thanks for the link! But after I get the whole csv in RAM, I don't have enough memory to `paste` back to itself (even with an element by element loop). Before I do the merge, I select just a few columns from the SQLite db that holds my data, so there's no memory issue there. I would prefer to append the "1C" before I add the csv to the database so that I can do a `join` there instead of a `merge` in R. Thanks! – Richard Herron Nov 02 '10 at 21:40
  • @Joshua -- I'm either on either Ubuntu or OSX, so I'll give this a try! – Richard Herron Nov 02 '10 at 21:43
  • @Joshua -- Thanks! Exactly what I was looking for! I'm still a nub at linux, so I need to dig more into these basic tools! – Richard Herron Nov 02 '10 at 21:56
  • 2
    @richardh if you are brining this into SQLite is there a compelling reason not to do the append operation inside of SQLite? – JD Long Nov 02 '10 at 22:01
  • @JD -- None other than ignorance :). I read over the FAQ and did some fruitless googling before I posted. Is it the `||` operator? – Richard Herron Nov 02 '10 at 22:12
  • 1
    Yep, you got it! Do the ||, then index, if needed, then join. – JD Long Nov 02 '10 at 23:11
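The `||` append inside SQLite that the comments settle on can be sketched with Python's built-in sqlite3 driver; the table and column names here are invented for illustration:

```python
import sqlite3

# In-memory database standing in for the asker's SQLite file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (symbol TEXT, price REAL)")
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [("AAPL", 1.0), ("MSFT", 2.0)])

# || is SQLite's string-concatenation operator, so the suffix can be
# appended inside the database without ever loading the data into R.
con.execute("UPDATE prices SET symbol = symbol || '1C'")
print(con.execute("SELECT symbol FROM prices").fetchall())
# → [('AAPL1C',), ('MSFT1C',)]
```

After the update, the column is ready for a `join` in SQLite rather than a `merge` in R.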

Read only the one column (and what I am assuming might be a leading unique identifier for later merge operations) into memory:

twocol <- read.table(file="what.csv",
                     colClasses=c("numeric", "character",
                                  rep("NULL", <number-of-remaining-columns>)),
                     sep=",")
twocol$V2 <- paste(twocol$V2, "1C", sep="")
merge(twocol, otherdf, by.x="V1" , ....)
IRTFM
  • Thanks! This is similar to what I was doing. My office computer is a bit old and couldn't handle the first two lines. The `merge` was fine b/c I was subsetting in SQLite before the merge, so that it was merging dataframes much smaller than the original csv file. – Richard Herron Nov 02 '10 at 22:41

Appends 1C to the second column:

perl -F, -lane '$F[1] .= "1C"; print join ",", @F' file.csv

These command-line options are used:

  • -n loop over every line of the input file
  • -l removes newlines before processing, and adds them back afterwards
  • -a autosplit mode – splits input lines into the @F array. Defaults to splitting on whitespace.
  • -e execute the Perl code
  • -F autosplit modifier, in this case splits on ,

@F is the array of words in each line, indexed starting with $F[0]
.= is the "append to" operator

Chris Koknat