Trimming big data

Question

I am working on an issue similar to the one described in this other posting. I tried adapting its code to select the columns I am interested in and to fit my data file.

My issue, however, is that the resulting file has become larger than the original one, and I'm not sure the code is working the way I intended.

When I open the output in SPSS, the dataset seems to have taken in the header line and then made endless copies of the second line, millions of them (I had to force-stop the process).

I noticed there's no counter in the while loop specifying the line; might this be the cause? My background in programming with R is very limited. The file is a .csv and is 4.8GB with 329 variables and millions of rows. I only need to keep around 30 of the variables.

This is the code I used:

##Open separate connections to hold cursor position

file.in <- file('npidata_20050523-20130707.csv', 'rt')
file.out<- file('Mainoutnpidata.txt', 'wt')
line<-readLines(file.in,n=1)
line.split <-strsplit(line, ',')

##Column picking, only column 1

cat(line.split[[1]][1:11],line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311], sep = ",", file = file.out, fill= TRUE)

##Use a loop to read in the rest of the lines
line <-readLines(file.in, n=1)
while (length(line)){
    line.split <-strsplit(line, ',')
    if (length(line.split[[1]])>1) {
        cat(line.split[[1]][1:11],line.split[[1]][23:25], line.split[[1]][31:33], line.split[[1]][308:311],sep = ",", file = file.out, fill= TRUE)
    }
}
close(file.in)
close(file.out)

score 1 · Answer 1 · answered Jul 19 '13 at 14:07

One thing that jumps out is that you are missing a line <- readLines(file.in, n=1) inside your while loop. Because line is never updated, the loop never ends and keeps writing the same row, which is why the output grows past the size of the input. Also, reading only one line at a time is going to be terribly slow.
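For reference, here is a minimal base R sketch of the loop from the question with the read moved inside the loop and done in chunks rather than line by line. The function name trim_csv and its arguments keep and chunk_size are made up for illustration. It assumes every row has the same number of fields and that no quoted field contains a line break. Each chunk is parsed with read.csv() rather than strsplit(), so commas inside quoted fields do not shift the columns.

```r
# Hypothetical helper: copy selected columns of a CSV to a new file in chunks.
# Assumes a constant number of fields per row and no line breaks inside fields.
trim_csv <- function(path_in, path_out, keep, chunk_size = 10000L) {
  con_in  <- file(path_in, "rt")
  con_out <- file(path_out, "wt")
  on.exit({ close(con_in); close(con_out) })

  repeat {
    # Reading inside the loop advances the cursor on every pass
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break        # end of file reached
    lines <- lines[nzchar(lines)]         # drop empty lines
    if (length(lines) == 0L) next

    # Parse the chunk as CSV; the header line passes through as an ordinary row
    d <- read.csv(text = lines, header = FALSE, colClasses = "character")

    # Append the selected columns to the already open output connection
    write.table(d[, keep, drop = FALSE], file = con_out, sep = ",",
                row.names = FALSE, col.names = FALSE, qmethod = "double")
  }
  invisible(NULL)
}
```

With the files from the question this would be called as trim_csv('npidata_20050523-20130707.csv', 'Mainoutnpidata.txt', keep = c(1:11, 23:25, 31:33, 308:311)). I have not run it on the real file, so treat it as a starting point.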

If in your file (unlike the one in the example you linked to) every row contains the same number of columns, you could use my LaF package. This should result in something along the lines of:

library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
begin(laf)
con <- file("Mainoutnpidata.txt", 'wt')
while(TRUE) {
  d <- next_block(laf, columns = c(1:11, 23:25, 31:33, 308:311))
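  if (nrow(d) == 0) break
  # write.csv() has no header argument, so use write.table() and
  # suppress the column names (the output will have no header row)
  write.table(d, file=con, sep=",", row.names=FALSE, col.names=FALSE, qmethod="double")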
}
close(con)
close(laf)

If your 30 columns fit into memory you could even do:

library(LaF)
m <- detect_dm_csv("npidata_20050523-20130707.csv", header=TRUE)
laf <- laf_open(m)
d <- laf[, c(1:11, 23:25, 31:33, 308:311)]
close(laf)

I couldn't test the code above on your file, so can't guarantee there are no errors (let me know if there are).
