5

I have a very large data file (gigabytes in size). If I try to open it with R, I get an out-of-memory error.

I need to read the file line by line and do some analysis. I found a previous question on this issue where the file was read n lines at a time, jumping to a given clump of lines. I have used the answer by "Nick Sabbe" and added some modifications to fit my needs.

Consider the following test.csv file, a sample of the full file:

A    B   C
200  19  0.1
400  18  0.1
300  29  0.1
800  88  0.1
600  80  0.1
150  50  0.1
190  33  0.1
270  42  0.1
900  73  0.1

I want to read the content of the file line by line and perform my analysis. So I created the following loop, based on the code posted by "Nick Sabbe". I have two problems: 1) the header is printed each time I print a new line; 2) the index "X" column added by R is also printed, even though I am deleting this column.

Here is the code I'm using:

test <- function() {
  prev <- 0
  for (i in 1:100) {
    j <- i - prev
    test1 <- read.clump("file.csv", j, i)
    print(test1)
    prev <- i
  }
}
####################
# Code by Nick Sabbe
####################
read.clump <- function(file, lines, clump, readFunc = read.csv,
                       skip = (lines * (clump - 1)) +
                         ifelse((header) & (clump > 1) & (!inherits(file, "connection")), 1, 0),
                       nrows = lines, header = TRUE, ...) {
  if (clump > 1) {
    colnms <- NULL
    if (header) {
      colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
      # print(colnms)
    }
    p <- readFunc(file, skip = skip, nrows = nrows, header = FALSE, ...)
    if (!is.null(colnms)) {
      colnames(p) <- colnms
    }
  } else {
    p <- readFunc(file, skip = skip, nrows = nrows, header = header)
  }
  p$X <- NULL   # Note: here I'm setting the index column to NULL
  return(p)
}

The output I'm getting:

       A       B    C
1      200      19   0.1
  NA   1       1     1
1  2   400     18   0.1
  NA   1       1    1
1  3   300     29   0.1
  NA   1       1    1
1  4   800     88   0.1
  NA   1       1    1
1  5   600     80   0.1

This is what I want to get rid of for the rest of the reads:

 NA   1       1     1

Also, is there any way to make the for loop stop at the end of the file, like an EOF check in other languages?

SimpleNEasy
  • This seems incredibly inefficient. Is it absolutely essential that you do this line by line and using `for` iterators? Surely you can make life easier by using vectorized computation in R? – n.e.w Dec 04 '12 at 21:32

2 Answers

5

Maybe something like this can help you:

inputFile <- "foo.txt"
con  <- file(inputFile, open = "r")
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  myLine <- unlist((strsplit(oneLine, ",")))
  print(myLine)
} 
close(con)
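Note that readLines returns character strings, so the split fields need to be converted before doing numeric work. A self-contained sketch of the same loop (it first writes a small sample file like the one in the question, then computes per-column sums as a stand-in for your analysis):

```r
# Write a small sample file like the one in the question,
# so this sketch is self-contained
writeLines(c("A,B,C",
             "200,19,0.1",
             "400,18,0.1"), "test_sample.csv")

con <- file("test_sample.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # read the header once
sums <- numeric(length(header))
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  vals <- as.numeric(strsplit(oneLine, ",")[[1]])    # character -> numeric
  sums <- sums + vals                                # example per-line analysis
}
close(con)
names(sums) <- header
print(sums)
```

The while condition doubles as the EOF check: readLines returns a zero-length vector at the end of the connection, which stops the loop.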

Or use scan to avoid splitting, as @MatthewPlourde suggested.

I use scan: I skip the header, and quiet = TRUE so as not to get a message saying how many items have been read.

con <- file(inputFile, open = "r")
readLines(con, n = 1)  # consume the header line once, outside the loop
while (length(myLine <- scan(con, what = numeric(), nlines = 1,
                             sep = ",", quiet = TRUE)) > 0) {
   ## here I print, but you should process your line here
   print(myLine)
}
close(con)

(Using what = numeric() makes scan return numbers directly instead of strings, and skipping the header once before the loop avoids skip = 1 silently discarding a line on every iteration.)
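A self-contained version of the scan loop, using a sample file like the one in the question (the running sum of column A is just a placeholder analysis):

```r
# Write a sample file like the one in the question, then scan it line by line
writeLines(c("A,B,C", "200,19,0.1", "400,18,0.1", "300,29,0.1"),
           "scan_sample.csv")

con <- file("scan_sample.csv", open = "r")
header <- scan(con, what = character(), nlines = 1, sep = ",", quiet = TRUE)
total <- 0
while (length(myLine <- scan(con, what = numeric(), nlines = 1,
                             sep = ",", quiet = TRUE)) > 0) {
  names(myLine) <- header           # attach column names to the row vector
  total <- total + myLine["A"]      # example analysis: running sum of column A
}
close(con)
print(total)                        # 200 + 400 + 300 = 900
```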
agstudy
  • +1 This is the way I do it. You could use `scan` instead of `readLines`, and avoid splitting. – Matthew Plourde Dec 04 '12 at 21:53
  • @MatthewPlourde I updated my answer, but I wonder if readLines is more efficient. – agstudy Dec 04 '12 at 22:12
  • I have tried the answer. I have noticed that the results are string type, not numeric type. Also, how can I get rid of the read message (Read 3 items)? Sample of output: [1] "A " "B" "C" Read 3 items [1] "200" "19" "0.1" Read 3 items. Numbers are between " " with read statistics. Thanks for your response... – SimpleNEasy Dec 04 '12 at 23:57
  • @Eng.Mohd I update my message but I advise you to read the help ?scan – agstudy Dec 05 '12 at 00:17
  • I can change the "what= integer() or numeric()" but the problem will be with header of the column ????? – SimpleNEasy Dec 05 '12 at 00:47
  • Thank you I just saw the update. However, the code starts to skip lines. This is the output: [1] 200.0 19.0 0.1 [1] 300.0 29.0 0.1 [1] 600.0 80.0 0.1 [1] 190.0 33.0 0.1 [1] 900.0 73.0 0.1 – SimpleNEasy Dec 05 '12 at 01:28
0

I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs.

disk.frame::csv_to_disk.frame might be the function you want.
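A minimal sketch of the chunked approach, assuming the chunked package (and its dplyr dependency) is installed; "file.csv", the chunk size, and the A > 500 filter are placeholders for your own file and analysis:

```r
library(chunked)  # assumption: install.packages("chunked") has been run
library(dplyr)

# Stream "file.csv" in chunks of 5000 rows, keep rows where A > 500,
# and write the result, without ever loading the whole file into memory
read_csv_chunkwise("file.csv", chunk_size = 5000) %>%
  filter(A > 500) %>%
  write_csv_chunkwise(file = "filtered.csv")
```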

xiaodai