5

I have a very large data file (gigabytes in size). If I try to open it with R, I get an out-of-memory error.

I need to read the file line by line and do some analysis. I found a previous question on this issue where the file was read n lines at a time, jumping to a given clump of lines. I have used the answer by "Nick Sabbe" and added some modifications to fit my needs.

Consider the following test.csv file, a sample of the full file:

A    B   C
200  19  0.1
400  18  0.1
300  29  0.1
800  88  0.1
600  80  0.1
150  50  0.1
190  33  0.1
270  42  0.1
900  73  0.1

I want to read the content of the file line by line and perform my analysis. So I created the following loop, based on the code posted by "Nick Sabbe". I have two problems: 1) the header is printed each time I print a new line; 2) the index "X" column added by R is also printed, even though I am deleting this column.

Here is the code I'm using:

test <- function() {
  prev <- 0
  for (i in 1:100) {
    j <- i - prev
    test1 <- read.clump("file.csv", j, i)
    print(test1)
    prev <- i
  }
}
####################
# Code by Nick Sabbe
####################
read.clump <- function(file, lines, clump, readFunc = read.csv,
                       skip = (lines * (clump - 1)) +
                         ifelse((header) & (clump > 1) & (!inherits(file, "connection")), 1, 0),
                       nrows = lines, header = TRUE, ...) {
  if (clump > 1) {
    colnms <- NULL
    if (header) {
      colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
      # print(colnms)
    }
    p <- readFunc(file, skip = skip, nrows = nrows, header = FALSE, ...)
    if (!is.null(colnms)) {
      colnames(p) <- colnms
    }
  } else {
    p <- readFunc(file, skip = skip, nrows = nrows, header = header)
  }
  p$X <- NULL   # Note: here I'm setting the index column to NULL
  return(p)
}

The output I'm getting:

       A       B    C
1      200      19   0.1
  NA   1       1     1
1  2   400     18   0.1
  NA   1       1    1
1  3   300     29   0.1
  NA   1       1    1
1  4   800     88   0.1
  NA   1       1    1
1  5   600     80   0.1

This is what I want to get rid of for the rest of the reads:

 NA   1       1     1

Also, is there any way to make the for loop stop at the end of the file, like an EOF check in other languages?

SimpleNEasy
  • This seems incredibly inefficient. Is it absolutely essential that you do this line by line and using `for` iterators? Surely you can make life easier by using vectorized computation in R? – n.e.w Dec 04 '12 at 21:32

2 Answers

5

Maybe something like this can help you:

inputFile <- "foo.txt"
con  <- file(inputFile, open = "r")
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  myLine <- unlist((strsplit(oneLine, ",")))
  print(myLine)
} 
close(con)
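Note that readLines returns character strings, so the split fields need to be converted before doing numeric work. A self-contained sketch of the same loop (it first writes a small sample file like the one in the question, then computes per-column sums as a stand-in for your analysis):

```r
# Write a small sample file like the one in the question,
# so this sketch is self-contained
writeLines(c("A,B,C",
             "200,19,0.1",
             "400,18,0.1"), "test_sample.csv")

con <- file("test_sample.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # read the header once
sums <- numeric(length(header))
while (length(oneLine <- readLines(con, n = 1)) > 0) {
  vals <- as.numeric(strsplit(oneLine, ",")[[1]])    # character -> numeric
  sums <- sums + vals                                # example per-line analysis
}
close(con)
names(sums) <- header
print(sums)
```

The while condition doubles as the EOF check: readLines returns a zero-length vector at the end of the connection, which stops the loop.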

Or use scan to avoid splitting, as @MatthewPlourde suggested.

I use scan: I skip the header, and quiet = TRUE so as not to get a message saying how many items have been read.

con <- file(inputFile, open = "r")
readLines(con, n = 1)  # consume the header line once, outside the loop
while (length(myLine <- scan(con, what = numeric(), nlines = 1,
                             sep = ",", quiet = TRUE)) > 0) {
   ## here I print, but you should process your line here
   print(myLine)
}
close(con)

(Using what = numeric() makes scan return numbers directly instead of strings, and skipping the header once before the loop avoids skip = 1 silently discarding a line on every iteration.)
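A self-contained version of the scan loop, using a sample file like the one in the question (the running sum of column A is just a placeholder analysis):

```r
# Write a sample file like the one in the question, then scan it line by line
writeLines(c("A,B,C", "200,19,0.1", "400,18,0.1", "300,29,0.1"),
           "scan_sample.csv")

con <- file("scan_sample.csv", open = "r")
header <- scan(con, what = character(), nlines = 1, sep = ",", quiet = TRUE)
total <- 0
while (length(myLine <- scan(con, what = numeric(), nlines = 1,
                             sep = ",", quiet = TRUE)) > 0) {
  names(myLine) <- header           # attach column names to the row vector
  total <- total + myLine["A"]      # example analysis: running sum of column A
}
close(con)
print(total)                        # 200 + 400 + 300 = 900
```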
agstudy
  • +1 This is the way I do it. You could use `scan` instead of `readLines`, and avoid splitting. – Matthew Plourde Dec 04 '12 at 21:53
  • @MatthewPlourde I updated my answer, but I wonder if readLines is more efficient. – agstudy Dec 04 '12 at 22:12
  • I have tried the answer. I have noticed that the results are string type, not numeric type. Also, how can I get rid of the read message (Read 3 items)? Sample of output: [1] "A " "B" "C" Read 3 items [1] "200" "19" "0.1" Read 3 items. Numbers are between " " with read statistics. Thanks for your response... – SimpleNEasy Dec 04 '12 at 23:57
  • @Eng.Mohd I update my message but I advise you to read the help ?scan – agstudy Dec 05 '12 at 00:17
  • I can change the "what= integer() or numeric()" but the problem will be with header of the column ????? – SimpleNEasy Dec 05 '12 at 00:47
  • Thank you I just saw the update. However, the code starts to skip lines. This is the output: [1] 200.0 19.0 0.1 [1] 300.0 29.0 0.1 [1] 600.0 80.0 0.1 [1] 190.0 33.0 0.1 [1] 900.0 73.0 0.1 – SimpleNEasy Dec 05 '12 at 01:28
0

I suggest you check out chunked and disk.frame. They both have functions for reading in CSVs.

disk.frame::csv_to_disk.frame might be the function you want.
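A minimal sketch of the chunked approach, assuming the chunked package (and its dplyr dependency) is installed; "file.csv", the chunk size, and the A > 500 filter are placeholders for your own file and analysis:

```r
library(chunked)  # assumption: install.packages("chunked") has been run
library(dplyr)

# Stream "file.csv" in chunks of 5000 rows, keep rows where A > 500,
# and write the result, without ever loading the whole file into memory
read_csv_chunkwise("file.csv", chunk_size = 5000) %>%
  filter(A > 500) %>%
  write_csv_chunkwise(file = "filtered.csv")
```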

xiaodai