I have a very large data file (several gigabytes); if I try to open it directly in R, I get an out-of-memory error.
I need to read the file line by line and do some analysis on each line. I found a previous question on this issue where the file is read n lines at a time and you can jump to a given chunk ("clump"). I used the answer by "Nick Sabbe" and added some modifications to fit my needs.
Consider the following test.csv file, a small sample of the real file:
A B C
200 19 0.1
400 18 0.1
300 29 0.1
800 88 0.1
600 80 0.1
150 50 0.1
190 33 0.1
270 42 0.1
900 73 0.1
730 95 0.1
I want to read the content of the file line by line and perform my analysis, so I created the following loop based on the code posted by "Nick Sabbe". I have two problems: 1) the header is printed again every time I print a new line; 2) the index column ("X") that R adds is also printed, even though I delete it.
Here is the code I'm using:
test <- function() {
  prev <- 0
  for (i in 1:100) {
    j <- i - prev                         # lines to read this pass (always 1 here)
    test1 <- read.clump("file.csv", j, i) # read clump i, consisting of j lines
    print(test1)
    prev <- i
  }
}
####################
# Code by Nick Sabbe
####################
read.clump <- function(file, lines, clump, readFunc = read.csv,
                       skip = (lines * (clump - 1)) +
                         ifelse((header) & (clump > 1) & (!inherits(file, "connection")), 1, 0),
                       nrows = lines, header = TRUE, ...) {
  if (clump > 1) {
    colnms <- NULL
    if (header) {
      # re-read the first row of the file to recover the column names
      colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
      # print(colnms)
    }
    p <- readFunc(file, skip = skip, nrows = nrows, header = FALSE, ...)
    if (!is.null(colnms)) {
      colnames(p) <- colnms
    }
  } else {
    p <- readFunc(file, skip = skip, nrows = nrows, header = header)
  }
  p$X <- NULL # Note: here I'm dropping the index column that read.csv names "X"
  return(p)
}
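If I follow the skip arithmetic correctly, with lines = 1 the default skip for clump i works out to 1*(i - 1) + 1 = i, so each clump lands on the i-th data row. For example, on the test.csv above:

# lines = 1, clump = 3: default skip = 1*(3-1) + 1 = 3, so the header
# row and the first two data rows are skipped and only the third data
# row (300 29 0.1) is read back
read.clump("test.csv", lines = 1, clump = 3)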
The output I'm getting:
A B C
1 200 19 0.1
NA 1 1 1
1 2 400 18 0.1
NA 1 1 1
1 3 300 29 0.1
NA 1 1 1
1 4 800 88 0.1
NA 1 1 1
1 5 600 80 0.1
For every read after the first, I want to get rid of this extra line:
NA 1 1 1
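My guess is that this row comes from how colnms is built: with header = FALSE the header row comes back as factors (plus an NA for the unnamed index column), and unlist() then collapses the factors to their integer codes, giving NA 1 1 1. Would forcing that row to character be the right fix? Something like this (untested, and assuming readFunc is read.csv so it accepts stringsAsFactors):

# Untested guess: read the header row as plain character so unlist()
# keeps the names instead of collapsing factors to their codes
colnms <- unlist(readFunc(file, nrows = 1, header = FALSE,
                          stringsAsFactors = FALSE))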
Also, is there any way to make the for loop stop at the end of the file, like EOF in other languages?
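The closest I could get is the sketch below: read.csv() raises an error ("no lines available in input") once skip moves past the last line of the file, so I wrap the call in tryCatch() and break out of the loop on an error or an empty chunk. The function name test_eof and the fixed chunk size of 1 are just my choices; is there a cleaner, built-in way?

# Sketch of an EOF-style stop: treat a read past the end of the file
# (which makes read.csv() throw an error) as end-of-file and break
test_eof <- function(file) {
  i <- 1
  repeat {
    chunk <- tryCatch(read.clump(file, lines = 1, clump = i),
                      error = function(e) NULL)   # past EOF -> NULL
    if (is.null(chunk) || nrow(chunk) == 0) break # stop like EOF
    print(chunk)
    i <- i + 1
  }
}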