I run Matlab R2011b and R version 2.13.1 on Linux Mint v12 with 16 GB of RAM.
I have a csv file. The first five rows (plus the header) are:
#RIC,Date[G],Time[G],GMT Offset,Type,Price,Volume
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.68,1008
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.66,300
DAEG.OQ,07-JUL-2011,15:10:03.424,-4,Trade,1.65,1000
DAEG.OQ,07-JUL-2011,15:10:03.464,-4,Trade,1.65,3180
The file is large (approx 900MB). Given the mix of character and numeric data, one might read this file into Matlab as follows:
fid1 = fopen('/home/MyUserName/Temp/X.csv');
% three string columns (RIC, date, time), offset, type string, price, volume
D = textscan(fid1, '%s%s%s%f%s%f%f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid1);
Although the file is only 900MB, running the above code sends my RAM usage (according to System Monitor) from about 2GB up to 10GB. Worse, if I attempt the same procedure with a slightly larger csv file (about 1.2GB), my RAM maxes out at 16GB and Matlab never manages to finish reading in the data (it just stays stuck in "busy" mode).
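I suspect the %s columns are the main culprit, since each string ends up in its own cell and Matlab adds per-cell bookkeeping on top of the 2-bytes-per-character data. A rough way to gauge that overhead (the million-element array below is purely illustrative):

s = repmat({'DAEG.OQ'}, 1e6, 1);   % a million copies of one 7-character string
w = whos('s');
disp(w.bytes / numel(s))           % bytes per cell: well over 100 on 64-bit Matlab

With four %s columns and millions of rows, 100+ bytes per cell could plausibly add up to several GB, but I'd like to understand what is actually going on.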
If I wanted to read the same file into R, I might use:
D <- read.csv("/home/MyUserName/Temp/X.csv", stringsAsFactors=FALSE)
This takes a bit longer than Matlab, but System Monitor indicates my RAM usage jumps only from 2GB to 3.3GB (much more reasonable given the original file size).
My question has two parts:
1) Why is textscan such a memory hog in this scenario?
2) Is there another approach I could use to get a 1.2GB csv file of this type into Matlab on my system without maxing out the RAM?
EDIT: Just to clarify, I'm curious as to whether there exists a Matlab-only solution, i.e. I'm not interested in a solution that involves using a different language to break the csv file into smaller chunks (that is what I'm already doing). Sorry Trav1s, I should have made this clear from the start.
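To be concrete about what would count as Matlab-only: something like the chunked textscan loop below (just a sketch; the 10000-row chunk size is an arbitrary placeholder) would be perfectly acceptable, if it actually avoids the memory blow-up:

fid1 = fopen('/home/MyUserName/Temp/X.csv');
fgetl(fid1);                                  % consume the header line
while ~feof(fid1)
    % read up to 10000 rows per call; textscan resumes where it left off
    C = textscan(fid1, '%s%s%s%f%s%f%f', 10000, 'Delimiter', ',');
    % ... process or convert the chunk in C here ...
end
fclose(fid1);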