I'm trying to analyze a very large file using textscan in MATLAB. The file in question is about 12 GB in size and contains about 250 million lines, each with seven floating-point numbers delimited by whitespace. Because this obviously would not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time. According to the documentation, this should allow for processing "arbitrarily large delimited text file[s]". However, it only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).
To debug, I attempted to go to several positions in the file using the fseek function, for example like this:
fileInfo = dir(fileName);
fid = fopen(fileName);                         %// open the file for reading
fseek(fid, floor(fileInfo.bytes/10), 'bof');   %// jump to about 10% of the file
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');
I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I replace fileInfo.bytes/10 with fileInfo.bytes/2 (i.e. move the indicator to about 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.
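For anyone who wants to reproduce the seek test, a minimal sketch of how it could be checked more explicitly follows. It is not the code I ran on the real file: it assumes a small temporary stand-in file, and the names target, actual and status are only illustrative. It checks the status returned by fseek, compares ftell with the requested offset, and uses fgetl to throw away the (probably partial) line the indicator lands on before calling textscan.

```matlab
% Hypothetical stand-in for the real 12 GB file: a small temporary file
% with seven space-delimited floats per line (1000 lines).
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%f %f %f %f %f %f %f\n', rand(7, 1000));
fclose(fid);

fileInfo = dir(fileName);
target   = floor(fileInfo.bytes / 2);    % requested byte offset (about 50%)

fid = fopen(fileName, 'r');
status = fseek(fid, target, 'bof');      % returns 0 on success, -1 on failure
if status ~= 0
    error('fseek failed: %s', ferror(fid));
end
actual = ftell(fid);                     % where the indicator actually is
fprintf('requested %d, got %d\n', target, actual);

fgetl(fid);                              % discard the rest of the current line
data = textscan(fid, '%f %f %f %f %f %f %f', 10, 'Delimiter', ' ');  % 10 rows
fclose(fid);
delete(fileName);
```

On the real file, a non-zero status or a mismatch between target and actual would at least show whether the seek itself is what fails at the larger offsets.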
I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!
EDIT
The relevant part of my code used to look like this:
while ~feof(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
    %// do some stuff
end
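Since textscan returning empty cells never sets the end-of-file flag, a loop like the one above can spin forever once the reads start failing. A minimal sketch of the same loop with a guard is shown here. It assumes a small temporary stand-in file and a hypothetical block size, and the names nRead and totalRows are only illustrative. It stops at the first empty block and prints the byte position from ftell, which on the real file would show where the reads stop.

```matlab
% Hypothetical stand-in for the real file: 1000 lines of seven floats.
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%f %f %f %f %f %f %f\n', rand(7, 1000));
fclose(fid);

FormatString = '%f %f %f %f %f %f %f';
nLines = 300;                            % block size (assumed for illustration)

fid = fopen(fileName, 'r');
totalRows = 0;
while ~feof(fid)
    data  = textscan(fid, FormatString, nLines, 'Delimiter', ' ');
    nRead = numel(data{1});              % rows actually read in this block
    if nRead == 0
        % Nothing was read although feof is still false: report and stop.
        fprintf('textscan returned nothing at byte %d\n', ftell(fid));
        break
    end
    totalRows = totalRows + nRead;
    % ... process the block here ...
end
fclose(fid);
delete(fileName);
fprintf('read %d rows in total\n', totalRows);
```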
First I tried fixing it using ftell and fseek, as suggested by Hoki below. This gave exactly the same error as before: MATLAB was unable to read more than approximately 43% of the file. Then I tried the HeaderLines solution (also suggested below), like this:
i = 0;
while ~feof(fid)
    frewind(fid);
    data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
    %// do some stuff
    i = i + 1;
end
This seems to read in the data without producing errors; it is, however, incredibly slow.
I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail, for some reason I don't understand yet.