
I'm trying to analyze a very large file using textscan in MATLAB. The file is about 12 GB and contains roughly 250 million lines, each with seven floating-point numbers delimited by whitespace. Since this obviously won't fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time; according to the documentation, this should allow for processing "arbitrarily large delimited text file[s]". However, this only lets me scan about 43% of the file, after which textscan starts returning empty cells, despite there still being data left to scan in the file.

To debug, I attempted to go to several positions in the file using the fseek function, for example like this:

fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid, '%f %f %f %f %f %f %f', 'Delimiter', ' ');

I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily put the indicator at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I replace fileInfo.bytes/10 with fileInfo.bytes/2 (i.e. move the indicator to about 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.
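For what it's worth, the "run textscan twice" trick for line alignment can also be done explicitly with fgetl, which reads (and here discards) the remainder of the current line after the seek. A minimal sketch, assuming the same fileName and format as above:

```matlab
fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof'); %// jump to roughly 10% of the file
fgetl(fid);                                  %// discard the (probably partial) current line
data = textscan(fid, '%f %f %f %f %f %f %f', 'Delimiter', ' ');
fclose(fid);
```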

I looked at the file using a text editor for large files, and the entire file looks fine; there should be no reason for textscan to be confused. The only explanation I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!

EDIT

The relevant part of my code used to look like this:

while ~feof(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
        %// do some stuff
end

First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:

i = 0;
while ~feof(fid)
    frewind(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
        %// do some stuff
    i = i + 1;
end

This seems to read in the data without producing errors; it is, however, incredibly slow.

I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but fails for some reason I don't understand yet.

Julius
  • Do you have the ability to change the process that writes the text file and change it to output a binary instead? I know this is not an answer to your problem but it provides an alternate path to get at the information you need. – Matt Aug 19 '15 at 16:55
  • @Matt Unfortunately not, I'm trying to analyze the output of a third-party software package which can only output in plain text. – Julius Aug 20 '15 at 09:36

1 Answer


Using fseek to move a pointer in a file is only good when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length, but on a text file it is more dangerous and confusing than anything else (unless you are absolutely sure that each line is the same size and each element on the line is at the same exact place/column, but that doesn't happen often).

There are several ways to read a text file block by block:

1) Use the HeaderLines option

To simply skip a block of lines on a text file, you can use the HeaderLines parameter of textscan, so for example:

readFormat = '%f %f %f %f %f %f %f' ;   %// read format specifier
nLines = 10000 ;                        %// number of lines to read per block

fileInfo = dir(fileName);

%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' '); %// read the first 10000 lines
fclose(fid);
    %// Now do something with your "M" data

Then when you want to read the second block:

%// later read the SECOND block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid);

And if you have many blocks, for the Nth block, just adapt:

%// and then for the Nth block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', (N-1)*nLines);
fclose(fid);

If necessary (if you have many blocks), just code this last version in a loop.
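A minimal sketch of that loop, assuming a hypothetical block count nBlocks (on the last block, textscan simply returns fewer rows if the file runs out):

```matlab
for N = 1:nBlocks
    fid = fopen(fileName);
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ', ...
                 'HeaderLines', (N-1)*nLines);
    fclose(fid);
    %// process M here
end
```

Note that each pass re-skips all the previous lines, which is why this approach gets slow on large files.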

Note that this works because you close your file after reading each block (so the file pointer starts at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want files that remain open for too long, or to lose the fid if you crash).


2) Read by block (without closing the file)

If the processing of each block is quick and safe enough that you're sure it won't bomb out, you could afford not to close the file. In this case, the textscan file pointer will stay where you stopped, so you could also:

  • read a block (do not close the file): M = textscan(fid, readFormat, nLines)
  • Process it then save your result (and release memory)
  • read the next block with the same call: M = textscan(fid, readFormat, nLines)

In this case you wouldn't need the HeaderLines parameter, because textscan will resume reading exactly where it stopped.
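Put together, a sketch of this second approach (same readFormat and nLines as above) could look like:

```matlab
fid = fopen(fileName);
while ~feof(fid)
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' '); %// resumes at the last position
    %// process M here; it is overwritten on the next pass, releasing the memory
end
fclose(fid);
```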


3) use ftell and fseek

Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.

ftell returns the current position in an open file, so use it to record where you stopped reading, then use fseek the next time to go straight to that position. Something like:

%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid);

%// do some stuff

%// then read another block:
fid = fopen(fileName);
fseek( fid , lastPosition , 'bof' ) ;
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid);
%// and so on ...
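In a loop, the same idea becomes the following sketch (lastPosition starts at 0, i.e. the beginning of the file, and fileInfo.bytes gives the total size to stop at):

```matlab
fileInfo = dir(fileName);
lastPosition = 0;
while lastPosition < fileInfo.bytes
    fid = fopen(fileName);
    fseek(fid, lastPosition, 'bof');  %// jump to where the previous block ended
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    lastPosition = ftell(fid);        %// remember where this block ended
    fclose(fid);
    %// process M here
end
```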
Hoki
  • Hi Hoki, thanks for your answer! I'll try your `HeaderLines` suggestion, maybe that works. I'd started out trying what you suggested at the end of your answer (scanning and processing blocks by using `textscan` repeatedly), however, this approach caused some malfunction about 43% of the way through processing the file (from then on only empty cells were being read, despite there being data in the file itself). I was hoping to get some clarification on this issue, I'm sorry if this wasn't clear from my original question. – Julius Aug 19 '15 at 14:11
  • 1
    @Julius, you may have one _corrupted_ line somewhere in your file which makes `textscan` fail. Try to use the `'EmptyValue'` and/or `'TreatAsEmpty'` parameters of `textscan`, they could save you stumbling mid way through your file just because a typo is there. – Hoki Aug 19 '15 at 14:27
  • I tried your `HeaderLines` suggestion to look at the block where reading the file fails (I determined this to be at around 43% of the way down). I made sure to be at the beginning of the file using `frewind`, and then scanned 5% of the total number of lines starting from the 40% mark (set using the `HeaderLines` option). This returned a cell array containing the data I expect to see, without there being any empty cells. To me this suggests that it is not a corrupted line causing my problem, but rather that something goes wrong with `textscan` internally. Do you have any thoughts on this? – Julius Aug 20 '15 at 09:33
  • I don't see any reason why `textscan` would fail on reading large blocks of data (apart from memory limitation issues), but I'm not expert enough to know everything about it. The fact that you can read this section of the file properly seems to show that reading the file in blocks works. So implement a solution around that (read the file in 10 blocks of ~10% each, for example). If you really want to get to the bottom of it, you could try splitting your file in 2 parts, then 3 parts, etc., and see if the problem persists. – Hoki Aug 20 '15 at 11:55
  • Bottom line ... I don't think handling a 12 GB text file is a good idea anyway. I would recommend splitting the file into chunks of less than 4 GB each, or more chunks if necessary; just choose a reason for the split that makes sense with the data (some time interval, for example). – Hoki Aug 20 '15 at 11:58