I want to read parts from a large (ca. 11 GB) binary file. The currently working solution is to load the entire file ( raw_data
) with fread()
, then crop out pieces of interest ( data
).
Question: Is there a faster method of reading small (1-2% of total file, partially sequential reads) parts of a file, given something like a binary mask (i.e. a logical index of specific bytes of interst) in Matlab? Specifics below.
Notes for my specific case:
data
of interest (26+e6 bytes, or ca. 24 MB) is roughly 2% ofraw_data
(1.2e+10 bytes or ca. 11 GB)- each 600.000 bytes contain ca 6.500 byte reads, which can be broken down to roughly 1.200 read-skip cycles (such as 'read 10 bytes, skip 5000 bytes').
- the read instructions of the total file can be broken down in ca 20.000 similar but (not exactly identical) read-skip cycles (i.e. ca. 20.000x1.200 read-skip cycles)
- The file is read from a GPFS (parallel file system)
- Excessive RAM, newest Matlab ver and all toolboxes are available for the task
My initial idea of fread-fseek cycle proved to be extrodinarily much slower (see psuedocode below) than reading the whole file. Profiling revealed fread()
is slowest (being called over a million times probably obvious to the experts here).
Alternatives I considered: memmapfile()
[ ref ] has no feasible read multiple small parts as far as I could find. The MappedTensor library might be the next thing I'd look into. Related but didn't help, just to link to article: 1, 2.
%open file
fi=fopen('data.bin');
%example read-skip data
f_reads = [20 10 6 20 40]; %read this number of bytes
f_skips = [900 6000 40 300 600]; %skip these bytes after each read instruction
data = []; %save the result here
fseek(fi,90000,'bof'); %skip initial bytes until first read
%read the file
for ind=1:nbr_read_skip_cylces-1
tmp_data = fread(fi,f_reads(ind));
data = [data; tmp_data]; %add newly read bytes to data variable
fseek(fi,f_skips(ind),'cof'); %skip to next read position
end
FYI: To get an overview and for transparency, I've compiled some plots (below) of the first ca 6.500 read locations (of my actual data) that, after collapsing into fread-fseek pairs can, can be summarized in 1.200 fread-fseek pairs.