NOTE: This question deals with an issue observed back in 2011 with an old MATLAB version (R2009a). As per the update below from July 2016, the issue/bug in MATLAB seems to no longer exist (tested with R2016a; scroll down to end of question to see update).
I am using MATLAB R2009b and need to write a large script that converts the contents of a larger set of .zip files to v7.3 mat-files (with an underlying HDF5 data model). Reading is OK. The issue is with saving, although strictly speaking there is no problem: my files save nicely using the save command.
My question is rather: why am I observing the following surprising (to me) behavior in MATLAB?
Let's look at my issue in general. In this test scenario I will be generating one output: a -v7.3 mat-file. This .mat-file will contain 40 blocks as individual variables. Each variable will be named "block_NNN", with NNN running from 1 to 40, and will contain a struct with the fields frames and blockNo. The field frames contains a 480x240x65 sequence of uint8 image data (here just random data generated using randi); the field blockNo contains the block number.
Remark: In the real script (which I have yet to finish) I will be doing the above a total of 370 times, converting a total of 108 GB of raw data, which is why I am concerned with the following.
Anyway, first I define some general variables:
    % some sizes for dummy data and loops:
    num_blockCount = 40;
    num_blockLength = 65;
    num_frameHeight = 480;
    num_frameWidth = 240;
I then generate some dummy data with shape and size identical to the actual raw data:
    % generate empty struct:
    stu_data2disk = struct();

    % loop over blocks:
    for num_k = 1:num_blockCount

        % generate block-name:
        temp_str_blockName = sprintf('block_%03u', num_k);

        % generate temp struct for current block:
        temp_stu_value = struct();
        temp_stu_value.frames = randi( ...
            [0 255], ...
            [num_frameHeight num_frameWidth num_blockLength], ...
            'uint8' ...
        );
        temp_stu_value.blockNo = num_k;

        % using dynamic field names:
        stu_data2disk.(sprintf('block_%03u', num_k)) = temp_stu_value;

    end
I now have all my random test-data in a struct stu_data2disk. Now I would like to save the data using one of two possible methods.
Let's try the simple one first:
    % save data (simple):
    disp('Save data the simple way:')
    tic;
    save converted.mat -struct stu_data2disk -v7.3;
    toc;
The file is written without problems (286MB). The output is:
    Save data the simple way:
    Elapsed time is 14.004449 seconds.
OK - then I remembered that I would like to follow the save-procedure over the 40 blocks. Thus instead of the above I loop over the blocks and append them in sequence:
    % save to file, using append:
    disp('Save data using -append:')
    tic;
    for num_k = 1:num_blockCount

        % generate block-name:
        temp_str_blockName = sprintf('block_%03u', num_k);

        temp_str_appendToggle = '';
        if (num_k > 1)
            temp_str_appendToggle = '-append';
        end

        % generate save command:
        temp_str_saveCommand = [ ...
            'save ', ...
            'converted_append.mat ', ...
            '-struct stu_data2disk ', temp_str_blockName, ' ', ...
            temp_str_appendToggle, ' ', ...
            '-v7.3', ...
            ';' ...
        ];

        % evaluate save command:
        eval(temp_str_saveCommand);

    end
    toc;
And again the file saves nicely (286MB). The output is:
    Save data using -append:
    Elapsed time is 0.956968 seconds.
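As an aside, the same loop could also be written with the functional form of save, which takes the file name, options, and variable names as separate string arguments and thus avoids building a command string for eval. The following is just a sketch of the equivalent call, not the code I actually timed:

    % same append-loop using the functional form of save (sketch):
    for num_k = 1:num_blockCount
        temp_str_blockName = sprintf('block_%03u', num_k);
        if (num_k == 1)
            % first block creates the file:
            save('converted_append.mat', '-struct', 'stu_data2disk', ...
                temp_str_blockName, '-v7.3');
        else
            % remaining blocks are appended:
            save('converted_append.mat', '-struct', 'stu_data2disk', ...
                temp_str_blockName, '-v7.3', '-append');
        end
    end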
Interestingly, the append method is much faster. My question is: why?
Output from dir converted*.mat:

    09-02-2011 20:38       300,236,392 converted.mat
    09-02-2011 20:37       300,264,316 converted_append.mat
                   2 File(s)    600,500,708 bytes
The files are not identical in size, and a comparison with fc under Windows 7 revealed ... well, many binary differences. Perhaps the data was just shifted a bit, so this by itself tells us nothing.
Does anyone have an idea what is going on here? Is the appended file perhaps using a much more optimized data structure? Or has Windows cached the file, making access to it much faster?
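One way to check the guess about a more optimized data structure would be to compare the HDF5 layout of the two files from within MATLAB, e.g. with hdf5info (which ships with R2009b; newer releases also have h5info/h5disp). A minimal sketch, not something I have done systematically:

    % compare the HDF5 hierarchy of the two files (sketch):
    info_simple = hdf5info('converted.mat');
    info_append = hdf5info('converted_append.mat');
    % the struct variables appear as groups at the root of the hierarchy:
    disp({info_simple.GroupHierarchy.Groups.Name});
    disp({info_append.GroupHierarchy.Groups.Name});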
I made the effort of test-reading from the two files as well. Without presenting the numbers here: the appended version was a little bit faster (which could mean something in the long run, though).
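For completeness, the read test was essentially of this shape (a sketch, not the exact code I used):

    % time reading both files back and confirm the contents match (sketch):
    tic; stu_read_simple = load('converted.mat'); toc;
    tic; stu_read_append = load('converted_append.mat'); toc;
    isequal(stu_read_simple, stu_read_append)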
[EDIT]: I just tried using no format flag (defaults to -v7 on my system) and there is not much difference anymore:
    Save data the simple way (-v7): Elapsed time is 13.092084 seconds.
    Save data using -append (-v7):  Elapsed time is 14.345314 seconds.
[EDIT]: I corrected the above mistake. Previously I mentioned that the stats were for -v6 but I was mistaken. I had just removed the format flag and assumed the default was -v6 but actually it is -v7.
I have created new test stats for all formats on my system using Andrew's fine framework (all formats are for the same random test data, now read from file):
    15:15:51.422: Testing speed, format=-v6, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
    15:16:00.829: Save the simple way:            0.358 sec
    15:16:01.188: Save using multiple append:     7.432 sec
    15:16:08.614: Save using one big append:      1.161 sec

    15:16:24.659: Testing speed, format=-v7, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
    15:16:33.442: Save the simple way:           12.884 sec
    15:16:46.329: Save using multiple append:    14.442 sec
    15:17:00.775: Save using one big append:     13.390 sec

    15:17:31.579: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional 6.1.7600 N/A Build 7600
    15:17:40.690: Save the simple way:           13.751 sec
    15:17:54.434: Save using multiple append:     3.970 sec
    15:17:58.412: Save using one big append:      6.138 sec
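For reference, my reading of the three variants: "the simple way" and "multiple append" correspond to the two snippets above, while "one big append" saves the first block and then appends all remaining blocks in a single save call. The exact code is in Andrew's answer; the sketch below is my interpretation of it, not his code:

    % 'one big append' (sketch): first block creates the file,
    % all remaining blocks are appended in a single call
    temp_cel_blockNames = arrayfun(@(k) sprintf('block_%03u', k), ...
        1:num_blockCount, 'UniformOutput', false);
    save('converted_append_batch.mat', '-struct', 'stu_data2disk', ...
        temp_cel_blockNames{1}, '-v7.3');
    save('converted_append_batch.mat', '-struct', 'stu_data2disk', ...
        temp_cel_blockNames{2:end}, '-v7.3', '-append');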
And the sizes of the files:
    10-02-2011 15:16       299,528,768 converted_format-v6.mat
    10-02-2011 15:16       299,528,768 converted_append_format-v6.mat
    10-02-2011 15:16       299,528,832 converted_append_batch_format-v6.mat
    10-02-2011 15:16       299,894,027 converted_format-v7.mat
    10-02-2011 15:17       299,894,027 converted_append_format-v7.mat
    10-02-2011 15:17       299,894,075 converted_append_batch_format-v7.mat
    10-02-2011 15:17       300,236,392 converted_format-v7.3.mat
    10-02-2011 15:17       300,264,316 converted_append_format-v7.3.mat
    10-02-2011 15:18       300,101,800 converted_append_batch_format-v7.3.mat
                   9 File(s)  2,698,871,005 bytes
Thus -v6 seems to be the fastest for writing. There are also no large differences in file sizes. HDF5 does have a basic deflate (compression) method built in, as far as I know.
Hmm, probably some optimization in the underlying HDF5-write functions?
Currently I still think that some underlying, fundamental HDF5 write function is optimized for adding datasets to an HDF5 file (which is what happens when new variables are added to a -v7.3 file). I believe I have read somewhere that HDF5 should be optimized in this very way... though I cannot be sure.
Other details to note:
The behavior is very systematic, as we see in Andrew's answer below. It also seems to matter quite a bit whether you run these things in the local scope of a function or in the "global" scope of an m-script. My first results were from an m-script where the files were written to the current directory. I can still only reproduce the 1-second write for -v7.3 in the m-script; the function calls apparently add some overhead.
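To compare the two scopes, one can wrap the timed save in a small function and run it against the same statement executed directly in the m-script. A minimal sketch, using a hypothetical function name local_timeSimpleSave (not code from the actual tests):

    % hypothetical helper: time the "simple" -v7.3 save from function scope
    function time_sec = local_timeSimpleSave(stu_data2disk)
        tic;
        save('converted_fcn.mat', '-struct', 'stu_data2disk', '-v7.3');
        time_sec = toc;
    end

Calling local_timeSimpleSave(stu_data2disk) and comparing the result against tic; save converted.mat -struct stu_data2disk -v7.3; toc; run in the script should show whether the scope alone makes a difference.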
Update July 2016:
I found this again and thought I might test it with the newest MATLAB available to me at the moment. With MATLAB R2016a on Windows 7 x64 the problem seems to have been fixed:
    14:04:06.277: Testing speed, imax=255, R2016a on PCWIN64, arch=AMD64, 16 GB, os=Microsoft Windows 7 Enterprise Version 6.1 (Build 7601: Service Pack 1)
    14:04:10.600: basic -v7.3:        7.599 sec      5.261 GB used
    14:04:18.229: basic -v7.3:        7.894 sec      5.383 GB used
    14:04:26.154: basic -v7.3:        7.909 sec      5.457 GB used
    14:04:34.096: basic -v7.3:        7.919 sec      5.498 GB used
    14:04:42.048: basic -v7.3:        7.886 sec      5.516 GB used   286 MB file   7.841 sec mean
    14:04:50.581: multiappend -v7.3:  7.928 sec      5.819 GB used
    14:04:58.544: multiappend -v7.3:  7.905 sec      5.834 GB used
    14:05:06.485: multiappend -v7.3:  8.013 sec      5.844 GB used
    14:05:14.542: multiappend -v7.3:  8.591 sec      5.860 GB used
    14:05:23.168: multiappend -v7.3:  8.059 sec      5.868 GB used   286 MB file   8.099 sec mean
    14:05:31.913: bigappend -v7.3:    7.727 sec      5.837 GB used
    14:05:39.676: bigappend -v7.3:    7.740 sec      5.879 GB used
    14:05:47.453: bigappend -v7.3:    7.645 sec      5.884 GB used
    14:05:55.133: bigappend -v7.3:    7.656 sec      5.877 GB used
    14:06:02.824: bigappend -v7.3:    7.963 sec      5.871 GB used   286 MB file   7.746 sec mean
This was tested with Andrew Janke's reproMatfileAppendSpeedup function from the accepted answer below (5 passes with format -v7.3). Now -append is just as slow as, or slower than, a single save - as it should be. Perhaps it was a problem with an early build of the HDF5 driver used in R2009a.