
NOTE: This question deals with an issue observed back in 2011 with an old MATLAB version (R2009b). As per the update below from July 2016, the issue/bug in MATLAB seems to no longer exist (tested with R2016a; scroll down to the end of the question to see the update).

I am using MATLAB R2009b and need to write a fairly large script that converts the contents of a large set of .zip files to v7.3 .mat files (with the underlying HDF5 data model). Reading is fine. The issue is with saving. And there is actually no problem: my files save nicely using the save command.

My question is rather: why am I observing the following surprising (to me) behavior in MATLAB?

Let's look at my issue in general. In the current test scenario I will be generating one output: a -v7.3 .mat file. This .mat file will contain 40 blocks as individual variables, named block_NNN with NNN running from 001 to 040. Each variable contains a struct with the fields frames and blockNo. The field frames contains a 480x240x65 sequence of uint8 image data (here just random data generated using randi), and blockNo contains the block number.

Remark: In the real script (which I have yet to finish) I will be doing the above a total of 370 times, converting a total of 108 GB of raw data, which is why I am concerned about the following.

Anyway, first I define some general variables:

% some sizes for dummy data and loops:
num_blockCount = 40;
num_blockLength = 65;
num_frameHeight = 480;
num_frameWidth = 240;

I then generate some dummy data that is identical in shape and size to the actual raw data:

% generate empty struct:
stu_data2disk = struct();

% loop over blocks:
for num_k = 1:num_blockCount

   % generate block-name:
   temp_str_blockName = sprintf('block_%03u', num_k);

   % generate temp struct for current block:
   temp_stu_value = struct();
   temp_stu_value.frames = randi( ...
      [0 255], ...
      [num_frameHeight num_frameWidth num_blockLength], ...
      'uint8' ...
   );
   temp_stu_value.blockNo = num_k;

   % using dynamic field names:
   stu_data2disk.(temp_str_blockName) = temp_stu_value;

end

I now have all my random test-data in a struct stu_data2disk. Now I would like to save the data using one of two possible methods.

Let's try the simple one first:

% save data (simple):
disp('Save data the simple way:')
tic;
save converted.mat -struct stu_data2disk -v7.3;
toc;

The file is written without problems (286MB). The output is:

Save data the simple way:
Elapsed time is 14.004449 seconds.

OK, then I remembered that I would like to follow the progress of the save procedure over the 40 blocks. So instead of the above I loop over the blocks and append them in sequence:

% save to file, using append:
disp('Save data using -append:')
tic;
for num_k = 1:num_blockCount

   % generate block-name:
   temp_str_blockName = sprintf('block_%03u', num_k);

   temp_str_appendToggle = '';
   if (num_k > 1)
      temp_str_appendToggle = '-append';
   end

   % generate save command:
   temp_str_saveCommand = [ ...
      'save ', ...
      'converted_append.mat ', ...
      '-struct stu_data2disk ', temp_str_blockName, ' '...
      temp_str_appendToggle, ' ', ...
      '-v7.3', ...
      ';' ...
   ];

   % evaluate save command:
   eval(temp_str_saveCommand);

end
toc;

And again the file saves nicely (286MB). The output is:

Save data using -append:
Elapsed time is 0.956968 seconds.

Interestingly, the append method is much faster. My question is: why?
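
(As an aside: the same append loop can be written without eval by using the functional form of save. This is just a sketch of the equivalent calls, not the code that produced the timings above.)

% sketch: same append loop using the functional form of save (no eval needed)
for num_k = 1:num_blockCount
   temp_str_blockName = sprintf('block_%03u', num_k);
   if (num_k == 1)
      save('converted_append.mat', '-struct', 'stu_data2disk', ...
         temp_str_blockName, '-v7.3');
   else
      save('converted_append.mat', '-struct', 'stu_data2disk', ...
         temp_str_blockName, '-append', '-v7.3');
   end
end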

Output from dir converted*.mat:

09-02-2011  20:38       300,236,392 converted.mat
09-02-2011  20:37       300,264,316 converted_append.mat
               2 File(s)    600,500,708 bytes

The files are not identical in size. And a test with fc under Windows 7 revealed ... well, many binary differences. Perhaps the data was shifted a bit, so this tells us nothing.

Does anyone have an idea what is going on here? Does the appended file perhaps use a much more optimized data structure? Or has Windows cached the file, making access to it much faster?

I made the effort of test-reading from the two files as well. Without presenting the numbers here: the appended version was a little bit faster (which could mean something in the long run, though).
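
(For completeness, the read test was essentially just a timed load of each file; a minimal sketch, assuming the two file names from above:)

% sketch of the read test: a timed load of each file
tic; stu_simple = load('converted.mat');        toc;
tic; stu_append = load('converted_append.mat'); toc;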

[EDIT]: I just tried using no format flag (defaults to -v7 on my system) and there is not much difference anymore:

Save data the simple way (-v7):
Elapsed time is 13.092084 seconds.
Save data using -append (-v7):
Elapsed time is 14.345314 seconds.

[EDIT]: I corrected the above mistake. Previously I mentioned that the stats were for -v6 but I was mistaken. I had just removed the format flag and assumed the default was -v6 but actually it is -v7.

I have created new test stats for all formats on my system using Andrew's fine framework (all formats are for the same random test data, now read from file):

15:15:51.422: Testing speed, format=-v6, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:16:00.829: Save the simple way:            0.358 sec
15:16:01.188: Save using multiple append:     7.432 sec
15:16:08.614: Save using one big append:      1.161 sec

15:16:24.659: Testing speed, format=-v7, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:16:33.442: Save the simple way:           12.884 sec
15:16:46.329: Save using multiple append:    14.442 sec
15:17:00.775: Save using one big append:     13.390 sec

15:17:31.579: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
15:17:40.690: Save the simple way:           13.751 sec
15:17:54.434: Save using multiple append:     3.970 sec
15:17:58.412: Save using one big append:      6.138 sec

And the sizes of the files:

10-02-2011  15:16       299,528,768 converted_format-v6.mat
10-02-2011  15:16       299,528,768 converted_append_format-v6.mat
10-02-2011  15:16       299,528,832 converted_append_batch_format-v6.mat
10-02-2011  15:16       299,894,027 converted_format-v7.mat
10-02-2011  15:17       299,894,027 converted_append_format-v7.mat
10-02-2011  15:17       299,894,075 converted_append_batch_format-v7.mat
10-02-2011  15:17       300,236,392 converted_format-v7.3.mat
10-02-2011  15:17       300,264,316 converted_append_format-v7.3.mat
10-02-2011  15:18       300,101,800 converted_append_batch_format-v7.3.mat
               9 File(s)  2,698,871,005 bytes

Thus -v6 seems to be the fastest for writing. There are also no large differences in file sizes. HDF5 does have a basic deflate (compression) filter built in, as far as I know.

Hmm, probably some optimization in the underlying HDF5 write functions?

Currently I still think that some underlying fundamental HDF5 write function is optimized for adding datasets to an HDF5 file (which is what happens when adding new variables to a -v7.3 file). I believe I have read somewhere that HDF5 should be optimized in this very way... though I cannot be sure.
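
(One way to check this directly is to look at the HDF5 layer of the two files. The following is only a sketch: h5info ships with newer MATLAB releases (R2011a and later), and the internal path /block_001/frames is my assumption of how save lays out a struct variable; on R2009b the external tool "h5dump -p -H converted.mat" gives the same information, as Andrew does in the comments.)

% sketch: check whether the "frames" datasets carry a compression filter
for temp_c = {'converted.mat', 'converted_append.mat'}
   temp_info = h5info(temp_c{1}, '/block_001/frames');
   if isempty(temp_info.Filters)
      fprintf('%s: frames stored uncompressed\n', temp_c{1});
   else
      fprintf('%s: filter "%s" applied\n', temp_c{1}, temp_info.Filters(1).Name);
   end
end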

Other details to note:

The behavior is very systematic, as we see in Andrew's answer below. It also seems to matter quite a bit whether you run these things in the local scope of a function or in the "global" scope of an m-script. My first results were from an m-script where files were written to the current directory. I can still only reproduce the 1-second write for -v7.3 in the m-script; the function calls apparently add some overhead.

Update July 2016:

I found this again and thought I might test it with the newest MATLAB available to me at the moment. With MATLAB R2016a on Windows 7 x64 the problem seems to have been fixed:

14:04:06.277: Testing speed, imax=255, R2016a on PCWIN64, arch=AMD64, 16 GB, os=Microsoft Windows 7 Enterprise  Version 6.1 (Build 7601: Service Pack 1)
14:04:10.600: basic -v7.3:                    7.599 sec      5.261 GB used
14:04:18.229: basic -v7.3:                    7.894 sec      5.383 GB used
14:04:26.154: basic -v7.3:                    7.909 sec      5.457 GB used
14:04:34.096: basic -v7.3:                    7.919 sec      5.498 GB used
14:04:42.048: basic -v7.3:                    7.886 sec      5.516 GB used     286 MB file   7.841 sec mean
14:04:50.581: multiappend -v7.3:              7.928 sec      5.819 GB used
14:04:58.544: multiappend -v7.3:              7.905 sec      5.834 GB used
14:05:06.485: multiappend -v7.3:              8.013 sec      5.844 GB used
14:05:14.542: multiappend -v7.3:              8.591 sec      5.860 GB used
14:05:23.168: multiappend -v7.3:              8.059 sec      5.868 GB used     286 MB file   8.099 sec mean
14:05:31.913: bigappend -v7.3:                7.727 sec      5.837 GB used
14:05:39.676: bigappend -v7.3:                7.740 sec      5.879 GB used
14:05:47.453: bigappend -v7.3:                7.645 sec      5.884 GB used
14:05:55.133: bigappend -v7.3:                7.656 sec      5.877 GB used
14:06:02.824: bigappend -v7.3:                7.963 sec      5.871 GB used     286 MB file   7.746 sec mean

This was tested with Andrew Janke's reproMatfileAppendSpeedup function in the accepted answer below (5 passes with format 7.3). Now -append is as slow as, or slower than, a single save - as it should be. Perhaps it was a problem with an early build of the HDF5 driver used in R2009b.

Ole Thomsen Buus
  • Surprising. I'd think -append would be slower (see http://stackoverflow.com/questions/4268044/deleting-variables-from-a-mat-file/4270936). Local or network filesystem? Try watching both save operations in Process Explorer or Process Monitor to see how much IO (bytes and operations in and out) each is doing. – Andrew Janke Feb 09 '11 at 21:36
  • Could well be caching. Or maybe -append causes Matlab to open the file for RW in a different fashion (e.g. mem-mapped) which is faster in your case. Or maybe writing the 300MB file in one large chunk interacts with buffers or something and makes it slower. As another test, what happens if you save a trivially small variable to a file, and then append all of your chunks in a single call to "save -append"? And how fast is saving each chunk to a separate file? That might help differentiate between the -append mode and chunk sizes. – Andrew Janke Feb 09 '11 at 21:45
  • @Andrew: Thanks for your comments. It is surprising :) But then again I am happy I took the time writing the loop that did it faster. I will see if I get the time tomorrow to make those tests. Bedtime for me though :) – Ole Thomsen Buus Feb 09 '11 at 22:08
  • My guess is that it's the compression phase that is reduced when appending. When appending, Matlab probably compresses each block independently and maybe it's quicker to compress 40 small blocks than 1 large block of the same total size. – grantnz Feb 09 '11 at 22:10
  • @grantnz: Separate saves aren't supposed to affect compression. Even when not appending, Matlab compresses each variable separately (and by "variable" I think they actually mean each individual primitive array). See the MAT file format reference doc at http://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf, section 1-11. But the "-v6" option turns off compression, and his -v6 test was slower, so could be. @Ole, how big are the -v6 files? – Andrew Janke Feb 09 '11 at 22:44
  • I misspoke - looks like the v7.3 HDF5 format doesn't do compression, and that matfile doco isn't relevant at all, except to explain why there's no difference when doing -v6. Probably some HDF5 optimization like Ole suspects, or a difference in how save() is calling it. – Andrew Janke Feb 09 '11 at 23:49
  • @grantnz: Actually I misspoke as well. I edited my mistake - the stats mentioned for -v6, were for -v7 instead (see the edit above). The save for -v6 is indeed very fast for "the simple method". – Ole Thomsen Buus Feb 10 '11 at 14:49
  • Just a detail regarding compression or no-compression. When I think about it such details become moot when using "Uniformly distributed pseudorandom integers" as generated by **randi**. As far as I understand, such data has high entropy and will get a low compression. Could be more interesting to try out my real image data - which is not all noise :) Though I still think that HDF5 involves compression as well: http://www.mathworks.com/support/solutions/en/data/1-4FLGXO/index.html?product=ML&solution=1-4FLGXO http://www.hdfgroup.org/doc_resource/SZIP/ – Ole Thomsen Buus Feb 10 '11 at 15:14
  • You're right, I erred again: v7.3 mat files do use the HDF5 format's own compression. I confirmed it by running the HDF5 "h5dump -p -H" utility on one of these converted.mat files, and the "frames" datasets show "FILTERS { COMPRESSION DEFLATE { LEVEL 3 } }". The entropy issue hadn't occurred to me though. – Andrew Janke Feb 10 '11 at 21:41
  • Just updated the title of this question. It is more specific now. – Ole Thomsen Buus Feb 12 '11 at 16:15

3 Answers


Holy cow. I can reproduce. Tried the single-append variation too; it's even speedier. Looks like "-append" just magically makes HDF5-based save() 30x faster. I don't have an explanation but I wanted to share what I found.

I wrapped up your test code in a function, refactoring it to make the save logic agnostic about the test data structure so you can run it on other data sets, and added some more diagnostic output.

Don't see the big speedup everywhere. It's huge on my 64-bit XP box and a 32-bit Server 2003 box, big on my 64-bit Windows 7 box, nonexistent on a 32-bit XP box. (Though multiple appends are a huge loss on Server 2003.) R2010b is slower in many cases. Maybe HDF5 appends or save's use of it just rock on newer Windows builds. (XP x64 is actually the Server 2003 kernel.) Or maybe it's just a machine config difference. There's a fast RAID on the XP x64 machine, and the 32-bit XP has less RAM than the rest. What OS and architecture are you running? Can you try this repro too?

19:36:40.289: Testing speed, format=-v7.3, R2009b on PCWIN64, arch=AMD64, os=Microsoft(R) Windows(R) XP Professional x64 Edition 5.2.3790 Service Pack 2 Build 3790
19:36:55.930: Save the simple way:           11.493 sec
19:37:07.415: Save using multiple append:     1.594 sec
19:37:09.009: Save using one big append:      0.424 sec


19:39:21.681: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft Windows XP Professional 5.1.2600 Service Pack 3 Build 2600
19:39:37.493: Save the simple way:           10.881 sec
19:39:48.368: Save using multiple append:    10.187 sec
19:39:58.556: Save using one big append:     11.956 sec


19:44:33.410: Testing speed, format=-v7.3, R2009b on PCWIN64, arch=AMD64, os=Microsoft Windows 7 Professional  6.1.7600 N/A Build 7600
19:44:50.789: Save the simple way:           14.354 sec
19:45:05.156: Save using multiple append:     6.321 sec
19:45:11.474: Save using one big append:      2.143 sec


20:03:37.907: Testing speed, format=-v7.3, R2009b on PCWIN, arch=x86, os=Microsoft(R) Windows(R) Server 2003, Enterprise Edition 5.2.3790 Service Pack 2 Build 3790
20:03:58.532: Save the simple way:           19.730 sec
20:04:18.252: Save using multiple append:    77.897 sec
20:05:36.160: Save using one big append:      0.630 sec

This looks huge. If it holds up on other data sets, I might use this trick in a lot of places myself. It may be something to bring up with MathWorks, too. Could they use the fast append technique in normal saves or other OS versions, too?

Here's the self-contained repro function.

function out = reproMatfileAppendSpeedup(nPasses, tests, imax, formats)
%REPROMATFILEAPPENDSPEEDUP Show how -append makes v7.3 saves much faster
%
% Examples:
% reproMatfileAppendSpeedup()
% reproMatfileAppendSpeedup(2, [], 0, {'7.3','7','6'}); % low-entropy test

if nargin < 1 || isempty(nPasses);  nPasses = 1;  end
if nargin < 2 || isempty(tests);    tests = {'basic','multiappend','bigappend'}; end
if nargin < 3 || isempty(imax);     imax = 255; end
if nargin < 4 || isempty(formats);  formats = '7.3'; end % -v7 and -v6 do not show the speedup
tests = cellstr(tests);
formats = cellstr(formats);

fprintf('%s: Testing speed, imax=%d, R%s on %s\n',...
    timestamp, imax, version('-release'), systemDescription());

tempDir = setupTempDir();
testData = generateTestData(imax);

testMap = struct('basic','saveSimple', 'multiappend','saveMultiAppend', 'bigappend','saveBigAppend');

for iFormat = 1:numel(formats)
    format = formats{iFormat};
    formatFlag = ['-v' format];
    %fprintf('%s: Format %s\n', timestamp, formatFlag);
    for iTest = 1:numel(tests)
        testName = tests{iTest};
        saveFcn = testMap.(testName);
        te = NaN(1, nPasses);
        for iPass = 1:nPasses
            fprintf('%s: %-30s', timestamp, [testName ' ' formatFlag ':']);
            t0 = tic;
            matFile = fullfile(tempDir, sprintf('converted-%s-%s-%d.mat', testName, format, iPass));
            feval(saveFcn, matFile, testData, formatFlag);
            te(iPass) = toc(t0);
            if iPass == nPasses
                fprintf('%7.3f sec      %5.3f GB used   %5.0f MB file   %5.3f sec mean\n',...
                    te(iPass), physicalMemoryUsed/(2^30), getfield(dir(matFile),'bytes')/(2^20), mean(te));
            else
                fprintf('%7.3f sec      %5.3f GB used\n', te(iPass), physicalMemoryUsed/(2^30));
            end
        end
        % Verify data to make sure we are sane
        gotBack = load(matFile);
        gotBack = rmfield(gotBack, intersect({'dummy'}, fieldnames(gotBack)));
        if ~isequal(gotBack, testData)
            fprintf('ERROR: Loaded data differs from original for %s %s\n', formatFlag, testName);
        end
    end
end

% Clean up
rmdir(tempDir, 's');

%%
function saveSimple(file, data, formatFlag)
save(file, '-struct', 'data', formatFlag);

%%
function out = physicalMemoryUsed()
if ~ispc
    out = NaN;
    return; % memory() only works on Windows
end
[u,s] = memory();
out = s.PhysicalMemory.Total - s.PhysicalMemory.Available;

%%
function saveBigAppend(file, data, formatFlag)
dummy = 0;
save(file, 'dummy', formatFlag);
fieldNames = fieldnames(data);
save(file, '-struct', 'data', fieldNames{:}, '-append', formatFlag);

%%
function saveMultiAppend(file, data, formatFlag)
fieldNames = fieldnames(data);
for i = 1:numel(fieldNames)
    if (i > 1); appendFlag = '-append'; else; appendFlag = ''; end
    save(file, '-struct', 'data', fieldNames{i}, appendFlag, formatFlag);
end


%%
function testData = generateTestData(imax)
nBlocks = 40;
blockSize = [65 480 240];
for i = 1:nBlocks
    testData.(sprintf('block_%03u', i)) = struct('blockNo',i,...
        'frames', randi([0 imax], blockSize, 'uint8'));
end

%%
function out = timestamp()
%TIMESTAMP Showing timestamps to make sure it is not a tic/toc problem
out = datestr(now, 'HH:MM:SS.FFF');

%%
function out = systemDescription()
if ispc
    platform = [system_dependent('getos'),' ',system_dependent('getwinsys')];
elseif ismac
    [fail, input] = unix('sw_vers');
    if ~fail
        platform = strrep(input, 'ProductName:', '');
        platform = strrep(platform, sprintf('\t'), '');
        platform = strrep(platform, sprintf('\n'), ' ');
        platform = strrep(platform, 'ProductVersion:', ' Version: ');
        platform = strrep(platform, 'BuildVersion:', 'Build: ');
    else
        platform = system_dependent('getos');
    end
else
    platform = system_dependent('getos');
end
arch = getenv('PROCESSOR_ARCHITEW6432');
if isempty(arch)
    arch = getenv('PROCESSOR_ARCHITECTURE');
end
try
    [~,sysMem] = memory();
catch
    sysMem.PhysicalMemory.Total = NaN;
end
out = sprintf('%s, arch=%s, %.0f GB, os=%s',...
    computer, arch, sysMem.PhysicalMemory.Total/(2^30), platform);

%%
function out = setupTempDir()
out = fullfile(tempdir, sprintf('%s - %s', mfilename, datestr(now, 'yyyymmdd-HHMMSS-FFF')));
mkdir(out);

EDIT: I modified the repro function, adding multiple iterations and parameterizing it for save styles, file formats, and imax for the randi generator.

I think filesystem caching is a big factor in the fast -append behavior. When I do a bunch of runs in a row with reproMatfileAppendSpeedup(20) and watch System Information in Process Explorer, most of them are under a second, and physical memory usage quickly ramps up by a couple GB. Then every dozen passes, the write stalls and takes 20 or 30 seconds, and physical RAM usage slowly ramps back down to about where it started. I think this means that Windows is caching a lot of writes in RAM, and something about -append makes it more willing to do so. But the amortized time including those stalls is still a lot faster than the basic save, for me.

By the way, after doing multiple passes for a couple hours, I'm having a hard time reproducing the original timings.

Andrew Janke
  • @Andrew: Thank you for this. This is very helpful. I reproduced the output from your function on my system (Win7 Pro, 32-bit). See the edit of my original question. Note: I disabled the random-data generator - instead I write the same random data from memory for all formats. Could be very interesting to hear from someone working at MathWorks - anyone here at stackoverflow perhaps? Perhaps the person directly responsible for the **save** command :) – Ole Thomsen Buus Feb 10 '11 at 14:45
  • @AndrewJanke: +1 very interesting comparison – Amro Feb 10 '11 at 16:33
  • Note that `memory()` is unavailable on Mac. I ran your script on a `MACI64` system (10.6) and see similar results. Results: basic/multiappend/bigappend -> 12.7/3.6/3.1. – mbauman Feb 11 '11 at 17:44
  • @Andrew: I just edited a tiny bit in your fine work (many thanks for the effort and interest). (Hope it is OK with the edit. I am still working out the mindset behind the stackexchange site-model, but I hope I got the idea.) Anyway, you understandably add a "dummy" variable to the "bigappend" version. This is not strictly necessary, as you see in the edit above (in function "saveBigAppend"). Otherwise your script would have crashed in the read-test if one does not use "bigappend" in the argument _tests_. Btw, how come the script only makes a read-test on the result of "bigappend"? – Ole Thomsen Buus Feb 11 '11 at 20:29
  • @Ole: No prob with the edit, but I don't see the changes - sure you hit the right button? Anyway, I see the issue you're pointing out. Added a test for dummy's existence. The read-test really should be on each save method; I only did it at the end because I didn't want to clutter up the output with a bunch of extra lines. Fixed by doing on each save method and only emitting if it looks bad. Tightened up output format. – Andrew Janke Feb 11 '11 at 21:12
  • @Andrew: Ahh, my edit did not go through - need 2000 rep to edit others' posts... Anyway, instead of adding the "dummy" variable, just do the following: `save(file, '-struct', 'data', fieldNames{1}, formatFlag); save(file, '-struct', 'data', fieldNames{2:end}, '-append', formatFlag);` – Ole Thomsen Buus Feb 11 '11 at 21:14
  • @Matt B.: Thanks. Added an ispc() guard to the memory() call. Interesting that this carries over to another OS; I was wondering how much was Windows-specific I/O and disk caching behavior. Now I wonder how much is compression time. Do you have h5dump? Can you tell if the v7.3 files produced with -append are uncompressed like they seem to be for me? Compressed files will have "FILTERS { ... DEFLATE ... }" in the frames dataset; uncompressed will have "FILTERS { NONE }". – Andrew Janke Feb 11 '11 at 21:18
  • 1
    I've made the system description platform independent. Also, I've reproduced the test on OSX 64-bit with default settings and found that R2011a is a tiny bit faster than 2010b, but there is consistent speedup (16/6/5 with 3 runs and default values - hmm, maybe I need a faster computer). All files are 286MB, so I assume that there is no compression. Also: +1 for writing the test function. – Jonas Feb 13 '11 at 04:29
  • @Andrew, @Matt and everyone else: I am working hard on my script for extracting, parsing, processing the raw image data. I think I will try to run the script (when it's done) twice: with no append, and with block-based append (65-frames in each block). Perhaps a third time using "bigappend". I will be returning to this question with the results. Probably next week. Very interesting to see how things look with real data. – Ole Thomsen Buus Feb 14 '11 at 10:49
  • @Jonas - Nice portability fix. +1 for showing me something I'll be using in my work code. – Andrew Janke Feb 14 '11 at 15:42
  • @Andrew Janke: I've copied this from my [new function template](http://www.mathworks.com/matlabcentral/fileexchange/27132-automatic-template-for-new-functions), for which I lifted it from the `ver` command. – Jonas Feb 14 '11 at 15:46
  • @AndrewJanke I never got around to fiddling more with this issue. But thanks for the interest back then. I got my files converted, got them analysed and everything is now going into my PhD thesis. Procrastination is a beautiful thing. – Ole Thomsen Buus Jun 20 '12 at 08:16

Just an update in case it is useful to others. I found MATLAB bug 784028, which shows that the lack of compression when using -append is fixed from R2012a. From some testing on my system this is indeed the case: compression happens for variables >10000 bytes, with or without use of -append, and never happens for smaller variables.

Unfortunately, the flip side of this is that there appears to be no way at all to control the use of compression with -v7.3 files.
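
(If controlling compression matters more than keeping a file that load() understands, one possible workaround is to bypass the MAT format entirely and write the arrays with MATLAB's high-level HDF5 functions, where the deflate level is an explicit parameter. A sketch only; the file name, dataset path and chunk size below are illustrative choices, and the result is a plain HDF5 file, not a MAT file.)

% sketch: write one block directly to HDF5 with an explicit deflate level
frames = randi([0 255], [480 240 65], 'uint8');
h5create('block_001.h5', '/block_001/frames', size(frames), ...
    'Datatype', 'uint8', 'ChunkSize', [480 240 1], 'Deflate', 3);
h5write('block_001.h5', '/block_001/frames', frames);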

robince

The experiments done by @AndrewJanke are very interesting. One thing to remember is that the three MAT-file formats you compare are quite different: v6 is uncompressed, v7 is compressed, while v7.3 is also compressed but uses a completely different implementation (HDF5 standard format vs. custom MATLAB-optimized format).

As to the save-all-vars-at-once vs. append-one-var-at-a-time comparison, I am also surprised by the results...

Amro
  • @Amro: OK - v6 is uncompressed. Missed that detail. Okay, then if you look at the file sizes that I mention in the [edit] of the question, you can see they are all around 285-286 MiB. This says to me that I need to find the real image data and try it on that (see my comment on the question above regarding compression of uniformly distributed random data). – Ole Thomsen Buus Feb 10 '11 at 17:43
  • 1
    @OleThomsenBuus: to illustrate the effect of compression, replace the RANDI call in @AndrewJanke code with: `ones(sz,'uint8')` (in `generateTestData()` function). The result (in terms of file size) is that v6 is unaffected, v7 is the smallest, followed by the HDF5-format v7.3 since it has a larger storage overhead (refer to this question: http://stackoverflow.com/questions/4950630/matlab-differences-between-mat-versions/4959325#4959325) – Amro Feb 10 '11 at 18:13
  • 1
    @OleThomsenBuus: In fact, if you try the modification I suggested for the "append-cases", we can observe that for v7.3 format, compression occurs for each variable by itself (as it didn't detect the repetitions across variables), while in the case of v7, the file is significantly smaller thanks to compression of all of its contents upon writing... – Amro Feb 10 '11 at 18:22
  • @Amro - good point. I modified the repro function to parameterize randi; if you set imax to 0 you'll get all zeroes, which is equivalent to the ones call. Strangely, even with all-0s, the HDF5 file doesn't seem to be getting any compression at all out of it; it's still 286 MB and the HDF5 header shows a compression ratio of 1.000:1 in h5dump. – Andrew Janke Feb 10 '11 at 22:38
  • @Andrew: I tried Amro's idea and got 1 MiB, 279 MiB, and 286 MiB for simple, append, and batch append, respectively. The times were: 0.72, 0.68, and 4.41 s. I expect the "randomness" of my image data to be somewhere between all 0s and uniform pseudorandom noise. It seems that only the "simple" method gives the expected compression of all-zeros. Perhaps there will be nothing to gain from appending blocks then: I might end up with larger .mat-files than necessary. Hmm... – Ole Thomsen Buus Feb 11 '11 at 09:20
  • OMG, I didn't even notice the file sizes were varying with the save method. I get almost the same results: 2 MiB, 279 MiB, 286 MiB. I modified the repro function to report file size per save type. H5dump on the "-append" files says the "frames" datasets have "FILTERS { NONE }", which means they're not being compressed at all. Don't know why. – Andrew Janke Feb 11 '11 at 15:47