
I need to write a large array of data to disk as fast as possible. From MATLAB I can do that with fwrite:

function writeBinaryFileMatlab(data)
    fid = fopen('file_matlab.bin', 'w');
    fwrite(fid, data, class(data));
    fclose(fid);
end

Now I have to do the same, but from a MEX file called by MATLAB. So I set up a MEX function that can write to file using either fstream or fopen (inspired by the results of this SO post). This is, however, much slower than calling fwrite from MATLAB, as you can see below. Why is this the case, and what can I do to increase my write speed from the MEX function?

#include "mex.h"
#include <cstdint>
#include <iostream>
#include <stdio.h>
#include <fstream>

using namespace std;

void writeBinFile(int16_t *data, size_t size)
{
    FILE *fID;
    fID = fopen("file_fopen.bin", "wb");
    fwrite(data, sizeof(int16_t), size, fID);
    fclose(fID);
}

void writeBinFileFast(int16_t *data, size_t size)
{
    ofstream file("file_ostream.bin", std::ios::out | std::ios::binary);
    file.write((char *)&data[0], size * sizeof(int16_t));
    file.close();
}

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    const mxArray *mxPtr = prhs[0];
    size_t nelems = mxGetNumberOfElements(mxPtr);
    int16_t *ptr = (int16_t *)mxGetData(mxPtr);
#ifdef USE_OFSTREAM
    writeBinFileFast(ptr, nelems);
#else
    writeBinFile(ptr, nelems);
#endif
}

Then I check the performance using the following script:

mex -R2018a -Iinclude CXXFLAGS="$CXXFLAGS -O3" -DUSE_OFSTREAM main.cpp -output writefast_ofstream
mex -R2018a -Iinclude CXXFLAGS="$CXXFLAGS -O3" main.cpp -output writefast_fwrite

for k = 1:10
    sizeBytes = 2^k * 1024 * 1024;
    fprintf('Generating data of size %i MB\n', sizeBytes / 2^20)
    M = sizeBytes / 2; % 2 bytes for an int16
    sizeMB(k) = sizeBytes / 2^20;
    data = int16(rand(M, 1) * 100);

    fprintf('TESTING: write matlab\n')
    t_matlab(k) = timeit(@() writeBinaryFileMatlab(data));

    fprintf('TESTING: write ofstream\n')
    t_ofstream(k) = timeit(@() writefast_ofstream(data), 0);

    fprintf('TESTING: write fwrite\n')
    t_fwrite(k) = timeit(@() writefast_fwrite(data), 0);
end

% and plot result
figure(14); clf;
plot((sizeMB), t_matlab)
hold on
plot((sizeMB), t_ofstream)
plot((sizeMB), t_fwrite)
legend('Matlab', 'ofstream', 'fwrite')
xticks(sizeMB)

Which gives me the plot below. Why is calling fwrite from MATLAB so much faster than doing it from MEX? How can I reach the same speed in my MEX function?

I am using Windows 10. Laptop with Core i7, SSD.


UPDATE

I have tried various suggestions in the comments, but still do not reach MATLAB's fwrite performance. See the repo with the source code here: https://github.com/rick3rt/saveBinaryDataMex

This is the result with MSVC 2017, incorporating the suggestion of rahnema1:

[Figure: write times with MSVC 2017, including rahnema1's suggestion]

UPDATE 2

Wow, I finally got something that's faster than MATLAB! Rahnema1's answer did the trick :) Here are the figures with all suggested methods combined (the complete source can be found on GitHub).

[Figure: write times for all suggested methods combined]

rinkert
  • This is extremely weird. I would be surprised if MATLAB didn’t use the same `fopen` and `fwrite` under the hood as you do here. – Cris Luengo Nov 26 '21 at 15:36
  • @CrisLuengo yeah I thought so as well, maybe I use a suboptimal compiler for my system? – rinkert Nov 26 '21 at 15:39
  • Oh, are you on Windows? I bet on Windows `fwrite` is not a direct system call, but implemented in terms of some Windows API. – Cris Luengo Nov 26 '21 at 15:44
  • Did you try with the MSVC compiler? – Cris Luengo Nov 26 '21 at 15:48
  • I didnt have it installed, its installing now, will test later. – rinkert Nov 26 '21 at 15:52
  • Try to invert the order of the calls to `writeBinaryFileMatlab`, `writefast_ofstream` and `writefast_fwrite` in your test. I bet you will observe different timings. My hypothesis is that you are bounded by the OS cache, which might have more free space on first write than on successive ones. – prapin Nov 26 '21 at 18:53
  • And find out, if the files are flushed to disk at the end of the calls. – Sebastian Nov 27 '21 at 11:23
  • On Windows this may provide higher performance: `HANDLE file = CreateFileA("file_createfile.bin", GENERIC_WRITE, 0, CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL);WriteFile(file, data, sizeof(int16_t) * size, NULL, NULL);CloseHandle(file);` – rahnema1 Nov 27 '21 at 15:04
  • @prapin Thanks for the suggestion, however, that does not change the results. Under the hood, `timeit` already executes the called function multiple times, so I guess for the larger file sizes caching happens equally for all methods. – rinkert Nov 29 '21 at 11:36
  • @CrisLuengo changing to MSVC++ 2017 did not really affect the results, matlab still is a lot faster – rinkert Nov 29 '21 at 11:37
  • @Sebastian How to make sure that happens? – rinkert Nov 29 '21 at 11:38
  • @rahnema1 thanks for the suggestion, its not faster unfortunately... BTW I had to add another NULL after the 0, so: `HANDLE file = CreateFileA(fname, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_FLAG_SEQUENTIAL_SCAN, NULL);` – rinkert Nov 29 '21 at 12:27
  • Did you try `FILE_FLAG_NO_BUFFERING` or `FILE_FLAG_WRITE_THROUGH` for the `dwFlagsAndAttributes`? "If FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING are both specified, so that system caching is not in effect, then the data is immediately flushed to disk without going through the Windows system cache. The operating system also requests a write-through of the hard disk's local hardware cache to persistent media." – Cris Luengo Nov 29 '21 at 18:59
  • You can also try adding something like `setbuf( fID, NULL )` or `setvbuf( fID, NULL, _IONBF, 0 )` after `FILE *fID = fopen("file_fopen.bin", "wb");` and see how disabling buffering of the stream changes the results. And/or using `setvbuf()` to set a much larger buffer, such as 1 MB. – Andrew Henle Nov 29 '21 at 19:43
  • Here (https://social.msdn.microsoft.com/Forums/en-US/042c3ef8-742d-4eda-9c74-018516d86b8c/can-multiple-threads-concurrently-write-to-the-same-file-but-different-areas?forum=parallelcppnative) are some more ideas: write to the file from multiple threads, or use memory-mapped I/O. – Sebastian Nov 30 '21 at 07:03
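
The stream-buffering suggestion in the comments above (`setvbuf` with no buffer, or with a much larger one such as 1 MB) could be tried with something like the following. This is only a minimal sketch: `writeWithStreamBuffer` and its parameters are hypothetical names, not part of the original code, and whether any buffer size actually helps has to be measured on the system in question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper (not from the original post): write `count` int16
// values through a stdio stream with a caller-chosen buffer size.
// bufBytes == 0 disables stream buffering entirely (_IONBF).
bool writeWithStreamBuffer(const char *path, const int16_t *data,
                           std::size_t count, std::size_t bufBytes)
{
    assert(path != nullptr && data != nullptr);
    std::FILE *f = std::fopen(path, "wb");
    if (!f) return false;

    // setvbuf has to be called after fopen and before any other I/O on the stream.
    std::vector<char> buf(bufBytes);
    int rc = (bufBytes == 0)
        ? std::setvbuf(f, nullptr, _IONBF, 0)
        : std::setvbuf(f, buf.data(), _IOFBF, buf.size());
    if (rc != 0) { std::fclose(f); return false; }

    std::size_t written = std::fwrite(data, sizeof(int16_t), count, f);

    // fclose flushes the stream; buf stays alive until after the stream is closed.
    return (std::fclose(f) == 0) && (written == count);
}
```

Calling it once with `bufBytes = 0` and once with `bufBytes = 1 << 20` from the MEX function would give two more curves to compare against MATLAB's fwrite.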

2 Answers


[This is a partial answer only, unfortunately.]

This is a Windows problem. I tried reproducing your results on macOS, and found a different, interesting behavior. I modified your code to distinguish between the C fwrite and the C++ std::fwrite, and I added code to write using the lower-level POSIX write.

This is the C++ code:

#include "mex.h"
#include <stdio.h>
#include <cstdio>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

void writeBinFile_c(int16_t *data, std::size_t size)
{
    ::FILE *fID = ::fopen("file_c.bin", "wb");
    ::fwrite(data, sizeof(int16_t), size, fID);
    ::fclose(fID);
}

void writeBinFile_std(int16_t *data, std::size_t size)
{
    std::FILE *fID = std::fopen("file_std.bin", "wb");
    std::fwrite(data, sizeof(int16_t), size, fID);
    std::fclose(fID);
}

void writeBinFile_unix(int16_t *data, std::size_t size)
{
    int fID = open("file_unix.bin", O_CREAT|O_WRONLY|O_TRUNC, 0644); // O_CREAT requires a mode argument
    ::write(fID, data, sizeof(int16_t) * size);
    ::close(fID);
}

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    const mxArray *mxPtr = prhs[0];
    std::size_t nelems = mxGetNumberOfElements(mxPtr);
    int16_t *ptr = (int16_t *)mxGetData(mxPtr);
    double mode = -1;
    if (nrhs > 1) {
      mode = mxGetScalar(prhs[1]);
    }
    if (mode == 0) {
       writeBinFile_c(ptr, nelems);
    } else if (mode == 1) {
       writeBinFile_std(ptr, nelems);
    } else if (mode == 2) {
       writeBinFile_unix(ptr, nelems);
    } else {
       mexErrMsgTxt("Wrong mode!");
    }
}

This is the MATLAB code:

mex -R2018a -Iinclude CXXFLAGS="$CXXFLAGS -O3" writefast.cpp

N = 10;
sizeMB = zeros(1,N);
t_matlab = zeros(1,N);
t_fwrite_c = zeros(1,N);
t_fwrite_std = zeros(1,N);
t_unix = zeros(1,N);
for k = 1:N
    sizeBytes = 2^k * 1024 * 1024;
    fprintf('Generating data of size %i MB\n', sizeBytes / 2^20)
    M = sizeBytes / 2; % 2 bytes for an int16
    sizeMB(k) = sizeBytes / 2^20;
    data = int16(rand(M, 1) * 100);

    fprintf('TESTING: matlab\n')
    t_matlab(k) = timeit(@() writeBinaryFileMatlab(data));

    fprintf('TESTING: ::fwrite\n')
    t_fwrite_c(k) = timeit(@() writefast(data, 0), 0);

    fprintf('TESTING: std::fwrite\n')
    t_fwrite_std(k) = timeit(@() writefast(data, 1), 0);

    fprintf('TESTING: Unix write\n')
    t_unix(k) = timeit(@() writefast(data, 2), 0);
end

% and plot result
figure
plot((sizeMB), t_matlab)
hold on
plot((sizeMB), t_fwrite_c)
plot((sizeMB), t_fwrite_std)
plot((sizeMB), t_unix)
legend('Matlab', 'C std lib', 'C++ Std lib', 'Unix')
xticks(sizeMB)
set(gca,'xscale','log','yscale','log')

function writeBinaryFileMatlab(data)
    fid = fopen('file_matlab.bin', 'w');
    fwrite(fid, data, class(data));
    fclose(fid);
end

These are the outputs for two runs:

[Figures: timing plots for run 1 and run 2]

Note how the timings are consistent up to 64 MB, and then diverge. At 128 MB and up, each call takes long enough that timeit runs the function only once in its inner loop, so you see the median of individual run times rather than an average over multiple runs, as you do at 64 MB and below. So for 128 MB and above the times flip between two different values, which may be an effect of caching. But in different runs, different methods come out slower or faster, so it is clear to me that they all do the same thing.

So, on macOS, there is no difference between MATLAB's fwrite and the C library fwrite. What you saw must be a Windows issue.

And I am pretty certain this has to do with caching, because:

  • This post on Undocumented MATLAB talks about the performance of fwrite, and how, by default, MATLAB flushes the cache after every call to fwrite. This is not relevant here, because there is only one call to fwrite. But the post indicates that the MATLAB function handles the cache differently than the C library's.

  • The C library fwrite is specified to behave as if it calls fputc for each byte to be written. It probably doesn't actually do that, but this might be an indication of what is going wrong on Windows. Note that on Windows, both the MSVC and the MinGW compilers link against a Microsoft C runtime (MinGW traditionally against msvcrt, recent MSVC versions against the Universal CRT). The problem is likely there, and MATLAB is likely not using it for writing to file.

Cris Luengo
  • Thanks for testing! It is indeed a Windows problem: I just tested it on a CentOS cluster, and all results were comparable to MATLAB's fwrite, just like your results. Interesting behaviour of timeit; I guess writing a for loop might be better when testing this kind of operation. – rinkert Nov 30 '21 at 15:27
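
The explicit loop mentioned in the comment above could look roughly like this when the timing is done on the C++ side. It is a sketch only: `medianSeconds` is a hypothetical helper, not something from the original benchmark, and it simply runs the given callable a fixed number of times regardless of how long each run takes.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: run `fn` exactly `reps` times and return the median
// wall-clock duration of a single run, in seconds.
template <typename Fn>
double medianSeconds(Fn fn, int reps)
{
    assert(reps > 0);
    std::vector<double> t;
    t.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        t.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(t.begin(), t.end());
    return t[t.size() / 2]; // for an even number of runs this is the upper median
}
```

For example, `medianSeconds([&] { writeBinFile_c(ptr, nelems); }, 5)` would time five writes of the same data. Because the repetition count is fixed, every file size is measured the same way, at the cost of longer benchmark runs for large files.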

As indicated in some posts, very large buffers tend to decrease performance, so the function below writes the buffer to the file part by part. For me, a part size of 8 MiB gives the best performance.

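#include <windows.h>
#include <cstddef>
#include <cstdint>

void writeBinFilePartByPart(int16_t *int_data, size_t size)
{
  const size_t part = 8 * 1024 * 1024; // part size in bytes (8 MiB)

  size = size * sizeof(int16_t); // from here on, size is in bytes

  char *data = reinterpret_cast<char *> (int_data);

  HANDLE file = CreateFileA (
    "windows_test.bin",
    GENERIC_WRITE,
    0,
    NULL,
    CREATE_ALWAYS,
    FILE_FLAG_SEQUENTIAL_SCAN,
    NULL);
  if (file == INVALID_HANDLE_VALUE)
    return;

  // Expand the file to its final size up front. SetFilePointerEx takes a
  // 64-bit offset, so this also works for files of 2 GiB and larger.
  LARGE_INTEGER pos;
  pos.QuadPart = static_cast<LONGLONG> (size);
  SetFilePointerEx (file, pos, NULL, FILE_BEGIN);
  SetEndOfFile (file);
  pos.QuadPart = 0;
  SetFilePointerEx (file, pos, NULL, FILE_BEGIN);

  DWORD written;
  if (size < part)
    {
      WriteFile (file, data, static_cast<DWORD> (size), &written, NULL);
      CloseHandle (file);
      return;
    }

  // Write all full parts first, then whatever is left over
  size_t rem = size % part;
  for (size_t i = 0; i < size-rem; i += part)
    {
      WriteFile (file, data+i, static_cast<DWORD> (part), &written, NULL);
    }

  if (rem)
    WriteFile (file, data+size-rem, static_cast<DWORD> (rem), &written, NULL);

  CloseHandle (file);
}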

The output is compared to the C++ Std lib method mentioned by @Cris Luengo:

[Figure: write times of the part-by-part method compared to the C++ Std lib method]
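
For systems without the Windows API, the same part-by-part idea can be sketched with `std::fwrite`. This is a hypothetical portable analogue, not the code that produced the timings above: `writeInChunks` and its `chunkBytes` parameter are illustrative names, the 8 MiB default is simply carried over from the Windows version, and the best size on another platform would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical portable analogue of the part-by-part write: hand the data to
// std::fwrite in pieces of at most chunkBytes instead of in one single call.
bool writeInChunks(const char *path, const int16_t *data, std::size_t count,
                   std::size_t chunkBytes = 8u * 1024u * 1024u)
{
    assert(chunkBytes > 0);
    std::FILE *f = std::fopen(path, "wb");
    if (!f) return false;

    const char *bytes = reinterpret_cast<const char *>(data);
    std::size_t remaining = count * sizeof(int16_t); // total size in bytes
    bool ok = true;
    while (ok && remaining > 0) {
        // The last piece may be shorter than chunkBytes.
        std::size_t n = remaining < chunkBytes ? remaining : chunkBytes;
        ok = (std::fwrite(bytes, 1, n, f) == n);
        bytes += n;
        remaining -= n;
    }
    // Always close the file, even if a write failed.
    return (std::fclose(f) == 0) && ok;
}
```

Unlike the Windows version, this sketch does not pre-size the file, and the C library may still apply its own stream buffering on top of the chunking.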

rahnema1
  • Interesting! Using your function with a buffer of 8 MiB is faster than MATLAB's `fwrite`, as you can see in my plot. Any idea what MATLAB is doing under the hood? Does it use a different buffer size than the default? And what are the other solutions using when it's not specified, a much larger buffer size? – rinkert Nov 30 '21 at 15:32
  • Actually I tested different sizes. I tested different file systems (NTFS/FAT32/exFAT) on a virtual RAM disk, and NTFS on a real hard disk. In all of them it seems that 8 MiB is the optimal size, and it is unrelated to disk fragmentation. I think it relates to some internal system buffer sizes and caching, and to the algorithm that Windows is using. – rahnema1 Nov 30 '21 at 16:18
  • I believe that the function MATLAB is using should not be lower level than WriteFile. MATLAB may use a different size for writing to the disk, and may use FILE_FLAG_RANDOM_ACCESS instead of FILE_FLAG_SEQUENTIAL_SCAN in the call to CreateFile to support more general random-access uses. Solutions other than ``ofstream`` possibly just use a single call to WriteFile, and ofstream depends on the library implementation, which may behave differently between GCC/Clang and VS. – rahnema1 Nov 30 '21 at 16:19