5

I receive about 3.6 GB of data per second in memory, and I need to write it to my SSD continuously. I used CrystalDiskMark to test my SSD's write speed; it is almost 6 GB per second, so I had thought this task would not be that hard.

![my SSD test result][1]

  [1]: https://plus.google.com/u/0/photos/photo/106876803948041178149/6649598887699308850?authkey=CNbb5KjF8-jxJQ "test result"

My machine runs Windows 10, and I am using Visual Studio 2017 Community.

I found this question and tried the highest-voted answer. Unfortunately, the write speed was only about 1 s/GB with its option_2, far slower than what CrystalDiskMark measured. Then I tried memory mapping; this time writing became faster, about 630 ms/GB, but still much slower. Then I tried multi-threaded memory mapping: with 4 threads the speed was about 350 ms/GB, and when I increased the number of threads the write speed did not go up any further.

Code for memory mapping:

#include <fstream>
#include <chrono>
#include <vector>
#include <cstdint>
#include <numeric>
#include <random>
#include <algorithm>
#include <iostream>
#include <cassert>
#include <thread>
#include <windows.h>
#include <sstream>


// Generate random data
std::vector<int> GenerateData(std::size_t bytes) {
    assert(bytes % sizeof(int) == 0);
    std::vector<int> data(bytes / sizeof(int));
    std::iota(data.begin(), data.end(), 0);
    std::shuffle(data.begin(), data.end(), std::mt19937{ std::random_device{}() });
    return data;
}

// Memory mapping
int map_write(int* data, int size, int id) {
    char name[100];
    sprintf_s(name, sizeof(name), "D:\\data_%d.bin", id);
    HANDLE hFile = CreateFileA(name, GENERIC_READ | GENERIC_WRITE, 0, NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        return -1;
    }

    DWORD dwFileSize = size;

    char rname[100];
    sprintf_s(rname, sizeof(rname), "data_%d.bin", id);

    HANDLE hFileMap = CreateFileMappingA(hFile, NULL, PAGE_READWRITE, 0, dwFileSize, rname); // create the mapping, sized to this chunk
    if (hFileMap == NULL) {
        CloseHandle(hFile);
        return -2;
    }

    PVOID pvFile = MapViewOfFile(hFileMap, FILE_MAP_WRITE, 0, 0, 0); // map the file into the address space
    if (pvFile == NULL) {
        CloseHandle(hFileMap);
        CloseHandle(hFile);
        return -3;
    }

    memcpy(pvFile, data, dwFileSize); // memory copy into the mapped view

    UnmapViewOfFile(pvFile);
    CloseHandle(hFileMap);
    CloseHandle(hFile);

    return 0;
}

// Multi-thread memory mapping
void Mem2SSD_write(int* data, int size){
    int part = size / sizeof(int) / 4;

    int index[4];

    index[0] = 0;
    index[1] = part;
    index[2] = part * 2;
    index[3] = part * 3;

    std::thread ta(map_write, data + index[0], size / 4, 10);
    std::thread tb(map_write, data + index[1], size / 4, 11);
    std::thread tc(map_write, data + index[2], size / 4, 12);
    std::thread td(map_write, data + index[3], size / 4, 13);

    ta.join();
    tb.join();
    tc.join();
    td.join();
}

//Test:
int main() {
    const std::size_t kB = 1024;
    const std::size_t MB = 1024 * kB;
    const std::size_t GB = 1024 * MB;

    for (int i = 0; i < 10; ++i) {
        std::vector<int> data = GenerateData(1 * GB);
        auto startTime = std::chrono::high_resolution_clock::now();
        Mem2SSD_write(&data[0], 1 * GB);
        auto endTime = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
        std::cout << "1G writing cost: " << duration << " ms" << std::endl;
    }

    system("pause");
    return 0;
}

So I'd like to ask: is there any faster way to write huge files in C++? Or, why can't I write as fast as CrystalDiskMark's measured speed? How does CrystalDiskMark write?

Any help would be greatly appreciated. Thank you!

Chuang Men
  • 371
  • 3
  • 10
  • 1
  • Thread creation, memory allocation, and opening/closing the file all cost time. You should measure the time from the start of the memcpy, then wait for the system to `sync` the written data. What I suspect costs a lot is page faults, or kernel management of the memory. I don't know Windows, but without any doubt there are system calls to inform the kernel how you are going to use the memory and when it is appropriate to sync the writes to disk. – Oliv Jan 23 '19 at 08:21
  • I would look at the Windows API to try to find anything that gives you more manual control of memory and the kernel's manipulation of it. I'm assuming that page faults and the kernel are slowing you down by performing checks and moving things around. – Prodigle Jan 23 '19 at 09:24
  • In option_2 the benchmark also measures the time to open and close the file. This depends on the filesystem, not just on how fast your hardware writes bytes. I suppose CrystalDiskMark just measures the time to write the bytes: a typical user wants to see a number as near as possible to the theoretical speed of his disk when he runs that kind of benchmark... – L.C. Jan 23 '19 at 11:58
  • Also, I guess it matters that the file is written in a continuous set of disk sectors and that no other process is using the disk. Last, but not least, did you disable "on access scan" of Windows Defender? – L.C. Jan 23 '19 at 12:12

3 Answers

5

First of all, this is not a C++ question but an OS-related one. To get maximum performance you need to use OS-specific low-level API calls, which do not exist in general C++ libraries. From your code it is clearly visible that you use the Windows API, so search for a Windows solution at minimum.

From the CreateFileW documentation:

> When FILE_FLAG_NO_BUFFERING is combined with FILE_FLAG_OVERLAPPED, the flags give maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager.

So we need to use the combination of these 2 flags in the CreateFileW call, or FILE_NO_INTERMEDIATE_BUFFERING in the NtCreateFile call.

Extending the file size and the valid data length also takes some time, so it is better if the final file size is known at the beginning: just set the final size via NtSetInformationFile with FileEndOfFileInformation (or via SetFileInformationByHandle with FileEndOfFileInfo), and then set the valid data length with SetFileValidData (or via NtSetInformationFile with FileValidDataLengthInformation). Setting the valid data length requires the SE_MANAGE_VOLUME_NAME privilege to be enabled when the file is initially opened (but not when SetFileValidData itself is called).

Also look out for file compression: if the file is compressed (which it will be by default if it is created in a compressed folder), writing is very slow. So disable file compression via FSCTL_SET_COMPRESSION.

Then, since we use asynchronous I/O (the fastest way), we do not need to create several dedicated threads. Instead we need to determine the number of I/O requests to run concurrently. If you use CrystalDiskMark, it actually runs CdmResource\diskspd\diskspd64.exe for the test, and this corresponds to its -o&lt;count&gt; parameter (run diskspd64.exe /? > h.txt to see the parameter list).

Using non-buffered I/O makes the task harder, because there are 3 additional requirements:

  1. Any ByteOffset passed to WriteFile must be a multiple of the sector size.
  2. The Length passed to WriteFile must be an integer multiple of the sector size.
  3. Buffers must be aligned in accordance with the alignment requirement of the underlying device. To obtain this information, call NtQueryInformationFile with FileAlignmentInformation or GetFileInformationByHandleEx with FileAlignmentInfo.

In most situations page-aligned memory will also be sector-aligned, because the case where the sector size is larger than the page size is rare.

So buffers allocated with the VirtualAlloc function, in multiples of the page size (4,096 bytes), are almost always OK. In the concrete test below I rely on this assumption to keep the code smaller.

struct WriteTest 
{
    enum { opCompression, opWrite };

    struct REQUEST : IO_STATUS_BLOCK 
    {
        WriteTest* pTest;
        ULONG opcode;
        ULONG offset;
    };

    LONGLONG _TotalSize, _BytesLeft;
    HANDLE _hFile;
    ULONG64 _StartTime;
    void* _pData;
    REQUEST* _pRequests;
    ULONG _BlockSize;
    ULONG _ConcurrentRequestCount;
    ULONG _dwThreadId;
    LONG _dwRefCount;

    WriteTest(ULONG BlockSize, ULONG ConcurrentRequestCount) 
    {
        if (BlockSize & (BlockSize - 1))
        {
            __debugbreak();
        }
        _BlockSize = BlockSize, _ConcurrentRequestCount = ConcurrentRequestCount;
        _dwRefCount = 1, _hFile = 0, _pRequests = 0, _pData = 0;
        _dwThreadId = GetCurrentThreadId();
    }

    ~WriteTest()
    {
        if (_pData)
        {
            VirtualFree(_pData, 0, MEM_RELEASE);
        }

        if (_pRequests)
        {
            delete [] _pRequests;
        }

        if (_hFile)
        {
            NtClose(_hFile);
        }

        PostThreadMessageW(_dwThreadId, WM_QUIT, 0, 0);
    }

    void Release()
    {
        if (!InterlockedDecrement(&_dwRefCount))
        {
            delete this;
        }
    }

    void AddRef()
    {
        InterlockedIncrementNoFence(&_dwRefCount);
    }

    void StartWrite()
    {
        IO_STATUS_BLOCK iosb;

        FILE_VALID_DATA_LENGTH_INFORMATION fvdl;
        fvdl.ValidDataLength.QuadPart = _TotalSize;
        NTSTATUS status;

        if (0 > (status = NtSetInformationFile(_hFile, &iosb, &_TotalSize, sizeof(_TotalSize), FileEndOfFileInformation)) ||
            0 > (status = NtSetInformationFile(_hFile, &iosb, &fvdl, sizeof(fvdl), FileValidDataLengthInformation)))
        {
            DbgPrint("FileValidDataLength=%x\n", status);
        }

        ULONG offset = 0;
        ULONG dwNumberOfBytesTransfered = _BlockSize;

        _BytesLeft = _TotalSize + dwNumberOfBytesTransfered;

        ULONG ConcurrentRequestCount = _ConcurrentRequestCount;

        REQUEST* irp = _pRequests;

        _StartTime = GetTickCount64();

        do 
        {
            irp->opcode = opWrite;
            irp->pTest = this;
            irp->offset = offset;
            offset += dwNumberOfBytesTransfered;
            DoWrite(irp++);
        } while (--ConcurrentRequestCount);
    }

    void FillBuffer(PULONGLONG pu, LONGLONG ByteOffset)
    {
        ULONG n = _BlockSize / sizeof(ULONGLONG);
        do 
        {
            *pu++ = ByteOffset, ByteOffset += sizeof(ULONGLONG);
        } while (--n);
    }

    void DoWrite(REQUEST* irp)
    {
        LONG BlockSize = _BlockSize;

        LONGLONG BytesLeft = InterlockedExchangeAddNoFence64(&_BytesLeft, -BlockSize) - BlockSize;

        if (0 < BytesLeft)
        {
            LARGE_INTEGER ByteOffset;
            ByteOffset.QuadPart = _TotalSize - BytesLeft;

            PVOID Buffer = RtlOffsetToPointer(_pData, irp->offset);

            FillBuffer((PULONGLONG)Buffer, ByteOffset.QuadPart);

            AddRef();

            NTSTATUS status = NtWriteFile(_hFile, 0, 0, irp, irp, Buffer, BlockSize, &ByteOffset, 0);

            if (0 > status)
            {
                OnComplete(status, 0, irp);
            }
        }
        else if (!BytesLeft)
        {
            // write end
            ULONG64 time = GetTickCount64() - _StartTime;

            WCHAR sz[64];
            StrFormatByteSizeW((_TotalSize * 1000) / time, sz, RTL_NUMBER_OF(sz));
            DbgPrint("end:%S\n", sz);
        }
    }

    static VOID NTAPI _OnComplete(
        _In_    NTSTATUS status,
        _In_    ULONG_PTR dwNumberOfBytesTransfered,
        _Inout_ PVOID Ctx
        )
    {
        reinterpret_cast<REQUEST*>(Ctx)->pTest->OnComplete(status, dwNumberOfBytesTransfered, reinterpret_cast<REQUEST*>(Ctx));
    }

    VOID OnComplete(NTSTATUS status, ULONG_PTR dwNumberOfBytesTransfered, REQUEST* irp)
    {
        if (0 > status)
        {
            DbgPrint("OnComplete[%x]: %x\n", irp->opcode, status);
        }
        else 
        switch (irp->opcode)
        {
        default:
            __debugbreak();

        case opCompression:
            StartWrite();
            break;

        case opWrite:
            if (dwNumberOfBytesTransfered == _BlockSize)
            {
                DoWrite(irp);
            }
            else
            {
                DbgPrint(":%I64x != %x\n", dwNumberOfBytesTransfered, _BlockSize);
            }
        }

        Release();
    }

    NTSTATUS Create(POBJECT_ATTRIBUTES poa, ULONGLONG size)
    {
        if (!(_pRequests = new REQUEST[_ConcurrentRequestCount]) ||
            !(_pData = VirtualAlloc(0, _BlockSize * _ConcurrentRequestCount, MEM_COMMIT, PAGE_READWRITE)))
        {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        ULONGLONG sws = _BlockSize - 1;
        LARGE_INTEGER as;

        _TotalSize = as.QuadPart = (size + sws) & ~sws;

        HANDLE hFile;
        IO_STATUS_BLOCK iosb;

        NTSTATUS status = NtCreateFile(&hFile,
            DELETE|FILE_GENERIC_READ|FILE_GENERIC_WRITE&~FILE_APPEND_DATA,
            poa, &iosb, &as, 0, 0, FILE_OVERWRITE_IF, 
            FILE_NON_DIRECTORY_FILE|FILE_NO_INTERMEDIATE_BUFFERING, 0, 0);

        if (0 > status)
        {
            return status;
        }

        _hFile = hFile;

        if (0 > (status = RtlSetIoCompletionCallback(hFile, _OnComplete, 0)))
        {
            return status;
        }

        static USHORT cmp = COMPRESSION_FORMAT_NONE;

        REQUEST* irp = _pRequests;

        irp->pTest = this;
        irp->opcode = opCompression;

        AddRef();
        status = NtFsControlFile(hFile, 0, 0, irp, irp, FSCTL_SET_COMPRESSION, &cmp, sizeof(cmp), 0, 0);

        if (0 > status)
        {
            OnComplete(status, 0, irp);
        }

        return status;
    }
};

void WriteSpeed(POBJECT_ATTRIBUTES poa, ULONGLONG size, ULONG BlockSize, ULONG ConcurrentRequestCount)
{
    BOOLEAN b;
    NTSTATUS status = RtlAdjustPrivilege(SE_MANAGE_VOLUME_PRIVILEGE, TRUE, FALSE, &b);

    if (0 <= status)
    {
        status = STATUS_INSUFFICIENT_RESOURCES;

        if (WriteTest * pTest = new WriteTest(BlockSize, ConcurrentRequestCount))
        {
            status = pTest->Create(poa, size);

            pTest->Release();

            if (0 <= status)
            {
                MessageBoxW(0, 0, L"Test...", MB_OK|MB_ICONINFORMATION);
            }
        }
    }
}
RbMm
  • 31,280
  • 3
  • 35
  • 56
  • This seems like a great answer, but could you maybe post an example with header includes, etc.? – MHO Oct 22 '19 at 12:58
2

These are the suggestions that come to my mind:

  1. stop all running processes that are using the disk, in particular
    • disable Windows Defender realtime protection (or other anti virus/malware)
    • disable pagefile
  2. use Windows Resource Monitor to find processes reading or writing to your disk
  3. make sure you write continuous sectors on disk
  4. don't take into account file opening and closing times
  5. do not use multithreading (your disk is using DMA so the CPU won't matter)
  6. write data that is in RAM (obviously)
  7. be sure to disable all debugging features when building (build a release)
  8. if using an M.2 PCIe disk (which seems to be your case), make sure other PCIe devices aren't stealing PCIe lanes from your disk (the CPU has a limited number AND the mobo does too)
  9. don't run the test from your IDE
  10. disable Windows file indexing

Finally, you can find good hints on how to code fast writes in C/C++ in this question's thread: Writing a binary file in C++ very fast

L.C.
  • 1,098
  • 10
  • 21
  • I'd say it matters: much less than on a mechanical HD, but it still seems to matter: just compare random vs sequential write benchmarks for any SSD. That's probably because the controller in the SSD still has to do something (changing the address it's writing to rather than incrementing it; it's not much... but it's not nothing). – L.C. Jan 23 '19 at 12:59
  • Why is the multithreading example above faster if this does not matter? – MHO Oct 22 '19 at 13:49
  • Because the example isn't reasonably trying to write as fast as possible. I don't know why the wrong code behaves the way it behaves, but I know how it should behave if it were using DMA efficiently. Direct memory access (DMA) is a feature that allows certain hardware subsystems to access main system memory (RAM) independently of the central processing unit (CPU). – L.C. Oct 23 '19 at 20:06
-1

One area that might give you an improvement is to have your threads running constantly, each reading from its own queue.

At the moment, every time you go to write you spawn 4 threads (which is slow) and they are destroyed at the end of the function. You'll see a speedup of at least the thread-creation cost if you spawn the threads once at the start and have them all read from separate queues in an infinite loop.

They'll simply check, after a small delay, whether there's anything in their queue; if there is, they'll write it all. Your only issue then is making sure the order of the data is maintained.

Prodigle
  • 1,757
  • 12
  • 23