I am looking at designing an application in C/C++ that runs under Linux, currently Ubuntu 20.04. The application should be able to store data continuously at rates of around 1-10 GB/s, preferably on a filesystem, but that is not mandatory. For this I have four NVMe disks in the system organized as a software RAID 0, which should give a theoretical maximum transfer rate of around 12 GB/s.

First I experimented with just copying a file between another NVMe drive and the RAID, which gave rates of only around 1-2 GB/s. Then I wrote a simple C/C++ program that used the standard write() call, but without any improvement.
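
For reference, the naive version looked roughly like the sketch below (chunk size, total size and the output path are made up for illustration): a plain buffered write() loop through the filesystem, issuing one synchronous request at a time.

// Hypothetical baseline: buffered, sequential write() through the filesystem.
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  (4*1024*1024)            // 4 MiB per write(), chosen arbitrarily
#define TOTAL_BYTES (10LL*1024*1024*1024)    // 10 GiB test file

int main(int argc, char *argv[])
{
    if( argc != 2 )
    {
        fprintf(stderr, "Usage: %s <output file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if( fd < 0 ) { perror("open"); return EXIT_FAILURE; }

    char *buf = (char*)malloc(CHUNK_SIZE);
    if( buf == NULL ) { perror("malloc"); return EXIT_FAILURE; }
    memset(buf, 0xAA, CHUNK_SIZE);

    long long written = 0;
    while( written < TOTAL_BYTES )
    {
        ssize_t ret = write(fd, buf, CHUNK_SIZE);   // one synchronous request at a time
        if( ret < 0 ) { perror("write"); break; }
        written += ret;
    }

    fsync(fd);   // flush the page cache so the device actually sees the data
    close(fd);
    free(buf);
    return 0;
}

Everything here goes through the page cache and keeps only a single request outstanding, which is presumably why it stays in the 1-2 GB/s range.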

Then I realized that there is a tool in Ubuntu, gnome-disks. When I ran its read and write benchmark I reached around 10-11 GB/s. I gather that the utility is not going through a filesystem but reading from and writing to the disk itself, and that it exploits the fact that NVMe disks can handle requests in parallel.

Looking at the source code for gnome-disks, it appears that the utility also uses write() to write to the disk. From an strace of gnome-disks I would guess that write is replaced by poll, writev and possibly recvmsg.

How should NVMe disks be used from C/C++ to get good performance (and how does gnome-disks do it)? Which of the available libraries are relevant: libnvme, SPDK, aio, udisks2?
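
One route I have not benchmarked here, but which is often suggested for keeping several requests in flight on NVMe, is io_uring via liburing. The sketch below only illustrates the submit/complete pattern with O_DIRECT; the queue depth, block size and the reuse of a single buffer are my own assumptions for a synthetic test, not something taken from gnome-disks.

// Hedged sketch: queued writes with io_uring (liburing) and O_DIRECT.
// Build with -luring; queue depth, block size and alignment are illustrative.
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QUEUE_DEPTH 8
#define BLOCK_SIZE  (4*1024*1024)

int main(int argc, char *argv[])
{
    if( argc != 2 ) { fprintf(stderr, "Usage: %s <device>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR | O_DIRECT);
    if( fd < 0 ) { perror("open"); return 1; }

    struct io_uring ring;
    if( io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0 ) { perror("io_uring_queue_init"); return 1; }

    void *buf;
    if( posix_memalign(&buf, 4096, BLOCK_SIZE) != 0 ) { fprintf(stderr, "posix_memalign failed\n"); return 1; }
    memset(buf, 0x55, BLOCK_SIZE);

    // Queue QUEUE_DEPTH writes at consecutive offsets, then submit them all at once,
    // so the device sees several requests in parallel.
    for( int i = 0; i < QUEUE_DEPTH; i++ )
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
    }
    io_uring_submit(&ring);

    // Reap the completions; cqe->res is the byte count written or a negative errno.
    for( int i = 0; i < QUEUE_DEPTH; i++ )
    {
        struct io_uring_cqe *cqe;
        if( io_uring_wait_cqe(&ring, &cqe) < 0 ) break;
        if( cqe->res < 0 )
            fprintf(stderr, "write failed: %s\n", strerror(-cqe->res));
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}

Reusing one buffer for every queued write is fine for a throughput test but obviously not for real data.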


Update: I just made a short program that uses pwritev2 instead of write. This gave a massive improvement and I am now reaching around 12 GB/s writes. The code is below. Compile it (after deliberately removing the #error line near the top) and run it by giving the path to the device you want to test, e.g. speed /dev/md/raid0.

The program will destroy the data on the device, so be careful.

// Build as C++ (e.g. g++ -O2 speed.cpp -o speed). pwritev2() and RWF_HIPRI
// need _GNU_SOURCE, which g++ defines by default; with gcc add -D_GNU_SOURCE.
#include <sys/uio.h>
#include <fcntl.h>

#include <stdio.h>
#include <stdlib.h>
#include <cstring>
#include <stdint.h>
#include <unistd.h>

#include <time.h>
#include <errno.h>

#define LOOPS (20)                      // number of pwritev2() calls
#define BUFFERS (8)                     // iovec entries per call
#define BUFFER_SIZE (250*1024*1024)     // bytes per iovec entry (250 MiB)

#define NSEC_PER_SEC (1000000000L)

// Remove the #error below only once you have read the warning above: the
// program writes straight to the given device and destroys whatever is on it.
#error Do not run this program on a device where you have data you want to keep. This program will destroy the data!
int main(int argc, char *argv[])
{
    if( argc == 2 )
    {
        // O_DIRECT bypasses the page cache; it requires the buffers, offsets
        // and transfer sizes to be suitably aligned (page alignment is used here).
        int fd = open(argv[1], O_RDWR | O_DIRECT);

        if( fd > -1 )
        {
            long page_size;
            char *buffers_unaligned[BUFFERS];
            struct iovec iov[BUFFERS];

            page_size = sysconf(_SC_PAGESIZE);

            // Allocate each buffer with one extra page so the start address can
            // be rounded up to a page boundary, as required by O_DIRECT.
            for( int i=0; i<BUFFERS; i++)
            {
                buffers_unaligned[i] = (char*)malloc( BUFFER_SIZE + page_size );

                if( buffers_unaligned[i] == NULL )
                {
                    perror("Unable to allocate memory");
                    exit(EXIT_FAILURE);
                }
                else
                {
                    // Round the pointer up to the next page boundary.
                    iov[i].iov_base = (char*)(((uintptr_t)buffers_unaligned[i] + page_size) & ~(uintptr_t)(page_size-1));
                    memset( iov[i].iov_base, i, BUFFER_SIZE );
                    iov[i].iov_len = BUFFER_SIZE;
                }
            }

            struct timespec start, stop, elapsed;
            ssize_t nwritten = 0;

            clock_gettime(CLOCK_MONOTONIC, &start);

            for( int loop=0; loop<LOOPS; loop++ )
            {
                // Write all BUFFERS iovecs at the current offset; RWF_HIPRI
                // requests polled (high-priority) I/O where the device supports it.
                ssize_t ret = pwritev2(fd, iov, BUFFERS, nwritten, RWF_HIPRI);

                if( ret < 0 )
                {
                    perror("pwritev2");
                    exit(EXIT_FAILURE);
                }

                nwritten += ret;
            }

            fsync(fd);

            clock_gettime(CLOCK_MONOTONIC, &stop);

            printf("Written bytes: %zd\n", nwritten);

            // Subtract the timestamps, borrowing a second when the nanosecond
            // part would go negative.
            if((stop.tv_nsec - start.tv_nsec) < 0)
            {
                elapsed.tv_sec = stop.tv_sec - start.tv_sec - 1;
                elapsed.tv_nsec = stop.tv_nsec - start.tv_nsec + NSEC_PER_SEC;
            }
            else
            {
                elapsed.tv_sec = stop.tv_sec - start.tv_sec;
                elapsed.tv_nsec = stop.tv_nsec - start.tv_nsec;
            }

            unsigned long nsec = elapsed.tv_sec*NSEC_PER_SEC + elapsed.tv_nsec;

            printf("Time: %luns\n", nsec);

            // Bytes per nanosecond is numerically the same as (decimal) GB/s.
            double speed = (double)nwritten/nsec;

            printf("Speed: %f GB/s\n", speed);
            close(fd);

            for( int i=0; i<BUFFERS; i++)
            {
                free(buffers_unaligned[i]);
                buffers_unaligned[i] = NULL;
            }
        }
        else
        {
            perror("Error with open");
        }
    }
    else
    {
        fprintf(stderr, "Usage: %s <block device>\n", argv[0]);
    }
        
    return 0;
}
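
As a side note on the manual pointer rounding in the allocation loop: posix_memalign can hand back page-aligned buffers directly, which is a little less error prone. This is only a sketch of an alternative allocation loop for the program above; it does not change what is measured.

            // Alternative allocation using posix_memalign instead of malloc()
            // plus manual rounding. page_size comes from sysconf(_SC_PAGESIZE).
            for( int i=0; i<BUFFERS; i++)
            {
                void *p = NULL;

                if( posix_memalign(&p, page_size, BUFFER_SIZE) != 0 )
                {
                    fprintf(stderr, "posix_memalign failed\n");
                    exit(EXIT_FAILURE);
                }

                iov[i].iov_base = p;
                iov[i].iov_len = BUFFER_SIZE;
                memset( iov[i].iov_base, i, BUFFER_SIZE );
            }
            // Buffers obtained this way are released with free(iov[i].iov_base)
            // instead of free(buffers_unaligned[i]).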
  • If you show a [mre] of a program trying to write as fast as possible to a file, perhaps someone can see if there are any improvements you could make to it. – Ted Lyngmo Feb 08 '22 at 18:14
  • You've already found a huge part of the secret to super speed: Work around the OS and filesystem by directly interacting with the drive. How? That's a broad subject that's not particularly well suited to Stack Overflow, books are literally written on the subject, but I've been surprised by the explanatory powers of some of the gurus on this site. Hope you get a good answer. – user4581301 Feb 08 '22 at 18:15
  • If you write to a file, you are also dealing with file cache, inodes, journalling and directories. That overhead must be accounted for in your bandwidth calculation. – stark Feb 08 '22 at 18:20
  • Use related flags to bypass OS caching, etc. You'll end up writing in (properly aligned) blocks of certain size, you probably need to pre-allocate space for your file(s). Afaik, there is no local async i/o in linux, so you'll end up using thread pool to keep your HDD queues full. And, if you need full control to squeeze maximum performance -- you'll end up writing your own driver... with blackjack and hookers. Good luck :D – C.M. Feb 08 '22 at 21:09
  • [this](https://stackoverflow.com/questions/13407542/is-there-really-no-asynchronous-block-i-o-on-linux) can be helpful – C.M. Feb 08 '22 at 21:13
  • If you have found a solution, please post it as an answer instead of editing the question. – VLL Feb 09 '22 at 11:51
  • You should additionally clarify a bit more about the physical layout and hardware platform you are working with. Depending on the available and used PCIe lanes, you may be saturating links well before you reach the theoretical maximum capability of the drive. We'd run into cases where a riser card with multiple drives hooked into it only used 4x lanes, but to saturate both drives attached we needed 8x. – I.F. Adams Jun 17 '22 at 20:55
