
I'm trying to use vmsplice in place of write when writing to a pipe, because write is much slower through a pipe: on my computer it runs at about 0.1 times the speed of writing directly without a pipe. According to this post, write is slow because it has to copy the buffer into the pipe, while vmsplice can do the same job without copying.

In the code, outw and outv are meant to do the same job. I wrote outv the same way the author of the linked post did in assembly:

mov [%rip + iovec_base], OUTPUT_PTR
mov [%rip + iovec_base + 8], %rdx
mov ARG1e, 1
lea ARG2, [%rip + iovec_base]
mov ARG3e, 1
xor ARG4e, ARG4e
1: mov SYSCALL_NUMBER, __NR_vmsplice
syscall
call exit_on_error
add [ARG2], SYSCALL_RETURN
sub [ARG2 + 8], SYSCALL_RETURN
jnz 1b

This is my code. Always run it with the output going through a pipe, e.g. ./a.out | cat. Otherwise vmsplice fails because stdout is not a pipe, and the program aborts.

#define _GNU_SOURCE
#include <stdbool.h>
#include <stdalign.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define S 0x10
alignas(0x1000) static char b[S];

void outw(int n) {
    write(1, b, n);
}

void outv(int n) {
    struct iovec iov = {b, n};
    do {
        if ((n = vmsplice(1, &iov, 1, 0)) < 0) abort();
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= n;
    } while (iov.iov_len);
}

#define _(f, n) do {\
    for (int i = 0; i < 3; ++i) {\
        memset(b, i + '0', (n) - 1);\
        b[(n) - 1] = '\n';\
        f(n);\
    }\
} while (false)

int main() {
    _(outw, S);
    _(outw, S - 1);
    write(1, "---\n", 4);
    _(outv, S);
    _(outv, S - 1);
}

The expected output is:

000000000000000
111111111111111
222222222222222
00000000000000
11111111111111
22222222222222
---
000000000000000
111111111111111
222222222222222
00000000000000
11111111111111
22222222222222

but for the second part I get:

22222222222222

22222222222222

22222222222222

22222222222222
22222222222222
22222222222222

When I add this line to the first line of main,

fcntl(1, F_SETPIPE_SZ, S);

the second part of the output is a bit better, but still wrong:

111111111111111
222222222222222
000000000000000

11111111111111
22222222222222
22222222222222
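
One thing worth checking about the F_SETPIPE_SZ call above: as far as I understand fcntl(2), the kernel rounds a request below the page size up to one page and returns the capacity it actually set, so with S = 0x10 the pipe would not really become 16 bytes. A minimal sketch for checking this, where request_pipe_size is a hypothetical helper name and not something from the original code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask for `want` bytes of pipe capacity and return the
 * capacity the kernel reports afterwards, or -1 on error (for example when
 * fd does not refer to a pipe). */
static int request_pipe_size(int fd, int want)
{
    assert(want > 0);

    /* On success F_SETPIPE_SZ returns the capacity that was actually set,
     * which may be larger than the requested value. */
    if (fcntl(fd, F_SETPIPE_SZ, want) < 0)
        return -1;

    /* F_GETPIPE_SZ reports the current capacity in bytes. */
    return fcntl(fd, F_GETPIPE_SZ);
}
```

Calling request_pipe_size(1, S) at the start of main and writing the result to stderr would show how large the pipe really is when comparing it with the buffer size.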

I tried matching the size of the buffer to be written to the pipe size by commenting out these lines.

//_(outw, S - 1);
//_(outv, S - 1);

Still, the top and the bottom don't match.

000000000000000
111111111111111
222222222222222
---
111111111111111
222222222222222
222222222222222

So what am I doing wrong, and how do I make outv do the same job as outw but without copying?


I kind of solved the problem by setting the buffer size to at least 0x10000 (65536) and setting the pipe size to match. I'm not entirely sure, but it seems that nothing happens until the pipe is full, and once it is full, whatever routine handles the output assumes it can still copy from the same buffer for the earlier vmsplice calls, regardless of whether the buffer's contents have changed.


I thought I had solved the problem, but I hadn't. I still get unexpected output in the actual program, where I matched the output buffer size and the pipe size at 0x100000. Everything works fine with write apart from the very slow speed, so the problem is in the way I'm using vmsplice. The man page for this system call isn't clear about what exactly the call does and what can happen in which cases.

xiver77

1 Answer


Your program reuses the same b buffer on each call, but vmsplice maps the memory you pass into the pipe without copying it. Consequently, you mustn't modify the spliced data until the reader on the other side of the pipe has read it.
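
To see the effect in isolation, here is a minimal single-process sketch. It assumes a Linux kernel where vmsplice into a pipe is zero-copy, and byte_seen_by_reader is a hypothetical name used only for this illustration. It splices a page, overwrites the page the way the question's loop does, and only then reads the pipe:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: splice one page filled with `before` into a fresh pipe,
 * overwrite the page with `after`, then read the pipe back.
 * Returns the first byte the reader sees, or -1 on error. */
static int byte_seen_by_reader(char before, char after)
{
    long page = sysconf(_SC_PAGESIZE);
    int p[2];
    int result = -1;

    if (page <= 0 || pipe(p) < 0)
        return -1;

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf != MAP_FAILED) {
        memset(buf, before, (size_t)page);
        struct iovec iov = { .iov_base = buf, .iov_len = (size_t)page };

        if (vmsplice(p[1], &iov, 1, 0) == page) {
            /* Reuse the buffer before anyone has read the pipe. */
            memset(buf, after, (size_t)page);

            char c;
            if (read(p[0], &c, 1) == 1)
                result = (unsigned char)c;
        }
        munmap(buf, (size_t)page);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

If the pipe holds a reference to the page rather than a copy, byte_seen_by_reader('A', 'B') should give back 'B', which would match the repeated 2 lines in the question's output.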

John Kugelman
  • I'm basing this on various online references ([1](https://lwn.net/Articles/181169/), [2](https://stackoverflow.com/questions/10639559/linux-zero-copy-transfer-memory-pages-between-two-processes-with-vmsplice), [3](https://arstechnica.com/civis/viewtopic.php?f=16&t=86279)), but without having tested it myself. – John Kugelman Dec 30 '21 at 17:19
  • Sorry for the late reply; somehow I didn't receive a notification for your answer. I also ran many kinds of tests, and the safest way to use `vmsplice` seems to be `mmap` -> `vmsplice` with `SPLICE_F_GIFT` -> `munmap`. `munmap` can be called directly after `vmsplice` returns, and this works similarly to `aio_write` but is actually a lot faster. – xiver77 Jan 02 '22 at 03:20
  • The problem is that `mmap`ing a new buffer every time prevents the buffer from staying in cache, so I tried to find a way to safely rewrite the buffer after `vmsplice`. Unfortunately I couldn't find a definite answer to this problem. The fact that `vmsplice` has returned does not mean that the pipe has fully consumed the buffer, and I couldn't find a way to check this. – xiver77 Jan 02 '22 at 03:20
  • The weird thing is, given that everything is done *single-threaded* and the buffer size is equal to the pipe size, the time at which the pipe consumes the buffer given by `vmsplice` is quite predictable, so overwriting the buffer in a predictable pattern apparently produces a predictable result. – xiver77 Jan 02 '22 at 03:20
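
The `mmap` -> `vmsplice` with `SPLICE_F_GIFT` -> `munmap` sequence described in the comments could look roughly like the sketch below. The helper name gift_filled_pages and the fill pattern are made up for illustration, and the code assumes the length is a whole number of pages, since the man page asks for gifted data to be page aligned in both address and length:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: fill a fresh mapping of `len` bytes with `ch`
 * (ending in a newline), gift it to the pipe `fd`, then unmap it.
 * `len` must be a nonzero multiple of the page size.
 * Returns 0 on success, -1 on error. */
static int gift_filled_pages(int fd, int ch, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0 || len % (size_t)page != 0) {
        errno = EINVAL;
        return -1;
    }

    /* A new anonymous mapping per call, so no earlier data is overwritten. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, ch, len - 1);
    buf[len - 1] = '\n';

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int rc = 0;
    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(fd, &iov, 1, SPLICE_F_GIFT);
        if (n < 0) {
            rc = -1;
            break;
        }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }

    /* Drop our mapping right after vmsplice returns, as the comment above
     * describes; the program never touches these pages again. */
    munmap(buf, len);
    return rc;
}
```

Because every call writes into memory that is never touched again, there is nothing to overwrite while the reader is still behind. The cost is the extra `mmap`/`munmap` pair per buffer mentioned above.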