Is it possible to ask Linux to blackhole bytes during a socket read?

Question

I have a C++ program running on Debian 9 (Linux). I'm doing a simple read() from a file descriptor:

ssize_t bytes_read = read(fd, buffer, buffer_size);

Imagine that I want to read some more data from the socket, but I want to skip a known number of bytes before getting to some content I'm interested in:

ssize_t unwanted_bytes_read = read(fd, unwanted_buffer, bytes_to_skip);

ssize_t useful_bytes = read(fd, buffer, buffer_size);

In Linux, is there a system-wide 'built-in' location that I can dump the unwanted bytes into, rather than having to maintain a buffer for unwanted data (like unwanted_buffer in the above example)?

I suppose what I'm looking for would be (sort of) the opposite of MSG_PEEK in the socket world, i.e. the kernel would purge bytes_to_skip from its receive buffer before the next useful call to recv.

If I were reading from a file then lseek would be enough. But this is not possible if you are reading from a socket and are using scatter/gather I/O, and you want to drop one of the fields.

I'm thinking about something like this:

// send side
int a = 1;
int b = 2;
int c = 3;
struct iovec iov[3];
ssize_t nwritten;

iov[0].iov_base = &a;
iov[0].iov_len  = sizeof(int);
iov[1].iov_base = &b;
iov[1].iov_len  = sizeof(int);
iov[2].iov_base = &c;
iov[2].iov_len  = sizeof(int);

nwritten = writev(fd, iov, 3);

// receive side
int a = -1;
int c = -1;
struct iovec iov[3]; // you know that you'll be receiving three fields and what their sizes are, but you don't care about the second.
ssize_t nread;

iov[0].iov_base = &a;
iov[0].iov_len  = sizeof(int);
iov[1].iov_base = ??? <---- what to put here?
iov[1].iov_len  = sizeof(int);
iov[2].iov_base = &c;
iov[2].iov_len  = sizeof(int);

nread = readv(fd, iov, 3);

I know that I could just create another b variable on the receive side, but if I don't want to, how can I read the sizeof(int) bytes that it occupies in the incoming data, discard them, and proceed to c? I could also create a generic buffer to dump b into; all I am asking is whether such a location exists by default.

[EDIT]

Following a suggestion from @inetknght, I tried memory mapping /dev/null and doing my gather into the mapped address:

int nullfd = open("/dev/null", O_WRONLY);
void* blackhole = mmap(NULL, iov[1].iov_len, PROT_WRITE, MAP_SHARED, nullfd, 0);

iov[1].iov_base = blackhole;    

nread = readv(fd, iov, 3);

However, blackhole comes out as 0xffff and errno is 13 (EACCES, 'Permission denied'). Running my code as root doesn't help either. Perhaps I'm setting up my mmap incorrectly?

It doesn't matter particularly. Could be a flat binary of sequential frames with predefined fields and you want to skip certain fields of each frame. — user12066, May 14 '19 at 11:05
@Lightness Races in Orbit, while you are correct that the duplicate question you link to does solve my problem, it doesn't answer the question posed 'In Linux, is there a system-wide 'built-in' location that I can dump the unwanted bytes into'. Presumably the answer is no. — user12066, May 14 '19 at 11:08
@user12066 Why would you want to actually extract the data just to dump it into a black hole? Simply skip past it... — Lightness Races in Orbit, May 14 '19 at 11:51
@curiousguy, it could be a socket if the frames are delimited. For the purpose of the question I don't think it matters. — user12066, May 15 '19 at 08:42
@LightnessRacesinOrbit: I have modified the question to show an alternative use case where I believe lseek is not appropriate. — user12066, May 15 '19 at 08:56
So you want to drop bytes from a socket at the OS API level? — Lightness Races in Orbit, May 15 '19 at 10:33
@LightnessRacesinOrbit: Yes. See example in edited question. — user12066, May 15 '19 at 10:56
Now it's an interesting question! I don't believe you can do this, but I'm going to tighten up the framing of your question a little, and re-open it. — Lightness Races in Orbit, May 15 '19 at 11:02
@LightnessRacesinOrbit: Thank you. One point worth clarifying: If I had used writev to write an iov to a file descriptor opened on a *file*, then I will be able to readv from it in exactly the same way, no? In that case, there is nothing socket-specific about the question. — user12066, May 15 '19 at 11:25
There's nothing socket-specific about the solution you're looking for, true, but the fact you need it to work with sockets is an important constraint (e.g. it rules out seeking!) so it seems like the best way to frame the question. YMMV. — Lightness Races in Orbit, May 15 '19 at 11:36
@user12066 you pose an interesting question! I'm not able to look into it myself at the moment but given that you're using scatter/gather I/O, you might try memory mapping `/dev/null` and providing the returned address as the destination gather location for the bytes you don't care about. It's not really an elegant solution though and doesn't quite tell the OS to ignore the bytes. It just gives a place for the OS to dump them to bitbucket. — inetknght, May 21 '19 at 23:38
@inetknght: Thank you for the suggestion; I will try it out. — user12066, May 22 '19 at 08:43
"I know that I could just create another b variable on the receive side, but if I don't want to" Why? What is the problem about this? That information has arrived, been through a lot of processing. That little memory copy is almost nothing compared to all the other stuff it went through. If that's unwanted data, try to not send it instead. — geza, May 22 '19 at 09:12
@geza I can imagine many scenarios where the developer is unable to change a server's code to _not_ send unnecessary bytes. Indeed, it's very likely given the modern web. — inetknght, May 22 '19 at 16:16
If you were actually in the socket world, you could use the **`MSG_TRUNC` flag** to `recv` and pass a null pointer as the buffer. This will cause the kernel to simply discard the received bytes of data, rather than copying them into the caller-specified buffer. It is not clear from the question whether you are *actually* using a socket, or need a solution that will work for any generic `read` call. (Note also that `MSG_TRUNC` is not portable. But it will work on Linux.) — Cody Gray - on strike, Apr 17 '21 at 08:11
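The `MSG_TRUNC` behaviour mentioned in the last comment is described in tcp(7) for TCP sockets on Linux. Below is a minimal sketch of that idea, assuming a connected TCP socket. The names `discard_bytes`, `tcp_pair` and `demo_skip_middle_int` are hypothetical, and the loopback connection exists only so the sketch can be exercised on its own.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: ask the kernel to drop `count` bytes from a connected
// TCP socket instead of copying them to user space (Linux-specific).
// Returns true once exactly `count` bytes have been discarded.
bool discard_bytes(int fd, std::size_t count)
{
    assert(fd >= 0);
    while (count > 0) {
        ssize_t n = recv(fd, nullptr, count, MSG_TRUNC);
        if (n < 0 && errno == EINTR) continue;  // interrupted, try again
        if (n <= 0) return false;               // error or peer closed
        count -= static_cast<std::size_t>(n);
    }
    return true;
}

// Scaffolding for the demo only: a connected TCP pair over loopback.
bool tcp_pair(int &client, int &server)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    if (listener < 0) return false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // let the kernel pick a free port
    socklen_t len = sizeof(addr);
    if (bind(listener, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0 ||
        listen(listener, 1) < 0 ||
        getsockname(listener, reinterpret_cast<sockaddr *>(&addr), &len) < 0) {
        close(listener);
        return false;
    }
    client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0 ||
        connect(client, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
        if (client >= 0) close(client);
        close(listener);
        return false;
    }
    server = accept(listener, nullptr, nullptr);
    close(listener);
    if (server < 0) {
        close(client);
        return false;
    }
    return true;
}

// Sends {1, 2, 3}, keeps the first and third int and drops the second.
bool demo_skip_middle_int()
{
    int client = -1, server = -1;
    if (!tcp_pair(client, server)) return false;
    const int sent[3] = {1, 2, 3};
    int a = -1, c = -1;
    const ssize_t one = static_cast<ssize_t>(sizeof(int));
    bool ok = send(client, sent, sizeof(sent), 0) == 3 * one
        && recv(server, &a, sizeof(a), MSG_WAITALL) == one
        && discard_bytes(server, sizeof(int))
        && recv(server, &c, sizeof(c), MSG_WAITALL) == one;
    close(client);
    close(server);
    return ok && a == 1 && c == 3;
}
```

This costs one extra syscall per skipped region and is not portable beyond Linux, so it is probably only worth considering when the skipped region is large.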

score 3 · Accepted Answer · answered May 22 '19 at 16:54

There's a tl;dr at the end.

In my comment, I suggested that you mmap() the /dev/null device. However, that device is not mappable on my machine (errno 19: No such device). /dev/zero is mappable, though, and another question/answer suggests that mapping it is equivalent to using MAP_ANONYMOUS, which makes the fd argument and its associated open() unnecessary in the first place. Here is an example:

#include <iostream>
#include <cstring>
#include <cerrno>
#include <cstdlib>

extern "C" {
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/uio.h>   // readv(), writev(), struct iovec
#include <fcntl.h>
}

template <class Type>
struct iovec ignored(void *p)
{
    struct iovec iov_ = {};
    iov_.iov_base = p;
    iov_.iov_len = sizeof(Type);
    return iov_;
}

int main()
{
    auto * p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if ( MAP_FAILED == p ) {
        auto err = errno;
        std::cerr << "mmap(MAP_PRIVATE | MAP_ANONYMOUS): " << err << ": " << strerror(err) << std::endl;
        return EXIT_FAILURE;
    }

    int s_[2] = {-1, -1};
    int result = socketpair(AF_UNIX, SOCK_STREAM, 0, s_);
    if ( result < 0 ) {
        auto err = errno;
        std::cerr << "socketpair(): " << err << ": " << strerror(err) << std::endl;
        return EXIT_FAILURE;
    }

    int w_[3] = {1,2,3};
    ssize_t nwritten = 0;
    auto makeiov = [](int & v){
        struct iovec iov_ = {};
        iov_.iov_base = &v;
        iov_.iov_len = sizeof(v);
        return iov_;
    };
    struct iovec wv[3] = {
        makeiov(w_[0]),
        makeiov(w_[1]),
        makeiov(w_[2])
    };

    nwritten = writev(s_[0], wv, 3);
    if ( nwritten < 0 ) {
        auto err = errno;
        std::cerr << "writev(): " << err << ": " << strerror(err) << std::endl;
        return EXIT_FAILURE;
    }

    int r_ = {0};
    ssize_t nread = 0;
    struct iovec rv[3] = {
        ignored<int>(p),
        makeiov(r_),
        ignored<int>(p),
    };

    nread = readv(s_[1], rv, 3);
    if ( nread < 0 ) {
        auto err = errno;
        std::cerr << "readv(): " << err << ": " << strerror(err) << std::endl;
        return EXIT_FAILURE;
    }

    std::cout <<
        w_[0] << '\t' <<
        w_[1] << '\t' <<
        w_[2] << '\n' <<
        r_ << '\t' <<
        *(int*)p << std::endl;

    return EXIT_SUCCESS;
}

In the above example I create a private (writes won't be visible to child processes after fork()), anonymous (not backed by a file) memory mapping of 4 KiB (a single page on most systems). It is then used twice as the write destination for two ints, the later int overwriting the earlier one.

That doesn't exactly answer your question of how to ignore the bytes. Since you're using readv(), I looked into its sister function preadv(), which at first glance appears to do what you want: skip bytes. However, it is not supported on socket file descriptors. The following code fails with preadv(): 29: Illegal seek.

struct iovec rv1 = makeiov(r_);
// Ask for one int starting at offset sizeof(int), i.e. try to skip the first int.
nread = preadv(s_[1], &rv1, 1, sizeof(int));
if ( nread < 0 ) {
    auto err = errno;
    std::cerr << "preadv(): " << err << ": " << strerror(err) << std::endl;
    return EXIT_FAILURE;
}

So it looks like preadv() needs a seekable file descriptor, just as lseek() does, and a socket is not seekable (hence errno 29, ESPIPE). I'm not sure whether there is (yet?) a way to tell the OS to ignore/drop bytes received on an established stream. I suspect that's because @geza is correct: the cost of writing to the final (ignored) destination is trivial in most situations I've encountered. In the situations where the cost of the ignored bytes is not trivial, you should seriously consider better options, implementations, or protocols.

tl;dr:

Creating a 4 KiB anonymous private memory mapping is effectively indistinguishable from using a contiguous-allocation container (there are subtle differences that aren't likely to matter for any workload outside of very high-end performance). Using a standard container is also a lot less prone to allocation bugs: memory leaks, wild pointers, et al. So I'd say KISS and just do that instead of endorsing any of the code I wrote above. For example: std::array<char, 4096> ignored; or std::vector<char> ignored(4096); (note the parentheses: brace initialization would not create a 4096-element vector). Then set iovec.iov_base = ignored.data(); and set .iov_len to whatever size you need to ignore (within the length of the container).
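As a rough sketch of that KISS suggestion, the following uses a std::array as the dump location for the middle field of the three-int message from the question. The names `scratch_iov`, `read_a_and_c` and `demo_kiss_scratch` are made up for illustration, and the socketpair in the demo only stands in for a real connection.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: an iovec that sends `len` bytes into a scratch buffer.
template <std::size_t N>
iovec scratch_iov(std::array<char, N> &scratch, std::size_t len)
{
    assert(len <= N);  // never let the kernel write past the scratch buffer
    iovec iov{};
    iov.iov_base = scratch.data();
    iov.iov_len = len;
    return iov;
}

// Reads three ints from `fd`, keeping the first and third.
// Returns true only if all three fields arrived in a single readv().
bool read_a_and_c(int fd, int &a, int &c)
{
    std::array<char, 4096> ignored;  // contents are never looked at
    iovec iov[3] = {
        {&a, sizeof(a)},
        scratch_iov(ignored, sizeof(int)),
        {&c, sizeof(c)},
    };
    return readv(fd, iov, 3) == static_cast<ssize_t>(3 * sizeof(int));
}

// Demo: write {1, 2, 3} into one end of a socketpair, read a and c back.
bool demo_kiss_scratch()
{
    int s[2] = {-1, -1};
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) < 0) return false;
    const int sent[3] = {1, 2, 3};
    int a = -1, c = -1;
    bool ok = write(s[0], sent, sizeof(sent)) == static_cast<ssize_t>(sizeof(sent))
        && read_a_and_c(s[1], a, c);
    close(s[0]);
    close(s[1]);
    return ok && a == 1 && c == 3;
}
```

On a real stream socket readv() may return fewer bytes than requested, so production code would have to loop and adjust the iovec array until the whole frame has arrived.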

To be clear, `mmap` of private, anonymous memory doesn't actually black hole it. That's what memory allocators do to satisfy large allocation requests (and depending on design, to allocate chunks from which to satisfy smaller requests). Writing to it is effectively the same as writing to a stack allocated array or a `malloc`/`new`-ed buffer (a little slower than the stack on first write actually, since Linux anonymous mmaps are lazily copied-on-write from mappings of the zero page, so the first write has to make real zeroed memory available). — ShadowRanger, May 22 '19 at 17:05
No need for that `extern "C"`. You should take a look into those header files and spot `__BEGIN_DECLS` and `__END_DECLS`. — Maxim Egorushkin, May 22 '19 at 17:27
@inetknght: I guess I'm happy enough with your KISS solution above. The only caveat being that if the data I want to blackhole is over 4kb, I'd end up with a memory overflow. — user12066, May 22 '19 at 21:51
@ShadowRanger: per your comment, presumably the call to mmap sets the size of the anonymous buffer, so I'd still have to specify this to be at least the size of the variable I want to disregard in order to prevent overflow? — user12066, May 22 '19 at 21:51
@user12066 you can either resize the ignored buffer container or, if you're still using scatter/gather, you can repeat the same ignored buffer location as many times as needed. See how I use `ignored(p)` twice when constructing my `iovec` array passed to `readv()`? Both constructed `iovec` structs point to the same address, so the OS just overwrites previous (ignored) data. That's not _efficient_ but it is pretty simple. — inetknght, May 22 '19 at 22:30
@inetknght: I tried your solution using MAP_ANONYMOUS and that worked. I also tried using a local stack buffer, which also worked. They both give roughly the same overhead too in terms of latency. — user12066, May 23 '19 at 16:22
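The idea from the comments of repeating the same ignored buffer to skip a region larger than the buffer itself could look roughly like this. It is only a sketch: `append_skip` and `demo_skip_large_field` are hypothetical names, the 16-byte scratch buffer is deliberately tiny to force the repetition, and a real caller would also have to stay within the IOV_MAX limit (1024 entries on Linux) and handle short reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper: append enough iovec entries to swallow `skip` bytes.
// Every entry points at the same scratch buffer, so later chunks simply
// overwrite earlier ones.
void append_skip(std::vector<iovec> &iov, char *scratch,
                 std::size_t scratch_len, std::size_t skip)
{
    assert(scratch_len > 0);
    while (skip > 0) {
        const std::size_t chunk = std::min(skip, scratch_len);
        iov.push_back(iovec{scratch, chunk});
        skip -= chunk;
    }
}

// Demo: an int header, 40 unwanted bytes, an int trailer over a socketpair.
bool demo_skip_large_field()
{
    int s[2] = {-1, -1};
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) < 0) return false;

    const std::size_t unwanted = 40;
    char msg[sizeof(int) + unwanted + sizeof(int)] = {};
    const int header = 7, trailer = 9;
    std::memcpy(msg, &header, sizeof(int));
    std::memcpy(msg + sizeof(int) + unwanted, &trailer, sizeof(int));

    int got_header = -1, got_trailer = -1;
    char scratch[16];  // smaller than the field being skipped
    std::vector<iovec> iov;
    iov.push_back(iovec{&got_header, sizeof(int)});
    append_skip(iov, scratch, sizeof(scratch), unwanted);  // 16 + 16 + 8 bytes
    iov.push_back(iovec{&got_trailer, sizeof(int)});

    const ssize_t total = static_cast<ssize_t>(sizeof(msg));
    bool ok = write(s[0], msg, sizeof(msg)) == total
        && readv(s[1], iov.data(), static_cast<int>(iov.size())) == total;
    close(s[0]);
    close(s[1]);
    return ok && iov.size() == 5 && got_header == 7 && got_trailer == 9;
}
```

This addresses the overflow concern raised above: no single iovec entry is ever longer than the scratch buffer, whatever the size of the skipped field.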

score 1 · Answer 2 · Maxim Egorushkin · 2019-05-22T20:16:16.003

Reading data from a socket is efficient when:

The user-space buffer is the same size as the kernel socket receive buffer, or larger (SO_RCVBUF_size + maximum_message_size - 1). You can even map the buffer's memory pages twice contiguously to make it a ring buffer, which avoids memmove-ing incomplete messages to the beginning of the buffer.
The reading is done in one call of recv. This minimizes the number of syscalls (which are more expensive these days due to mitigations for Spectre, Meltdown, etc.). It also prevents starvation of other sockets in the same event loop, which can happen if the code repeatedly calls recv on the same socket with a small buffer size until it fails with EAGAIN, and it guarantees that you drain the entire kernel receive buffer in one recv syscall.

If you do the above, you should then interpret/decode the message from the user-space buffer ignoring whatever is necessary.

Using multiple recv or recvmsg calls with small buffer sizes is sub-optimal with regard to latency and throughput.
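A minimal sketch of this approach, assuming the fixed three-int frames from the question: one recv into a large user-space buffer, followed by decoding that simply steps over the unwanted field. The names `Frame`, `drain_frames` and `demo_single_recv_decode`, and the 64 KiB buffer size, are illustrative assumptions rather than anything prescribed by this answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// The two fields the receiver cares about; the middle int is never stored.
struct Frame {
    int a;
    int c;
};

// Drains whatever is currently queued on `fd` with a single recv() and
// decodes every complete frame, skipping the middle field of each.
std::vector<Frame> drain_frames(int fd)
{
    assert(fd >= 0);
    std::vector<char> buf(64 * 1024);  // assumed to cover the receive buffer
    std::vector<Frame> frames;

    const ssize_t n = recv(fd, buf.data(), buf.size(), 0);
    if (n <= 0) return frames;  // error or connection closed

    const std::size_t frame_size = 3 * sizeof(int);
    const std::size_t avail = static_cast<std::size_t>(n);
    for (std::size_t off = 0; off + frame_size <= avail; off += frame_size) {
        Frame f{};
        std::memcpy(&f.a, buf.data() + off, sizeof(int));
        // The unwanted int at off + sizeof(int) is simply not copied.
        std::memcpy(&f.c, buf.data() + off + 2 * sizeof(int), sizeof(int));
        frames.push_back(f);
    }
    return frames;  // a trailing partial frame is dropped in this sketch
}

// Demo: two frames {1,2,3} and {4,5,6} sent over a socketpair.
bool demo_single_recv_decode()
{
    int s[2] = {-1, -1};
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) < 0) return false;
    const int sent[6] = {1, 2, 3, 4, 5, 6};
    bool ok = send(s[0], sent, sizeof(sent), 0) == static_cast<ssize_t>(sizeof(sent));
    std::vector<Frame> frames = drain_frames(s[1]);
    close(s[0]);
    close(s[1]);
    return ok && frames.size() == 2
        && frames[0].a == 1 && frames[0].c == 3
        && frames[1].a == 4 && frames[1].c == 6;
}
```

A real implementation would keep the incomplete tail of the buffer for the next read, which is where the ring-buffer mapping mentioned above would help.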

edited May 22 '19 at 20:16

answered May 22 '19 at 17:37


Even more efficient *could* be use of zero-copy buffers (`SO_ZEROCOPY` socket option and `MSG_ZEROCOPY` flag for `send()`). Those aren't well-documented in the man pages though, and spelunking through the Linux kernel code to learn how that works and debug any issues might not be for the faint of heart. – inetknght May 22 '19 at 19:09
@inetknght Yes, absolutely. And on Intel hardware one can use https://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/ – Maxim Egorushkin May 22 '19 at 20:04
@MaximEgorushkin: I got interested in readv/writev because man pages implied their use could be more efficient than read or recv. To use TCP as an example, my understanding was that the order of events would be: 1) read/recv: copies data from the kernel buffer into a OS-layer buffer. 2) the buffer contents copied piece-by-piece into some variables using memcpy, say members of a struct. With this sequence, there are effectively 1 + N memcpy's - one in kernel space, and N in user space, one per variable. (see next comment for continuation) – user12066 May 22 '19 at 21:41
@MaximEgorushkin: (continued) What I don't understand is why the method above is more efficient than using a single readv, which I believe only (effectively) uses N x memcpy in kernel space to transfer data directly from the kernel socket buffer into variables? – user12066 May 22 '19 at 21:43
@user12066 `read` is `readv` with 1 buffer. Processing each additional buffer has non-0 overhead because the kernel has to check that each buffer is indeed a valid virtual memory region. – Maxim Egorushkin May 22 '19 at 22:23
@MaximEgorushkin: Ok, that makes sense - thank you. So it's the virtual memory region checking that causes the additional latency for readv. I won't mark your answer as the accepted answer because technically it didn't answer the question I posed. However your comments make it clear why readv may not be the most efficient way to do this. – user12066 May 22 '19 at 22:33
