C++ NTOH conversion with dispatcher - event queue

Question

We are rewriting our legacy code in C to C++. At the core of our system, we have a TCP client, which is connected to master. Master will be streaming messages continuously. Each socket read will result in say an N number of message of the format - {type, size, data[0]}.

Now we don't copy these messages into individual buffers - but just pass the pointer the beginning of the message, the length and shared_ptr to the underlying buffer to a workers.

The legacy C version was single threaded and would do an inplace NTOH conversion like below:

struct Message {
   uint32_t something1; 
   uint16_t something2; 
};

process (char *message)
Message *m = (message);
m->something1 = htonl(m->something1);
m->something2 = htons(m->something2);

And then use the Message.

There are couple of issues with following the logging in new code.

Since we are dispatching the messages to different workers, each worker doing an ntoh conversion will cause cache miss issues as the messages are not cache aligned - i.e there is no padding b/w the messages.
Same message can be handled by different workers - this is the case where the message needs to processed locally and also relayed to another process. Here the relay worker needs the message in original network order and the local work needs to convert to host order. Obviously as the message is not duplicated both cannot be satisfied.

The solutions that comes to my mind are -

Duplicate the message and send one copy for all relay workers if any. Do the ntoh conversion of all messages belonging to same buffer in the dispatcher itself before dispatching - say by calling a handler->ntoh(message); so that the cache miss issue is solved.
Send each worker the original copy. Each worker will copy the message to local buffer and then do ntoh conversion and use it. Here each worker can use a thread-specific (thread_local) static buffer as a scratch pad to copy the message.

Now my question is

Is the option 1 way of doing ntoh conversion - C++sy? I mean the alignment requirement of the structure will be different from the char buffer. (we havent had any issue with this yet.). Using scheme 2 should be fine in this case as the scratch buffer can have alignment of max_align_t and hence should typecastable to any structure. But this incur copying the entire message - which can be quite big (say few K size)
Is there a better way to handle the situation?

IIRC, dereferencing a `Message*` that was cast from a `char*` is undefined behavior in C++, unless the `char*` was originally cast from a `Message*` that pointed to a properly-instantiated instance of `Message`. Just FYI. (Reading from the network into a `char*` and casting the pointer to `Message*` isn't really safe in C++. Unless I'm remembering incorrectly.) — cdhowie, Apr 10 '17 at 19:21
@cdhowie. Okay any idea why is it undefined behavior? Is it okay if I do this - `std::aligned_storage::type buffer; auto m = new (buffer) Message; memcpy(m, message, sizeof(Message));` I guess this would not be either? I cannot use a proper deserializer as the Message struct is not packed :-( Instead the legacy code - on master side does this - `struct Message { uint32_t field1 __attribute__((aligned(4))); uint32_t field2 __attribute__((aligned(2))); }; — MGH, Apr 10 '17 at 19:30
I assume you meant `std::aligned_storage::type buffer;` (swapped template arguments, and used `::type`)? If so, then I would think it is safe, as long as `MAX_SIZE >= sizeof(Message)`. — cdhowie, Apr 10 '17 at 19:33
@cdhowie: Okay thanks! I edited my comment, before I saw your response. Sorry! So probably scheme 2 is the only valid way to do this isn't it? — MGH, Apr 10 '17 at 19:37
I'm not sure, they both are valid enough. I assume your question has more to do with the performance implications of each? E.g. flipping messages that don't need to be (because they are only being relayed) and processing a message that needs to be both relayed (not flipped) and processed locally (flipped)? — cdhowie, Apr 10 '17 at 20:28
Scheme 2 + the aligned_storage... thing we discussed above should be valid right? With scheme 1, we cannot do the aligned_storage thing, as it can only be inplace, unless we allocate memory per message, which I dont want to do. Yes the concern is performance. — MGH, Apr 10 '17 at 21:20
So am I to assume that you want to read multiple messages in one `read()` operation? If you are able to read one message at a time, you can obtain aligned storage immediately following the prior message, within the same buffer. — cdhowie, Apr 10 '17 at 21:25
System call is the biggest bottleneck for us. So we need to read as many messages as possible per read. But I see your point for each read choose an aligned location within the buffer - And I guess then we then safely do a (Message *) typecast. — MGH, Apr 10 '17 at 21:43
Well, right.. but you have to placement-new each `Message` before you read over it, to technically avoid UB. Alternatively, have `Message` accept a `void*` in its constructor and pass it a pointer to the beginning of its message in the buffer, and have accessors that extract and return the data by offset/`ntoh`. The emitted code should wind up being very similar, with similar performance characteristics, but this way you have a class that decodes a buffer, instead of reinterpreting raw bytes as an object of that class. — cdhowie, Apr 10 '17 at 22:03
That is, you would instead have something like `class Message { void const *data; public: explicit Message(void const *p_data) : data(p_data) {} uint32_t something() const { return htonl(*reinterpret_cast(p_data)); } /* other accessors */ };`. Obviously it would still be beneficial to have your data be aligned, but this way at least you are avoiding the UB of casting a `char*` to `Message*` when it may not be practical to placement-new each one prior to reading it out (since you can't control the length of each message). — cdhowie, Apr 10 '17 at 22:04
Also note that the data fields in `data` themselves [have to be appropriately aligned to be able to evaluate `*reinterpret_cast(p_data)`](http://stackoverflow.com/a/32590117/501250) as spec. behavior. You might wind up needing to abandon `htonl` for example and writing your own of the form `inline uint32_t hotn_u32(uint8_t const *p) { return (static_cast(p[0]) << 24) | (static_cast(p[1]) << 16) | (static_cast(p[2]) << 8) | static_cast(p[3]); }` This has the potential to kill performance, but at least it's safe for unaligned access. — cdhowie, Apr 10 '17 at 22:18
Makes sense. But both the above schemes will not work for us, as the structure is not packed. Instead we have - `struct Message { uint32_t something1 __attribute__((aligned(4))); }` So I think the only option is to copy the message to a appropriately aligned buffer and then do ntoh. — MGH, Apr 10 '17 at 22:37
Even if it's not packed, the second approach will work for you because accessing a `uint8_t` is always aligned. Perhaps I should write up a gist? — cdhowie, Apr 11 '17 at 01:14
Okay. I am not sure if I got your point fully. But based on what you said, I edited an old post I had to include the scheme - if you want to take a look - http://codereview.stackexchange.com/questions/159277/forced-and-easy-hton-ntoh-conversion — MGH, Apr 11 '17 at 03:32
I was thinking [something like this](http://coliru.stacked-crooked.com/a/c518e6f8d2c301b7). Note that since the data pointer is always dereferenced as `uint8_t`, there is no alignment violation, ever. The `ntoh` implementation is safe for all host endiannesses. It is sub-optimal on big-endian hosts (where nothing is actually flipped), *except for the fact that even on such hosts it prevents alignment violation problems.* — cdhowie, Apr 11 '17 at 15:59
I have my doubts on `uint16_t Message::something2() { return ntoh_u16(data + 4); }`. The offset of something2 in the buffer will not be data + 4 if the structure is not packed. (The sender is serializing an unpacked struct). Hence in my implementation, I am using offsetof() - these are POD structures to get the offset. — MGH, Apr 11 '17 at 19:39
Hmm, that is a bit worrisome just because the organization is up to the sender's compiler. In theory, a compiler update could restructure your data. But, one solution to the problem at hand would be to declare the same struct in your C++ code but only use `offsetof()` (in `Message::something1/2`) to determine the offset. — cdhowie, Apr 11 '17 at 20:03
Using a `proto` type like that can actually allow you to automatically implement the getters. See [this example](http://coliru.stacked-crooked.com/a/c7a95a72522fec90), where `Message::something1/2` are implemented using a macro with all kinds of type-deduction magic. — cdhowie, Apr 11 '17 at 20:20
This is good. I think we are doing more or less the same thing. But since you put all the hairy parts inside the class - things look much neater. I had it as a macro + a genric wrapper, so that I can use existing C structure and dont have to define a new encapsulating class. But looking at your code, I think its worth defining an encapsulating class. for each message. I will accept the answer if you give this as an answer. — MGH, Apr 11 '17 at 22:20
Yeah, a class per message type was my idea (just giving an example here). Each one has a private `proto` struct that defines the message structure, with the macros to implement the getters. The primary advantage of doing it this way is that alignment issues entirely disappear, at a (probably small?) performance cost. — cdhowie, Apr 11 '17 at 22:48

score 1 · Accepted Answer · answered Apr 11 '17 at 22:55

Your primary issue seems to be how to handle messages that come in misaligned. That is, if each message structure doesn't have enough padding on the end of it so that the following message is properly aligned, you can trigger misaligned reads by reinterpreting a pointer to the beginning of a message as an object.

We can get around this a number of ways, perhaps the simplest would be to ntoh based on a single-byte pointer, which is effectively always aligned.

We can hide the nasty details behind wrapper classes, which will take a pointer to the start of a message and have accessors that will ntoh the appropriate field.

As indicated in the comments, it's a requirement that offsets be determined by a C++ struct, since that's how the message is initially created, and it may not be packed.

First, our ntoh implementation, templated so we can select one by type:

template <typename R>
struct ntoh_impl;

template <>
struct ntoh_impl<uint16_t>
{
    static uint16_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint16_t>(d[0]) << 8) |
                d[1];
    }
};

template <>
struct ntoh_impl<uint32_t>
{
    static uint32_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint32_t>(d[0]) << 24) |
               (static_cast<uint32_t>(d[1]) << 16) |
               (static_cast<uint32_t>(d[2]) <<  8) |
               d[3];
    }
};

template<>
struct ntoh_impl<uint64_t>
{
    static uint64_t ntoh(uint8_t const *d)
    {
        return (static_cast<uint64_t>(d[0]) << 56) |
               (static_cast<uint64_t>(d[1]) << 48) |
               (static_cast<uint64_t>(d[2]) << 40) |
               (static_cast<uint64_t>(d[3]) << 32) |
               (static_cast<uint64_t>(d[4]) << 24) |
               (static_cast<uint64_t>(d[5]) << 16) |
               (static_cast<uint64_t>(d[6]) <<  8) |
               d[7];
    }
};

Now we'll define a set of nasty macros that will automatically implement accessors for a given name by looking up the member with the matching name in the struct proto (a private struct to each class):

#define MEMBER_TYPE(MEMBER) typename std::decay<decltype(std::declval<proto>().MEMBER)>::type

#define IMPL_GETTER(MEMBER) MEMBER_TYPE(MEMBER) MEMBER() const { return ntoh_impl<MEMBER_TYPE(MEMBER)>::ntoh(data + offsetof(proto, MEMBER)); }

Finally, we have an example implementation of the message structure you have given:

class Message
{
private:
    struct proto
    {
        uint32_t something1;
        uint16_t something2;
    };

public:
    explicit Message(uint8_t const *p) : data(p) {}
    explicit Message(char const *p) : data(reinterpret_cast<uint8_t const *>(p)) {}

    IMPL_GETTER(something1)
    IMPL_GETTER(something2)

private:
    uint8_t const *data;
};

Now Message::something1() and Message::something2() are implemented and will read from the data pointer at the same offsets they wind up being in Message::proto.

Providing the implementation in the header (effectively inline) has the potential to inline the entire ntoh sequence at the call site of each accessor!

This class does not own the data allocation it is constructed from. Presumably you could write a base class if there's ownership-maintaining details here.

C++ NTOH conversion with dispatcher - event queue

1 Answers1