
I have just started working with TCP (and all associated libraries) due to the need to implement communication between two processes over an internet connection. My code works but it is very slow compared to what I (perhaps due to lack of experience) would expect given the network latency and bandwidth. Also, I'm sure there are many other things wrong with the code, which is using the UNIX socket API. I would prefer not to use big libraries (such as Boost) for my project unless there is a very good reason.

I include a minimal working example. It is rather long despite my best efforts to shorten it. However, I think most of the problems should be in the first file (tcp_helpers.h), which is only used by the client and server main programs in a fairly obvious way. The functions there are not fully optimized, but I find it hard to believe that is the problem; more likely there are some fundamental flaws in the logic.

I also want to ask some questions relevant to the problem:

  1. For network performance, should I worry about using IPv4 vs IPv6? Could it be that my network dislikes the use of IPv4 somehow and penalizes performance?
  2. Since the Socket API emulates a stream, I would think it does not matter if you call send() multiple times on smaller chunks of data or once on a big chunk. But perhaps it does matter and doing it with smaller chunks (I call send for my custom protocol header and the data separately each time) leads to issues?
  3. Suppose that two parties communicate over a network doing work on the received data before sending their next message (as is done in my example). If the two processes take x amount of time on localhost to finish, they should never take longer than (2*x + (network overhead)) on the real network, right? If x is small, making the computations (i.e. work before sending next message) go faster will not help, right?
  4. My example program takes about 4ms when running on localhost and >0.7 seconds when running on the local (university) network I'm using. The local network has ping times (measured with ping) of ( min/avg/max/mdev [ms] = 4.36 / 97.6 / 405. / 86.3 ) and a bandwidth (measured with iperf) of ~70Mbit/s. When running the example program on the network I get (measured with wireshark filtering on the port in question) 190 packets with an average throughput of 172kB/s and average packet size ~726 Bytes. Is this realistic? To me it seems like my program should be much faster given these network parameters, despite the fairly high ping time.
  5. Looking at the actual network traffic generated by the example program, I started thinking about all the "features" of TCP that are done under the hood. I read somewhere that many programs use several sockets at the same time "to gain speed". Could this help here, for example using two sockets, each for just one-way communication? In particular, maybe somehow reducing the number of ack packets could help performance?
  6. The way I'm writing messages/headers as structs has (at least) two big problems that I already know. First, I do not enforce network byte order. If one communicating party uses big-endian and the other little-endian, this program will not work. Furthermore, due to struct padding (see catb.org/esr/structure-packing/), the sizes of the structs may vary between implementations or compilers, which would also break my program. I could add something like (for gcc) __attribute__((__packed__)) to the structs but that would make it very compiler specific and perhaps even lead to inefficiency. Are there standard ways of dealing with this issue (I've seen something about aligning manually)? (Maybe I'm looking for the wrong keywords.)
// tcp_helpers.h. // NOTE: Using this code is very ill-advised.
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <unistd.h>  // POSIX specific
#include <sys/socket.h> // POSIX specific
#include <netinet/in.h> // POSIX specific
#include <arpa/inet.h> // POSIX specific
#include <cerrno>  // for checking socket error messages
#include <cstdint> // for fixed length integer types
#include <cstdlib> // for exit()
#include <stdexcept> // for std::runtime_error

//////////////////// PROFILING ///////////////////
#include <chrono>
static auto start = std::chrono::high_resolution_clock::now();
void print_now(const std::string &message) {
    auto t2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> time_span = t2 - start;
    std::cout << time_span.count() << ": " << message << std::endl;
}
//////////////////// PROFILING ///////////////////

struct TCPMessageHeader {
    uint8_t protocol_name[4];
    uint32_t message_bytes;
};

struct ServerSends {
    uint16_t a;
    uint32_t b;
    uint32_t c;
};

typedef uint8_t ClientSends;

namespace TCP_Helpers {
    template<typename NakedStruct>
    void send_full_message(int fd, TCPMessageHeader header_to_send, const std::vector<NakedStruct> &structs_to_send) {
        print_now("Begin send_full_message");
        if (header_to_send.message_bytes != sizeof(NakedStruct) * structs_to_send.size()) {
            throw std::runtime_error("Struct vector's size does not match the size claimed by message header");
        }
        int bytes_to_send = sizeof(header_to_send);
        int send_retval;
        const uint8_t *header_bytes = reinterpret_cast<const uint8_t *>(&header_to_send);
        while (bytes_to_send != 0) {
            // resume from the first unsent byte in case of a short send
            send_retval = send(fd, header_bytes + (sizeof(header_to_send) - bytes_to_send), bytes_to_send, 0);
            if (send_retval == -1) {
                int errsv = errno;  // from errno.h
                std::stringstream s;
                s << "Sending data failed (locally). Errno:" << errsv << " while sending header.";
                throw std::runtime_error(s.str());
            }
            bytes_to_send -= send_retval;
        }
        bytes_to_send = header_to_send.message_bytes;
        const uint8_t *data_bytes = reinterpret_cast<const uint8_t *>(structs_to_send.data());
        while (bytes_to_send != 0) {
            // resume from the first unsent byte in case of a short send
            send_retval = send(fd, data_bytes + (header_to_send.message_bytes - bytes_to_send), bytes_to_send, 0);
            if (send_retval == -1) {
                int errsv = errno;  // from errno.h
                std::stringstream s;
                s << "Sending data failed (locally). Errno:" << errsv <<
                  " while sending data of size " << header_to_send.message_bytes << ".";
                throw std::runtime_error(s.str());
            }
            bytes_to_send -= send_retval;
        }
        print_now("end send_full_message.");
    }

    template<typename NakedStruct>
    std::vector<NakedStruct> receive_structs(int fd, uint32_t bytes_to_read) {
        print_now("Begin receive_structs");
        unsigned long num_structs_to_read;
        // ensure the expected message is of non-zero length and a whole multiple of the NakedStruct size
        if (bytes_to_read > 0 && bytes_to_read % sizeof(NakedStruct) == 0) {
            num_structs_to_read = bytes_to_read / sizeof(NakedStruct);
        } else {
            std::stringstream s;
            s << "Message length (bytes_to_read = " << bytes_to_read <<
              " ) specified in header does not divide into required stuct size (" << sizeof(NakedStruct) << ").";
            throw std::runtime_error(s.str());
        }
        // vector must have size > 0 for the following pointer arithmetic to work 
        // (this method must check this in above code).
        std::vector<NakedStruct> received_data(num_structs_to_read);
        int valread;
        while (bytes_to_read > 0)  // todo need to include some sort of timeout?!
        {
            valread = read(fd,
                           ((uint8_t *) (&received_data[0])) +
                           (num_structs_to_read * sizeof(NakedStruct) - bytes_to_read),
                           bytes_to_read);
            if (valread == -1) {
                throw std::runtime_error("Reading from socket file descriptor failed");
            } else {
                bytes_to_read -= valread;
            }
        }
        print_now("End receive_structs");
        return received_data;
    }

    void send_header(int fd, TCPMessageHeader header_to_send) {
        print_now("Start send_header");
        int bytes_to_send = sizeof(header_to_send);
        int send_retval;
        const uint8_t *header_bytes = reinterpret_cast<const uint8_t *>(&header_to_send);
        while (bytes_to_send != 0) {
            // resume from the first unsent byte in case of a short send
            send_retval = send(fd, header_bytes + (sizeof(header_to_send) - bytes_to_send), bytes_to_send, 0);
            if (send_retval == -1) {
                int errsv = errno;  // from errno.h
                std::stringstream s;
                s << "Sending data failed (locally). Errno:" << errsv << " while sending (lone) header.";
                throw std::runtime_error(s.str());
            }
            bytes_to_send -= send_retval;
        }
        print_now("End send_header");
    }

    TCPMessageHeader receive_header(int fd) {
        print_now("Start receive_header (calls receive_structs)");
        TCPMessageHeader retval = receive_structs<TCPMessageHeader>(fd, sizeof(TCPMessageHeader)).at(0);
        print_now("End receive_header (calls receive_structs)");
        return retval;
    }
}

// main_server.cpp
#include "tcp_helpers.h"

int init_server(int port) {
    int server_fd;
    int new_socket;
    struct sockaddr_in address{};
    int opt = 1;
    int addrlen = sizeof(address);
    // Creating socket file descriptor
    if ((server_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {  // socket() returns -1 on failure
        throw std::runtime_error("socket creation failed\n");
    }

    if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT, &opt, sizeof(opt))) {
        throw std::runtime_error("failed to set socket options");
    }
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = INADDR_ANY;
    address.sin_port = htons(port);
    // Forcefully attaching socket to the port
    if (bind(server_fd, (struct sockaddr *) &address, sizeof(address)) < 0) {
        throw std::runtime_error("bind failed");
    }
    if (listen(server_fd, 3) < 0) {
        throw std::runtime_error("listen failed");
    }
    if ((new_socket = accept(server_fd, (struct sockaddr *) &address, (socklen_t *) &addrlen)) < 0) {
        throw std::runtime_error("accept failed");
    }
    if (close(server_fd)) // don't need to listen for any more tcp connections (PvP connection).
        throw std::runtime_error("closing server socket failed");
    return new_socket;
}

int main() {
    int port = 20000;
    int socket_fd = init_server(port);
    while (true) {
        TCPMessageHeader rcv_header = TCP_Helpers::receive_header(socket_fd);
        if (rcv_header.protocol_name[0] == 0)   // using first byte of header name as signal to end
            break;
        // receive message
        auto rcv_message = TCP_Helpers::receive_structs<ClientSends>(socket_fd, rcv_header.message_bytes);
        for (ClientSends ex : rcv_message) // example "use" of the received data that takes a bit of time.
            std::cout <<  static_cast<int>(ex) << " ";
        std::cout << std::endl << std::endl;

        // send a "response" containing 1000 structs of zeros
        auto bunch_of_zeros = std::vector<ServerSends>(500);
        TCPMessageHeader send_header{"abc", 500 * sizeof(ServerSends)};
        TCP_Helpers::send_full_message(socket_fd, send_header, bunch_of_zeros);

    }
    exit(EXIT_SUCCESS);
}
// main_client.cpp
#include "tcp_helpers.h"

int init_client(const std::string &ip_address, int port) {
    int sock_fd;
    struct sockaddr_in serv_addr{};

    if ((sock_fd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        throw std::runtime_error("TCP Socket creation failed\n");
    }
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_port = htons(port);
    // Convert IPv4 address from text to binary form
    if (inet_pton(AF_INET, ip_address.c_str(), &serv_addr.sin_addr) <= 0) {
        throw std::runtime_error("Invalid address/ Address not supported for TCP connection\n");
    }
    if (connect(sock_fd, (struct sockaddr *) &serv_addr, sizeof(serv_addr)) < 0) {
        throw std::runtime_error("Failed to connect to server.\n");
    }
    return sock_fd;
}

int main() {
    // establish connection to server and get socket file descriptor.
    int port = 20000;
    int socket_fd = init_client("127.0.0.1", port);
    for (int i = 0; i < 20; ++i) {  // repeat sending and receiving random data
        // send a message containing 250 structs of zeros
        auto bunch_of_zeros = std::vector<ClientSends>(250);
        TCPMessageHeader send_header{"abc", 250 * sizeof(ClientSends)};
        TCP_Helpers::send_full_message(socket_fd, send_header, bunch_of_zeros);

        // receive response
        TCPMessageHeader rcv_header = TCP_Helpers::receive_header(socket_fd);
        auto rcv_message = TCP_Helpers::receive_structs<ServerSends>(socket_fd, rcv_header.message_bytes);
        for (ServerSends ex : rcv_message) // example "use" of the received data that takes a bit of time.
            std::cout << ex.a << ex.b << ex.c << " ";
        std::cout << std::endl << std::endl;
    }
    auto end_header = TCPMessageHeader{}; // initialized all fields to zero. (First byte of name == 0) is "end" signal.
    TCP_Helpers::send_header(socket_fd, end_header);
    exit(EXIT_SUCCESS);
}
  • If you have a good bandwidth with a large ping and your client is waiting for the server's answer before sending the next packet, it doesn't matter how big your bandwidth is; the ping will have its impact. – user Jul 02 '20 at 07:38
  • Thanks a lot for your comment! You're saying that my execution time is realistic given the ping time? I have no idea how to estimate or assess this. Are there any sources you could point me to? – Adomas Baliuka Jul 02 '20 at 07:40

2 Answers


The first thing I would suspect as a cause of the perceived slowness over TCP is Nagle's algorithm; if it is enabled on your TCP sockets (which it is, by default), then it can add up to 200 ms of latency to a send() call. If it is enabled, try disabling it (as shown in the code below) and seeing if that makes things faster for you.

// Disable Nagle's algorithm for TCP socket (s); TCP_NODELAY is declared in <netinet/tcp.h>
const int enableNoDelay = 1;
if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &enableNoDelay, sizeof(enableNoDelay)) != 0) 
{
   perror("setsockopt");
}

For network performance, should I worry about using IPv4 vs IPv6? Could it be that my network dislikes the use of IPv4 somehow and penalizes performance?

As far as performance goes, IPv4 and IPv6 are similar; their differences lie more in the areas of ease-of-configuration; use whichever is better for your use-case; neither will be significantly faster or slower than the other. (For maximum flexibility, I recommend supporting both; that is easily done under any dual-stack OS by writing your program to use IPv6, and then enabling IPv4-mapped IPv6-addresses so that your IPv6 sockets can communicate over IPv4 also)
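
For example, a dual-stack listening socket can be set up along these lines (a rough sketch, not code from the question; error handling is omitted, and the helper name is made up). The default value of IPV6_V6ONLY differs between operating systems, so it is set explicitly here:

// Sketch: one AF_INET6 listening socket that also accepts IPv4 clients
// (IPv4 peers show up as IPv4-mapped addresses such as ::ffff:192.0.2.1).
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_dual_stack_listener(int port) {
    int fd = socket(AF_INET6, SOCK_STREAM, 0);
    int v6only = 0;  // 0 = also accept IPv4-mapped connections
    setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));
    sockaddr_in6 addr{};
    addr.sin6_family = AF_INET6;
    addr.sin6_addr = in6addr_any;
    addr.sin6_port = htons(port);
    bind(fd, (sockaddr *) &addr, sizeof(addr));
    listen(fd, 3);
    return fd;  // real code should check every return value above
}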

Since the Socket API emulates a stream, I would think it does not matter if you call send() multiple times on smaller chunks of data or once on a big chunk. But perhaps it does matter and doing it with smaller chunks (I call send for my custom protocol header and the data separately each time) leads to issues?

It doesn't matter much when Nagle's algorithm is enabled; Nagle's algorithm is in fact used to collect as much data as possible into a single packet before sending it across the network (analogous to how the parking shuttle at the airport will sometimes wait for a few minutes to collect more passengers before driving to the parking lot). That improves efficiency, since larger packets have a better payload-to-overhead ratio than smaller ones do, but at the cost of increasing latency. Turning off Nagle's algorithm will prevent the delay from occurring, which means that it's more likely that your data will go out to the network right away, but it's also more likely that many of the outgoing packets will be very small. If you want to be optimal, you can manage the enabling and disabling of Nagle's algorithm dynamically, so that you get both the improved efficiency of larger packets and the low latency of immediately sending packets.
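
If you do want to manage it dynamically, the toggle itself is just a setsockopt() call. A minimal sketch (the helper name is made up), assuming a connected socket fd; note that on Linux, setting TCP_NODELAY is documented to also flush any output Nagle is currently holding back:

// Sketch: batch several small send() calls with Nagle enabled, then toggle
// TCP_NODELAY on to push out whatever the kernel is still holding back.
#include <netinet/tcp.h>
#include <sys/socket.h>

void set_nodelay(int fd, bool on) {
    int flag = on ? 1 : 0;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));  // check the return value in real code
}

// usage: set_nodelay(fd, false); /* several small send() calls */ set_nodelay(fd, true);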

Suppose that two parties communicate over a network doing work on the received data before sending their next message (as is done in my example). If the two processes take x amount of time on localhost to finish, they should never take longer than (2*x + (network overhead)) on the real network, right? If x is small, making the computations (i.e. work before sending next message) go faster will not help, right?

TCP isn't a real-time protocol; in particular it prioritizes correct transmission over bounded transmission time. That means any TCP transmission can, in principle, take any amount of time to complete, since the job isn't done until the data gets to the receiving program, and if the network is dropping the packets, the TCP stack will have to keep resending them until they finally get there. You can test this yourself by setting up a TCP data transfer between one computer and another and then pulling out the Ethernet cable for a few seconds during the transfer -- note that the transmission "pauses" when the cable is disconnected, and then resumes (starting slowly and building up speed again), without any data loss, after the cable is reconnected.

That said, it sounds like a case of Amdahl's Law, which (broadly paraphrased) says that speeding up a part of an operation that already takes little time won't reduce the total time of the full sequence much, since the slow parts of the sequence remain unchanged and still represent the bulk of the time spent. That sounds like the case in your example.

My example program takes about 4ms when running on localhost and >0.7 seconds when running on the local (university) network I'm using. The local network has ping times (measured with ping) of ( min/avg/max/mdev [ms] = 4.36 / 97.6 / 405. / 86.3 ) and a bandwidth (measured with iperf) of ~70Mbit/s. When running the example program on the network I get (measured with wireshark filtering on the port in question) 190 packets with an average throughput of 172kB/s and average packet size ~726 Bytes. Is this realistic?

It sounds sub-optimal to me; if you can run another program (e.g. iperf or scp or whatever) that uses TCP to transfer data at 70Mbit/sec, then there is no reason your own program shouldn't be able to do the same thing on the same hardware, once it has been properly written and the bottlenecks removed. But you won't usually get optimum performance from a naively written program; it will require some tuning and understanding of what the bottlenecks are and how to remove them, first.

To me it seems like my program should be much faster given these network parameters, despite the fairly high ping time.

Keep in mind that if program A sends data to program B and then waits for program B to respond, that requires a full round-trip across the network, which in the optimal case will be twice the network's ping time. If Nagle's algorithm is enabled on both sides, it could end up being as much as 400 ms longer than that.
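
As a rough sanity check against the numbers in the question: the client does 20 request/response rounds, and each round needs at least one full round trip before the next can start, so with ping times between ~4 ms (min) and ~98 ms (average) the round trips alone could plausibly account for anywhere from roughly 0.09 s to 2 s of total run time; the observed >0.7 s falls inside that range.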

Looking at the actual network traffic generated by the example program, I started thinking about all the "features" of TCP that are done under the hood. I read somewhere that many programs use several sockets at the same time "to gain speed". Could this help here, for example using two sockets, each for just one-way communication? In particular, maybe somehow reducing the number of ack packets could help performance?

Not really, no. Regardless of how many (or how few) TCP connections you set up, all the data has to go across the same physical hardware; so having multiple TCP connections just divides up the same-sized pie into smaller slices. The only time it might be helpful is if you want the ability to deliver messages out of order (e.g. to send high-priority command messages asynchronously alongside your bulk data transfer), since a single TCP connection always delivers data in strict FIFO order, whereas the data in TCP connection B can often go ahead and be sent right now, even if there is a big traffic backlog in TCP connection A.

I wouldn't try to implement this until you have more experience with TCP; high bandwidth and low latency is possible using a single TCP connection, so get that optimized first, before trying anything more elaborate.

Keep in mind also that if you are doing bi-directional communication and using blocking I/O calls to do it, then whenever a program is blocking inside recv(), it has to wait until some data has been received before the recv() call will return, and during that time it can't be calling send() to feed more outgoing data to the network. Similarly, anytime the program is blocked inside of send() (waiting for the socket's outgoing-data-buffer to drain enough to fit the data from the send() call into it), the program is blocked and can't do anything until send() returns; in particular it can't call recv() to receive incoming data during that time. This half-duplex behavior can limit data throughput significantly; ways around it include using non-blocking I/O calls rather than blocking I/O, or using multiple threads, or using asynchronous I/O calls (any of those options will require significant redesign of the program, though).
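
One common way around that half-duplex pattern is to multiplex with poll(), so the program only calls recv() or send() when the socket is actually ready for that direction. Below is only a sketch of the idea (the function name is made up, error handling is omitted, and it is not a drop-in replacement for the question's code):

// Sketch: service one socket without blocking on the "wrong" direction.
// Returns false once the peer has closed the connection.
#include <poll.h>
#include <sys/socket.h>
#include <cstddef>
#include <cstdint>

bool pump_socket_once(int fd, const uint8_t *out_buf, size_t out_len, size_t &out_sent) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;
    if (out_sent < out_len)
        pfd.events |= POLLOUT;      // only ask for writability while data is queued
    if (poll(&pfd, 1, 1000) <= 0)   // wait at most 1 second
        return true;                // timeout (or error) -- caller decides what to do
    if (pfd.revents & POLLIN) {
        uint8_t in_buf[4096];
        ssize_t n = recv(fd, in_buf, sizeof(in_buf), 0);
        if (n == 0)
            return false;           // peer closed the connection
        // ... hand any received bytes to the application here ...
    }
    if (pfd.revents & POLLOUT) {
        ssize_t n = send(fd, out_buf + out_sent, out_len - out_sent, 0);
        if (n > 0)
            out_sent += n;          // remember how much of the outgoing buffer is done
    }
    return true;
}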

Are there standard ways of dealing with [endian-ness and alignment/packing issues] (I've seen something about aligning manually)? (Maybe I'm looking for the wrong keywords.)

There are standard (or at least, publicly available) ways to handle these issues; the keyword you want is "data serialization"; i.e. the process of turning a data object into a well-defined series of bytes (so that you can send the bytes over the network), and then "data deserialization" (where the receiving program converts that series of bytes back into a data object identical to the one that the sender sent). These steps aren't rocket-science but they can be a bit tricky to get 100% right, so you might look into a prepared solution like Google's Protocol Buffers library to handle the tedious parts for you. But if you're really keen to do it all yourself, have a look at this question and its answers for some examples of how you might accomplish that.
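
If you do end up rolling it yourself, the core idea is simply to copy each field into a byte buffer in a fixed, documented order and byte order, instead of sending the struct's in-memory representation. A hand-written (de)serializer for the ServerSends struct from the question might look roughly like this (a sketch, not production code; the function names are made up):

// Sketch: explicit (de)serialization of ServerSends into a fixed 10-byte wire
// format, independent of struct padding and host endianness.
#include <arpa/inet.h>  // htons/htonl/ntohs/ntohl
#include <cstdint>
#include <cstring>

void serialize_server_sends(const ServerSends &in, uint8_t out[10]) {
    uint16_t a = htons(in.a);
    uint32_t b = htonl(in.b);
    uint32_t c = htonl(in.c);
    std::memcpy(out + 0, &a, 2);   // bytes 0-1: field a, big-endian
    std::memcpy(out + 2, &b, 4);   // bytes 2-5: field b, big-endian
    std::memcpy(out + 6, &c, 4);   // bytes 6-9: field c, big-endian
}

ServerSends deserialize_server_sends(const uint8_t in[10]) {
    uint16_t a;
    uint32_t b, c;
    std::memcpy(&a, in + 0, 2);
    std::memcpy(&b, in + 2, 4);
    std::memcpy(&c, in + 6, 4);
    ServerSends out{};
    out.a = ntohs(a);
    out.b = ntohl(b);
    out.c = ntohl(c);
    return out;
}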

  • The settings (such as Nagle's algorithm) need to be applied to every socket separately and are not inherited by sockets that come from accepted connections, correct? Under this assumption, disabling Nagle's algorithm worked for me and gave a significant performance increase (something like 5x faster). Do you think by using UDP another *significant* performance increase may be possible? In any case, thanks for the help! I think I will still post my network code on codereview since it continues to look very fishy. – Adomas Baliuka Jul 09 '20 at 18:54
  • Correct, each socket has its own independent Nagle-enabled/disabled setting. – Jeremy Friesner Jul 09 '20 at 18:58
  • Using UDP might or might not allow you to improve performance further; the main difference in performance will come when a packet gets dropped. In TCP, when that happens, data-transfer of the stream temporarily pauses while the dropped packet is re-sent -- since TCP data is always transferred reliably and in FIFO order, there is no other option. With UDP, a dropped packet is simply ignored, and life goes on without the receiver ever getting the data that was in that packet. If that's okay with you, then UDP might be a good option; OTOH if you need to receive all the data, TCP is better. – Jeremy Friesner Jul 09 '20 at 19:00
  • TCP also manages the transmission-rate of the transfer to match it as closely as possible to the available bandwidth, by reducing the transfer rate once packets start getting dropped and increasing it when they aren't, until it finds the (hopefully) optimal transmission rate. With UDP, the sender can send packets as quickly or as slowly as it likes, but sending them too quickly can cause lots of packets to start getting dropped due to congestion, which somewhat defeats the purpose. – Jeremy Friesner Jul 09 '20 at 19:02

You care about latency, so the first thing to do is always make sure Nagle's algorithm is disabled, with TCP_NODELAY. The other answer shows how.

Nagle's algorithm explicitly optimises for throughput at the expense of latency, when you want the opposite.

I also want to ask some questions relevant to the problem:

I wish you wouldn't - it makes this question a monster to answer completely.

  1. For network performance, should I worry about using IPv4 vs IPv6? Could it be that my network dislikes the use of IPv4 somehow and penalizes performance?

There's no obvious reason it should matter, and if anything the v4 stack may be better optimized because it is still (at the time of writing) more heavily used.

If you want to test, though, you're already using iperf - so compare v4 and v6 performance on your network yourself. Ask a separate question about it if you don't understand the results.

  2. Since the Socket API emulates a stream, I would think it does not matter if you call send() multiple times on smaller chunks of data or once on a big chunk. But perhaps it does matter and doing it with smaller chunks (I call send for my custom protocol header and the data separately each time) leads to issues?

Of course it makes a difference.

Firstly, consider that the network stack needs somehow to decide how to divide that stream into packets. With Nagle's algorithm, this is done by waiting for a timer (or the next ack, which is why it interacts with the client's delayed ack timer as well). With TCP_NODELAY, each call to send() will typically result in its own packet.

Since packets have headers, sending the same amount of user data in more packets uses more network bandwidth. By default, the tradeoff between latency and throughput efficiency is handled by Nagle's algorithm and the delayed ack timer. If you disable Nagle's algorithm, you control the tradeoff manually so you can do what is best for your program - but it is a tradeoff, and requires some thought and effort.

Secondly, the call to send() itself is not free. System calls are more expensive than user-space library calls.
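
One way to keep the header and the payload as separate objects in your code, while still handing them to the kernel in a single system call (and typically a single packet with TCP_NODELAY), is scatter/gather I/O with writev(). A rough sketch (the function name is made up), with short-write and error handling left to the caller:

// Sketch: send header + payload with one system call instead of two send() calls.
#include <sys/uio.h>  // writev
#include <cstddef>

ssize_t send_header_and_payload(int fd, const void *hdr, size_t hdr_len,
                                const void *payload, size_t payload_len) {
    iovec iov[2];
    iov[0].iov_base = const_cast<void *>(hdr);      // iov_base is declared non-const
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = const_cast<void *>(payload);
    iov[1].iov_len = payload_len;
    return writev(fd, iov, 2);  // may still write fewer bytes; the caller must loop
}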

  3. Suppose that two parties communicate over a network doing work on the received data before sending their next message (as is done in my example). If the two processes take x amount of time on localhost to finish, they should never take longer than (2*x + (network overhead)) on the real network, right? If x is small, making the computations (i.e. work before sending next message) go faster will not help, right?

Your estimate looks plausible, but - time is time. Just because total latency is dominated by the network, doesn't mean a speedup to your local computations has no effect.

If you make the computation 1ns faster, it's still 1ns faster overall even if the network latency is 10ms. You also simply have less direct control over the network latency, so may need to save time where you're able.

  4. ... To me it seems like my program should be much faster given these network parameters, despite the fairly high ping time.

Yes it should - try again with TCP_NODELAY and the correct number of send() calls.

  5. ... Could this help here, for example using two sockets, each for just one-way communication? In particular, maybe somehow reducing the number of ack packets could help performance?

Acks are essentially free for symmetric two-way communication, due to the delayed ack timer (they piggyback on the data flowing in the other direction). Your Wireshark investigation should have shown this. They are not free for one-way streams, so splitting the traffic into two one-way sockets is much worse.

  6. The way I'm writing messages/headers as structs has (at least) two big problems that I already know. First, I do not enforce network byte order. If one communicating party uses big-endian and the other little-endian, this program will not work. Furthermore, due to struct padding (see catb.org/esr/structure-packing/), the sizes of the structs may vary between implementations or compilers, which would also break my program. I could add something like (for gcc) __attribute__((__packed__)) to the structs but that would make it very compiler specific and perhaps even lead to inefficiency. Are there standard ways of dealing with this issue (I've seen something about aligning manually)? (Maybe I'm looking for the wrong keywords.)

There are so many standard ways of handling these issues, there is nothing resembling a single standard.

  • Endianness - the simplest approach is to take your current host's native byte order and use that. If you later connect a host with a different byte order, that host will need to do extra work, but it may well never happen, and you defer the extra effort.

  • Padding:

    Using __attribute__((packed)) or #pragma pack certainly can cause some inefficiency, but it's convenient. Just note that pointers and references to misaligned fields are not required to work correctly, so these structs are not really general-purpose.

    Manual padding is do-able but tedious. You just need to figure out the actual alignment of each field in your natively laid-out struct, and then insert padding bytes so that no other implementation could lay it out differently. You may be able to use the alignas specifier to achieve the same thing in a nicer way.

    A simple way to get most of your alignment for free is to always arrange fields from largest to smallest (both size and alignment, but they're usually correlated).

  • Generally, serialization is the name given to converting native data to a wire format (and deserialization for the converse). This covers the whole gamut from converting your data to/from JSON strings for very wide compatibility to sending precisely-laid-out binary data. Your latency constraints put you at the latter end; a small layout-check sketch follows below.
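
To tie the padding and serialization points together: if you do decide to keep sending structs directly (as the question's code does), it is worth at least asserting at compile time that the layout matches what you think is on the wire. A small sketch using the question's TCPMessageHeader; the static_asserts fail loudly on any compiler that lays the struct out differently:

// Sketch: verify the assumed wire layout of TCPMessageHeader at compile time.
#include <cstddef>  // offsetof

static_assert(sizeof(TCPMessageHeader) == 8,
              "unexpected padding: TCPMessageHeader is not 8 bytes on this compiler");
static_assert(offsetof(TCPMessageHeader, protocol_name) == 0, "unexpected field offset");
static_assert(offsetof(TCPMessageHeader, message_bytes) == 4, "unexpected field offset");

This only guards the layout; byte order still has to be handled separately, e.g. with the htons/htonl family.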
