
I'm benchmarking TCP to study the impact of the amount of data transferred per connection on the resulting bandwidth. I wrote a server and a client in C to measure what I need, then used a Python script to run the experiments many times with different inputs and to gather the results (to get a precision of +/- 1% I ran each test for 100 s, and for each data point I repeated the experiment 33 times to get a decent average).

The results I get seem about right (I can even observe the expected plateau at higher amounts of data transferred per connection), except that the bandwidth is only about 10% of what it should be...

Because I only have access to one computer to run this benchmark, I'm doing the tests on localhost, but that shouldn't be an issue.

Here are my results:

[Graph of benchmark results]

As you can see, the best bandwidth I can get is a bit more than 300 MB/s... But if I run a bandwidth test with iperf on localhost (I made sure to use the same TCP window size), I get about 3 GB/s.

Here is the code of my client:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <strings.h>
#include <unistd.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

// PORT and usage() are defined elsewhere in the program (not shown).

int main(int argc, char const *argv[])
{
    unsigned int size;
    unsigned int timeout;
    int sockfd;
    struct sockaddr_in server_addr;

    if(argc < 4 || argc > 5 ){
        usage();
        exit(1);
    }

    const char * ip = argv[3];

    int port = PORT;
    if(argc == 5) {
        port = atoi(argv[4]);
    }

    size = atoi(argv[1]);
    timeout = atoi(argv[2]);

    unsigned int count = 0;

    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);

    while((end.tv_sec - start.tv_sec) < timeout) {

        // create the socket and verify it succeeded
        sockfd = socket(AF_INET, SOCK_STREAM, 0);
        if (sockfd == -1) {
            perror("Could not create socket");
            exit(1);
        }

        bzero(&server_addr, sizeof(server_addr));

        // assign IP, PORT of the server
        server_addr.sin_family = AF_INET;
        server_addr.sin_addr.s_addr = inet_addr(ip);
        server_addr.sin_port = htons(port);

        // connect the client socket to server socket
        if (connect(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr)) != 0) {
            perror("connection with the server failed");
            exit(1);
        }

        ssize_t nread;
        unsigned int nreadcum = 0;
        char* buf = malloc(size);
        char* bufbuf = buf;

        // read until the whole payload has arrived or the server closes the connection
        while(nreadcum < size){
            nread = read(sockfd, bufbuf, size-nreadcum);
            if (nread <= 0)
                break; // error or connection closed early
            nreadcum += nread;
            bufbuf += nread;
        }
        // close connection
        close(sockfd);
        count++;
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        free(buf);
    }


    // Note: this ignores the tv_nsec part of the timestamps and count*size can
    // overflow; the final bandwidth is recomputed in Python from the raw values.
    uint64_t sec = (end.tv_sec - start.tv_sec);
    double bandwidth = (count*size)/sec;

    printf("%u,%lf,%u,%lu\n", size, bandwidth, count, sec);

    return 0;
}

And here is the code of my server:

#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include <signal.h>
#include <netinet/in.h>
#include <sys/socket.h>

// PORT, usage() and sigint_handler() are defined elsewhere in the program (not shown).
char * sequence_payload(int size);

int serv_sock_fd;

int main(int argc, char const *argv[])
{
    int size;
    struct sockaddr_in serv_addr;
    struct sockaddr_in client_addr;
    int bound_port;

    if(argc != 2){
        usage();
        exit(1);
    }

    size = atoi(argv[1]);

    // assign the global (not a shadowing local) so sigint_handler can close it
    serv_sock_fd = socket(AF_INET,SOCK_STREAM,0);
    if(serv_sock_fd == -1) {
        perror("Failed to open server socket");
        exit(1);
    }
    int reuse = 1;
    setsockopt(serv_sock_fd,SOL_SOCKET,SO_REUSEADDR,&reuse,sizeof(int));

    bzero(&serv_addr, sizeof(serv_addr));

    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = INADDR_ANY;
    serv_addr.sin_port = htons(PORT);

    // Bind socket to the chosen port
    if ((bound_port = bind(serv_sock_fd, (struct sockaddr *) &serv_addr, sizeof(serv_addr))) <0){
        perror("Could not bind socket to local port");
        exit(1);
    }

    // Listen on the port
    if (listen(serv_sock_fd, 16))
    {
        perror("Could not listen");
        exit(1);
    }

    signal(SIGINT, sigint_handler);

    printf("Waiting for connection on %d ...\n", PORT);

    int returned = 1;


    while(returned) {
        printf(".\n");
        int new_socket_fd;
        unsigned int client_addr_len = sizeof(client_addr);
        if ((new_socket_fd = accept(serv_sock_fd, (struct sockaddr *)&client_addr,
                            &client_addr_len))<0) {
                perror("Could not accept client connection");
                exit(1);
            }
        printf("connection received, start sending ... ");
        char * payload = sequence_payload(size);
        returned = write(new_socket_fd, payload, size);
        printf("finished sending\n");
        printf("Returned value = %d\n", returned);
        close(new_socket_fd);
        free(payload);
    }


    close(serv_sock_fd);
    return 0;
}

// build a dummy payload filled with a repeating 0..255 byte pattern
char * sequence_payload(int size) {
    char * payload = malloc(size);
    for (int i = 0; i < size; i++)
    {
        payload[i] = i%256;
    }
    return payload;
}

Basically what my code is doing is:

  • for the server: wait for a client to connect, send it a dummy payload of the wanted size, close the connection to that client, repeat.
  • for the client: open a connection to the server, read what the server is sending until the server closes the connection, repeat until the chosen timeout is reached.

To calculate the bandwidth, I can just do (number_of_connections_completed * size_transferred_per_connection) / duration_of_all_transfers. I use Python to calculate the bandwidth, to be free of any overflow in C.
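For reference, the same calculation can also be done safely in C by widening to floating point before multiplying and by including the tv_nsec part of the clock_gettime timestamps (which the code above drops). This is only an illustrative sketch of that arithmetic, not the Python script I actually use:

#include <stdint.h>
#include <time.h>

// Illustrative sketch: bytes per second from the measured values, avoiding
// unsigned-int overflow and keeping the nanosecond part of the timestamps.
double compute_bandwidth(uint64_t count, uint64_t size,
                         struct timespec start, struct timespec end)
{
    double duration = (double)(end.tv_sec - start.tv_sec)
                    + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return ((double)count * (double)size) / duration;
}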

TL;DR: The bandwidth I get with my C programs is about 10 times lower than what it should be on localhost. What could be the source of that problem?

whitsundale
  • Are you sure your iperf results are in bytes, rather than bits per second? – EOF Apr 11 '20 at 13:45
  • Yes, actually iperf gives the results in bits per second, I get a little above 30 Gbits/s with iperf, which gives about 3 GBytes/s when dividing the result by 8 – whitsundale Apr 11 '20 at 13:57
  • Why are you doing `malloc`, `free`, `connect` and `close` on every iteration when you're trying to measure throughput? You should do as little as possible inside your timing loop, and certainly minimize the number of avoidable syscalls. Connect _once_ and send a load of data. Allocate and initialize your buffer _once_, outside the loop. – Useless Apr 11 '20 at 14:10
  • How long are your tests taking? Your time calculations are completely ignoring the `tv_nsec` component, which won't matter too much if your tests take, say, 1,000 seconds per run (but it's still sloppy), but matters enormously if they take 1 or 2 seconds. – Jonathan Leffler Apr 11 '20 at 14:11
  • How do you estimate what bandwidth to expect? – John Bollinger Apr 11 '20 at 14:20
  • @Useless indeed, it would be better to just send everything in one go, but that's exactly what I want to show with this benchmark: sending lots of small chunks is bad, so you should send as much as possible per connection. However, it shows that after a certain size the gain in bandwidth is negligible, so if you do want to split into multiple connections, my experiment shows at what size it is best to do it. You're right though, malloc and free should be outside the loop; I'm going to fix that and see if that was what caused the issue – whitsundale Apr 11 '20 at 14:23
  • @JonathanLeffler The graph I got was after running every experiment for a duration of 100 s (so the precision should be 1%) and I ran each experiment at least 33 times to get an average. I could go even further, but that's already over 50 hours of experiments, so it's getting quite long... – whitsundale Apr 11 '20 at 14:28
  • It's a question of what you are trying to measure, @whitsundale. Establishing and closing connections takes non-negligible time. Allocating and freeing memory takes non-negligible time. Bandwidth measurements generally do not include the time required to do such things. And cutting corners on timing computation does not average out if the errors it introduces are non-random. – John Bollinger Apr 11 '20 at 14:28
  • It would be worth outlining your test timing strategy in the question – it would avoid needing to answer questions like mine. – Jonathan Leffler Apr 11 '20 at 14:32
  • You have the loop `while(nreadcum < size){ nread = read(sockfd, bufbuf, size-nread); nreadcum += nread; bufbuf+=nread; }` — shouldn't the `size - nread` in the `read()` call reference `nreadcum` instead of `nread`? – Jonathan Leffler Apr 11 '20 at 14:38
  • @JohnBollinger Actually, at first I wasn't sure where to stop before seeing the plateau; my teacher suggested I stop at 10 times the path MTU (which is 65535 on localhost in Linux), so my last data point is at 700 kB. However, iperf gives a bandwidth of 3 to 4 GB/s, so I was expecting to reach at least 90% of that speed. I got 10% instead. – whitsundale Apr 11 '20 at 14:40
  • @JonathanLeffler I fixed it, but it didn't change the bandwidth; the expected number of bytes to read being too big is not much of a problem since it doesn't increase the number of calls to `read` – whitsundale Apr 11 '20 at 15:30
  • It did invite a buffer overflow, though. Since you know how much data the other end is going to send, it probably doesn't overflow in this case, but if you were dealing with an open-ended amount of data, it could cause trouble. – Jonathan Leffler Apr 11 '20 at 15:33
  • I edited the question to add to the code the different improvements suggested so far. I also removed the multiple `printf` calls from the server, which helped quite a bit too. I am now at about 500 MB/s – whitsundale Apr 11 '20 at 15:48
  • General advice: always be measuring. I'd look at a network capture of iperf and compare it to what your client/server is doing. You said that you used the same tcp window size, is that actually reflected on the wire (e.g. I don't see you setting SO_SNDBUF or SO_RECVBUF in your code)? Note that 10 MSS won't get you past the initial congestion window on a modern TCP implementation/configuration, so you are most likely looking at TCP slow start rather than steady state. – JimD. Apr 11 '20 at 15:52
  • @whitsundale you completely misunderstand. I'm not talking about the size of the data you write, I'm talking about all the extra work you're including in the timing loop, despite not being the thing you want to time. Call send as many times as you like, with whatever chunk size you like, but _do not_ call `malloc` and `free` every time unless you're actually profiling your allocator. Do not `connect` and `close` every time, keep the established TCP session in a steady state, unless you're actually profiling TCP startup. – Useless Apr 11 '20 at 16:40
  • Alright, it seems I had misread my results when I first tried removing `malloc` and `free` from my main loop server-side. After removing them from the loop I reach between 4 and 5 GB/s, so it seems that was the issue. Not surprising, since I was making a lot of system calls just to allocate memory that I already had anyway. I'll post the answer. – whitsundale Apr 11 '20 at 16:51
  • @Useless yes, I am profiling TCP startup. `malloc` in the loop was indeed the main issue here. Thanks. – whitsundale Apr 11 '20 at 16:53

1 Answer


malloc and free were the main issue here. Since they end up making system calls, they take a significant amount of time, and since I am measuring the performance of TCP and not that of memory allocation, malloc and free should be outside my profiled loop. The same applies to printf in the server-side loop: while not as bad as malloc, the time it takes to print something to the screen is not something that should be counted when measuring the performance of TCP.
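To illustrate, here is a sketch of what the server's accept loop looks like once the payload is built a single time before the loop and nothing but accept/write/close remains inside it. It reuses sequence_payload from the server code above and is only a sketch, not a drop-in replacement:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

char * sequence_payload(int size);   // defined in the server code above

// Sketch: the payload is allocated once, and no printf/malloc/free runs
// inside the measured accept/write/close loop.
void serve_loop(int serv_sock_fd, int size)
{
    char * payload = sequence_payload(size);   // allocated once
    ssize_t returned = 1;

    while (returned > 0) {
        struct sockaddr_in client_addr;
        socklen_t client_addr_len = sizeof(client_addr);
        int new_socket_fd = accept(serv_sock_fd,
                                   (struct sockaddr *)&client_addr,
                                   &client_addr_len);
        if (new_socket_fd < 0) {
            perror("Could not accept client connection");
            break;
        }
        returned = write(new_socket_fd, payload, size);
        close(new_socket_fd);
    }

    free(payload);                             // freed once, after the loop
}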

whitsundale
  • malloc and free aren't system calls themselves. For large allocations, glibc's malloc implementation will just use `mmap(MAP_ANONYMOUS)` instead of moving the break (brk), and free it with `munmap` instead of adding to a user-space free list. So yes those calls do result in system calls after the allocator does some bookkeeping in user-space. You can verify that with `strace ./my_program` to see all the system calls it makes. – Peter Cordes Apr 11 '20 at 17:32