I have uploaded a simplified HTTP/2 server gist, based on another gist from tatsuhiro-t, to illustrate what looks like a bottleneck. I suspect it is caused by a boost asio limitation, or perhaps by how tatsuhiro-t's wrapper uses that layer underneath.
My gist is a variant of tatsuhiro-t's in which I removed the worker-thread queue structure, because I believe the problem is not on the application side but in the epoll/recvfrom loop (to keep this short: in another test I logged the active worker threads, and they never received the expected amount of work from the boost asio loop, so only 1 or 2 threads were working at any given time).
So, my gist just responds with status code 200 and "done" as the body:
curl --http2-prior-knowledge -i http://localhost:3000
HTTP/2 200
date: Sun, 03 Jul 2022 21:13:39 GMT
done
The gist is built with:
nghttp2 version 1.47.0
boost version 1.76.0
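For reference, the handler in my gist is essentially the stock libnghttp2_asio server example. A minimal sketch of what it boils down to (the actual gist may differ in details):

#include <iostream>
#include <string>
#include <nghttp2/asio_http2_server.h>

using namespace nghttp2::asio_http2;
using namespace nghttp2::asio_http2::server;

int main() {
  boost::system::error_code ec;
  http2 server;

  server.num_threads(1); // hardcoded to 1 (see the note below)

  // Every request is answered with 200 and "done", regardless of any body.
  server.handle("/", [](const request &req, const response &res) {
    res.write_head(200);
    res.end("done");
  });

  if (server.listen_and_serve(ec, "localhost", "3000")) {
    std::cerr << "error: " << ec.message() << std::endl;
  }
  return 0;
}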
Another difference from tatsuhiro-t's gist is that I hardcoded the number of nghttp2 server threads to 1 instead of 2, but this is not important, as I run the h2load benchmark with only 1 client connection (-c1):
h2load -n 1000 -c1 -m20 http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 7.80ms, 128155.84 req/s, 2.94MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 23.48KB (24047) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 102us 487us 138us 69us 92.00%
time for connect: 97us 97us 97us 0us 100.00%
time to 1st byte: 361us 361us 361us 0us 100.00%
req/s : 132097.43 132097.43 132097.43 0.00 100.00%
But the performance impact comes when I send a request body:
Small request, ~22 bytes:
h2load -n 1000 -c1 -m20 -d request22b.json http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 23.83ms, 41965.67 req/s, 985.50KB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 23.48KB (24047) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 269us 812us 453us 138us 74.00%
time for connect: 104us 104us 104us 0us 100.00%
time to 1st byte: 734us 734us 734us 0us 100.00%
req/s : 42444.94 42444.94 42444.94 0.00 100.00%
3K-sized request:
h2load -n 1000 -c1 -m20 -d request3k.json http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 27.80ms, 35969.93 req/s, 885.79KB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 24.63KB (25217) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 126us 5.01ms 528us 507us 94.80%
time for connect: 109us 109us 109us 0us 100.00%
time to 1st byte: 732us 732us 732us 0us 100.00%
req/s : 36451.56 36451.56 36451.56 0.00 100.00%
The request size seems to be irrelevant. The big difference is between sending and not sending a request body:
Without a body -> ~132k req/s
With a body -> ~40k req/s
To go into more detail, here are some strace dumps:
$> strace -e epoll_pwait,recvfrom,sendto -tt -T -f -o strace.log ./server
So, executing: h2load -n 1000 -c1 -m20 -d request3k.json http://localhost:3000/
Here is a section of the traces:
21201 21:37:25.835841 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000020>
21201 21:37:25.835999 sendto(7, "\0\0\2\1\4\0\0\5?\210\276\0\0\2\1\4\0\0\5A\210\276\0\0\2\1\4\0\0\5C\210"..., 360, MSG_NOSIGNAL, NULL, 0) = 360 <0.000030>
21201 21:37:25.836128 recvfrom(7, "878\":13438,\n \"id28140\":16035,\n "..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.836233 epoll_pwait(5, [{EPOLLIN, {u32=531974496, u64=140549041768800}}, {EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 2 <0.000019>
21201 21:37:25.836416 sendto(7, "\0\0\2\1\4\0\0\5]\210\276\0\0\2\1\4\0\0\5_\210\276\0\0\4\0\1\0\0\5]d"..., 48, MSG_NOSIGNAL, NULL, 0) = 48 <0.000029>
21201 21:37:25.836553 recvfrom(7, ":21144,\n \"id9936\":30596,\n \"id1"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.836655 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000019>
21201 21:37:25.836786 sendto(7, "\0\0\2\1\4\0\0\5a\210\276\0\0\4\0\1\0\0\5adone", 24, MSG_NOSIGNAL, NULL, 0) = 24 <0.000035>
21201 21:37:25.836916 recvfrom(7, "31206\":15376,\n \"id32760\":5850,\n"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.837015 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000019>
21201 21:37:25.837197 sendto(7, "\0\0\4\10\0\0\0\0\0\0\0\212\373", 13, MSG_NOSIGNAL, NULL, 0) = 13 <0.000029>
21201 21:37:25.837319 recvfrom(7, "9\":27201,\n \"id31098\":5087,\n \"i"..., 8192, 0, NULL, NULL) = 8192 <0.000022>
21201 21:37:25.837426 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000021>
21201 21:37:25.837546 recvfrom(7, "6,\n \"id25242\":11533,\n\n}\n\0\f2\0\1\0\0"..., 8192, 0, NULL, NULL) = 8192 <0.000023>
21201 21:37:25.837667 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000022>
21201 21:37:25.837875 recvfrom(7, "20001\":540,\n \"id12240\":10015,\n "..., 8192, 0, NULL, NULL) = 8192 <0.000032>
21201 21:37:25.837996 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000019>
21201 21:37:25.838127 recvfrom(7, "560,\n \"id24256\":11594,\n \"id719"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.838224 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000018>
21201 21:37:25.838440 sendto(7, "\0\0\4\10\0\0\0\0\0\0\0\2016", 13, MSG_NOSIGNAL, NULL, 0) = 13 <0.000031>
21201 21:37:25.838585 recvfrom(7, "7,\n \"id2996\":10811,\n \"id22651\""..., 8192, 0, NULL, NULL) = 8192 <0.000020>
It seems that recvfrom is called quite late after epoll_pwait (about 300 us later), and it reads at most 8 KiB, which means more data is still queued. There are only a few places where the value read is less than 8 KiB.
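As far as I can tell, that 8 KiB cap comes from the wrapper's fixed read buffer (socket_.async_read_some into an 8 KiB array). Just to illustrate the pattern, here is a bare boost asio sketch that mimics it outside nghttp2 (single-threaded io_context, fixed 8 KiB buffer; the port and other details are only illustrative), which could be used to check whether the same epoll/recvfrom cadence appears without the HTTP/2 layer:

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <memory>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

class connection : public std::enable_shared_from_this<connection> {
public:
  explicit connection(tcp::socket socket) : socket_(std::move(socket)) {}
  void start() { do_read(); }

private:
  void do_read() {
    auto self = shared_from_this();
    // Same shape as the wrapper's read: whatever is available, up to 8 KiB.
    socket_.async_read_some(
        boost::asio::buffer(buffer_),
        [this, self](const boost::system::error_code &ec, std::size_t n) {
          if (ec) return;          // peer closed or error
          std::printf("read %zu bytes\n", n);
          do_read();               // re-arm the read immediately
        });
  }

  tcp::socket socket_;
  std::array<std::uint8_t, 8192> buffer_; // fixed 8 KiB read buffer
};

int main() {
  boost::asio::io_context io;
  tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 3001));

  // Accept loop; one connection is enough to compare against h2load -c1.
  std::function<void()> do_accept = [&]() {
    acceptor.async_accept(
        [&](const boost::system::error_code &ec, tcp::socket s) {
          if (!ec) {
            std::make_shared<connection>(std::move(s))->start();
          }
          do_accept();
        });
  };
  do_accept();

  io.run(); // single thread, like the server under test
}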
Now, executing h2load -n 1000 -c1 -m20 -d request22b.json http://localhost:3000/ (the small request body), we can see recvfrom calls reading variable sizes, which suggests that everything pending is being collected, but there are also delays of about 300 microseconds around epoll, which seems like too much:
21878 21:39:35.626192 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000017>
21878 21:39:35.626282 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 59999, NULL, 8) = 1 <0.000016>
21878 21:39:35.626345 epoll_pwait(5, [{EPOLLIN, {u32=524917088, u64=140141012818272}}], 128, 59999, NULL, 8) = 1 <0.000253>
21878 21:39:35.626650 recvfrom(7, "\0\0\n\1\4\0\0\1\31\204\206\277\203\276\17\r\00222\0\0\n\1\4\0\0\1\33\204\206\277\203"..., 8192, 0, NULL, NULL) = 1000 <0.000021>
21878 21:39:35.626899 recvfrom(7, 0x7f751f4860d8, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000016>
21878 21:39:35.626991 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000016>
21878 21:39:35.627202 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 0, NULL, 8) = 1 <0.000016>
21878 21:39:35.627423 sendto(7, "\0\0\2\1\4\0\0\1\31\210\276\0\0\2\1\4\0\0\1\33\210\276\0\0\2\1\4\0\0\1\35\210"..., 480, MSG_NOSIGNAL, NULL, 0) = 480 <0.000037>
21878 21:39:35.627552 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000026>
21878 21:39:35.627709 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 59999, NULL, 8) = 1 <0.000020>
21878 21:39:35.627827 epoll_pwait(5, [{EPOLLIN, {u32=524917088, u64=140141012818272}}], 128, 59999, NULL, 8) = 1 <0.000267>
21878 21:39:35.628176 recvfrom(7, "\0\0\n\1\4\0\0\1A\204\206\277\203\276\17\r\00222\0\0\n\1\4\0\0\1C\204\206\277\203"..., 8192, 0, NULL, NULL) = 1000 <0.000016>
21878 21:39:35.628395 recvfrom(7, 0x7f751f4860d8, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000016>
21878 21:39:35.628475 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000015>
21878 21:39:35.628657 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 0, NULL, 8) = 1 <0.000019>
21878 21:39:35.628862 sendto(7, "\0\0\2\1\4\0\0\1A\210\276\0\0\2\1\4\0\0\1C\210\276\0\0\2\1\4\0\0\1E\210"..., 480, MSG_NOSIGNAL, NULL, 0) = 480 <0.000030>
So, what is going on that it takes so long to get new work? This explains why the worker threads are mostly idle and why it makes no difference whether I add them to the server gist or not. I am not sure whether there is something configurable in the boost asio reactors (and this may need to be raised with the tatsuhiro-t wrapper), but I went through its source code and got lost. In the nghttp2 source code, the important part seems to be done here (this other issue could be related to the possible bottleneck).
The boost thread running the epoll loop is not saturating the CPU, and the client side has no problem sending enough traffic (indeed, if you run h2load with many clients (-cN), the server CPU usage stays the same and seems to be wasted in the loop). There are time gaps where I don't know what is going on. Why can h2load send data that the server can't process?
Thank you in advance.
EDIT
As this is confirmed to reproduce only on a VirtualBox machine (I use vagrant to launch it on a linux host), does anyone have an idea of what is causing this performance impact?
I tested the binaries on the host and they behave correctly, and also in a docker image (as expected, because it is the same binary executed on the same kernel).
To troubleshoot further, I added printouts to the boost sources (boost/asio/detail/impl/epoll_reactor.ipp) and also to the tatsuhiro-t lambda mentioned above (see the instrumentation sketch after the trace):
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373265, bytes transferred: 8192
USECS BEFORE deadline_.expires_from_now: 1657370015373274
USECS AFTER deadline_.expires_from_now: 1657370015373278
USECS BEFORE reactor_op::status status = op->perform(): 1657370015373279
USECS ONCE EXECUTED reactor_op::status status = op->perform(): 1657370015373282
USECS AFTER scheduler_.post_immediate_completion(op, is_continuation);: 1657370015373285
USECS EXITING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373287
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373288
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373290
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373295, bytes transferred: 941
USECS BEFORE deadline_.expires_from_now: 1657370015373297
USECS AFTER deadline_.expires_from_now: 1657370015373299
USECS BEFORE reactor_op::status status = op->perform(): 1657370015373300
USECS EXITING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373302
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373305
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373307
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373312
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373666
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373676, bytes transferred: 8192
...
As you can see, when the bytes transferred are not a full 8 KiB, there are epoll gaps of about 300 us, and that could be related to the other issue referenced above.
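For completeness, the printouts above are nothing fancy: just epoch-microsecond timestamps printed around the calls of interest, roughly like this sketch (the labels in the output describe where each print sits in boost's epoll_reactor.ipp and in the wrapper's lambda):

#include <chrono>
#include <cstdio>

// Epoch time in microseconds, matching the USECS values printed above.
static long long now_usecs() {
  using namespace std::chrono;
  return duration_cast<microseconds>(
             system_clock::now().time_since_epoch())
      .count();
}

// Example placement inside boost/asio/detail/impl/epoll_reactor.ipp:
//   std::printf("USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): %lld\n", now_usecs());
//   int num_events = epoll_wait(epoll_fd_, events, 128, timeout);
//   std::printf("USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): %lld\n", now_usecs());

int main() { std::printf("USECS NOW: %lld\n", now_usecs()); }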