I have uploaded a simplified HTTP/2 server gist, based on another gist from tatsuhiro-t, to illustrate what looks like a bottleneck. I suspect it is caused by a boost asio limitation, or perhaps by how tatsuhiro-t's wrapper uses that layer underneath.
My gist is a variant of tatsuhiro-t's in which I removed the worker-thread queue structure, because I believe the problem is not on the application side but in the epoll/recvfrom loop (to keep this short: in another test I logged the active worker threads, and they never received the expected amount of work from the boost asio loop, so only 1 or 2 threads were working at any given time).
So, my gist just responds with status code 200 and "done" as the body:
curl --http2-prior-knowledge -i http://localhost:3000
HTTP/2 200
date: Sun, 03 Jul 2022 21:13:39 GMT
done
The gist is built with:
nghttp2 version 1.47.0
boost version 1.76.0
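For reference, the handler in my gist is essentially the stock libnghttp2_asio server example. A minimal sketch of what it boils down to (the actual gist may differ in details):

#include <iostream>
#include <string>
#include <nghttp2/asio_http2_server.h>

using namespace nghttp2::asio_http2;
using namespace nghttp2::asio_http2::server;

int main() {
  boost::system::error_code ec;
  http2 server;

  server.num_threads(1); // hardcoded to 1 (see the note below)

  // Every request is answered with 200 and "done", regardless of any body.
  server.handle("/", [](const request &req, const response &res) {
    res.write_head(200);
    res.end("done");
  });

  if (server.listen_and_serve(ec, "localhost", "3000")) {
    std::cerr << "error: " << ec.message() << std::endl;
  }
  return 0;
}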
Another difference from tatsuhiro-t's gist is that I hardcoded the number of nghttp2 server threads to 1 instead of 2, but this is not important, as I run the h2load benchmark with only 1 client connection (-c1):
h2load -n 1000 -c1 -m20 http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 7.80ms, 128155.84 req/s, 2.94MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 23.48KB (24047) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 102us 487us 138us 69us 92.00%
time for connect: 97us 97us 97us 0us 100.00%
time to 1st byte: 361us 361us 361us 0us 100.00%
req/s : 132097.43 132097.43 132097.43 0.00 100.00%
But the performance impact comes when I send a request body:
Small request, ~22 bytes:
h2load -n 1000 -c1 -m20 -d request22b.json http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 23.83ms, 41965.67 req/s, 985.50KB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 23.48KB (24047) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 269us 812us 453us 138us 74.00%
time for connect: 104us 104us 104us 0us 100.00%
time to 1st byte: 734us 734us 734us 0us 100.00%
req/s : 42444.94 42444.94 42444.94 0.00 100.00%
3K-sized request:
h2load -n 1000 -c1 -m20 -d request3k.json http://localhost:3000/
starting benchmark...
spawning thread #0: 1 total client(s). 1000 total requests
Application protocol: h2c
progress: 10% done
...
progress: 100% done
finished in 27.80ms, 35969.93 req/s, 885.79KB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 24.63KB (25217) total, 1.98KB (2023) headers (space savings 95.30%), 3.91KB (4000) data
min max mean sd +/- sd
time for request: 126us 5.01ms 528us 507us 94.80%
time for connect: 109us 109us 109us 0us 100.00%
time to 1st byte: 732us 732us 732us 0us 100.00%
req/s : 36451.56 36451.56 36451.56 0.00 100.00%
The request size seems to be irrelevant. The big difference is between sending and not sending a request body:
Without a body -> ~132k req/s
With a body -> ~40k req/s
To go into more detail, here are some strace dumps:
$> strace -e epoll_pwait,recvfrom,sendto -tt -T -f -o strace.log ./server
So, executing: h2load -n 1000 -c1 -m20 -d request3k.json http://localhost:3000/
Here is a section of the traces:
21201 21:37:25.835841 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000020>
21201 21:37:25.835999 sendto(7, "\0\0\2\1\4\0\0\5?\210\276\0\0\2\1\4\0\0\5A\210\276\0\0\2\1\4\0\0\5C\210"..., 360, MSG_NOSIGNAL, NULL, 0) = 360 <0.000030>
21201 21:37:25.836128 recvfrom(7, "878\":13438,\n \"id28140\":16035,\n "..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.836233 epoll_pwait(5, [{EPOLLIN, {u32=531974496, u64=140549041768800}}, {EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 2 <0.000019>
21201 21:37:25.836416 sendto(7, "\0\0\2\1\4\0\0\5]\210\276\0\0\2\1\4\0\0\5_\210\276\0\0\4\0\1\0\0\5]d"..., 48, MSG_NOSIGNAL, NULL, 0) = 48 <0.000029>
21201 21:37:25.836553 recvfrom(7, ":21144,\n \"id9936\":30596,\n \"id1"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.836655 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000019>
21201 21:37:25.836786 sendto(7, "\0\0\2\1\4\0\0\5a\210\276\0\0\4\0\1\0\0\5adone", 24, MSG_NOSIGNAL, NULL, 0) = 24 <0.000035>
21201 21:37:25.836916 recvfrom(7, "31206\":15376,\n \"id32760\":5850,\n"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.837015 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000019>
21201 21:37:25.837197 sendto(7, "\0\0\4\10\0\0\0\0\0\0\0\212\373", 13, MSG_NOSIGNAL, NULL, 0) = 13 <0.000029>
21201 21:37:25.837319 recvfrom(7, "9\":27201,\n \"id31098\":5087,\n \"i"..., 8192, 0, NULL, NULL) = 8192 <0.000022>
21201 21:37:25.837426 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000021>
21201 21:37:25.837546 recvfrom(7, "6,\n \"id25242\":11533,\n\n}\n\0\f2\0\1\0\0"..., 8192, 0, NULL, NULL) = 8192 <0.000023>
21201 21:37:25.837667 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}, {EPOLLIN, {u32=531974496, u64=140549041768800}}], 128, 0, NULL, 8) = 2 <0.000022>
21201 21:37:25.837875 recvfrom(7, "20001\":540,\n \"id12240\":10015,\n "..., 8192, 0, NULL, NULL) = 8192 <0.000032>
21201 21:37:25.837996 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000019>
21201 21:37:25.838127 recvfrom(7, "560,\n \"id24256\":11594,\n \"id719"..., 8192, 0, NULL, NULL) = 8192 <0.000020>
21201 21:37:25.838224 epoll_pwait(5, [{EPOLLIN, {u32=531972776, u64=140549041767080}}], 128, 0, NULL, 8) = 1 <0.000018>
21201 21:37:25.838440 sendto(7, "\0\0\4\10\0\0\0\0\0\0\0\2016", 13, MSG_NOSIGNAL, NULL, 0) = 13 <0.000031>
21201 21:37:25.838585 recvfrom(7, "7,\n \"id2996\":10811,\n \"id22651\""..., 8192, 0, NULL, NULL) = 8192 <0.000020>
It seems that recvfrom is called quite late after epoll_pwait (about 300 us later), and it reads at most 8 KiB, which means more data is still queued. There are only a few places where the value read is less than 8 KiB.
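As far as I can tell, that 8 KiB cap comes from the wrapper's fixed read buffer (socket_.async_read_some into an 8 KiB array). Just to illustrate the pattern, here is a bare boost asio sketch that mimics it outside nghttp2 (single-threaded io_context, fixed 8 KiB buffer; the port and other details are only illustrative), which could be used to check whether the same epoll/recvfrom cadence appears without the HTTP/2 layer:

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <memory>
#include <boost/asio.hpp>

using boost::asio::ip::tcp;

class connection : public std::enable_shared_from_this<connection> {
public:
  explicit connection(tcp::socket socket) : socket_(std::move(socket)) {}
  void start() { do_read(); }

private:
  void do_read() {
    auto self = shared_from_this();
    // Same shape as the wrapper's read: whatever is available, up to 8 KiB.
    socket_.async_read_some(
        boost::asio::buffer(buffer_),
        [this, self](const boost::system::error_code &ec, std::size_t n) {
          if (ec) return;          // peer closed or error
          std::printf("read %zu bytes\n", n);
          do_read();               // re-arm the read immediately
        });
  }

  tcp::socket socket_;
  std::array<std::uint8_t, 8192> buffer_; // fixed 8 KiB read buffer
};

int main() {
  boost::asio::io_context io;
  tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 3001));

  // Accept loop; one connection is enough to compare against h2load -c1.
  std::function<void()> do_accept = [&]() {
    acceptor.async_accept(
        [&](const boost::system::error_code &ec, tcp::socket s) {
          if (!ec) {
            std::make_shared<connection>(std::move(s))->start();
          }
          do_accept();
        });
  };
  do_accept();

  io.run(); // single thread, like the server under test
}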
Now, executing h2load -n 1000 -c1 -m20 -d request22b.json http://localhost:3000/ (the small request body), we can see recvfrom calls reading variable sizes, which suggests that everything pending is being collected, but there are also delays of about 300 microseconds around epoll, which seems like too much:
21878 21:39:35.626192 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000017>
21878 21:39:35.626282 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 59999, NULL, 8) = 1 <0.000016>
21878 21:39:35.626345 epoll_pwait(5, [{EPOLLIN, {u32=524917088, u64=140141012818272}}], 128, 59999, NULL, 8) = 1 <0.000253>
21878 21:39:35.626650 recvfrom(7, "\0\0\n\1\4\0\0\1\31\204\206\277\203\276\17\r\00222\0\0\n\1\4\0\0\1\33\204\206\277\203"..., 8192, 0, NULL, NULL) = 1000 <0.000021>
21878 21:39:35.626899 recvfrom(7, 0x7f751f4860d8, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000016>
21878 21:39:35.626991 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000016>
21878 21:39:35.627202 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 0, NULL, 8) = 1 <0.000016>
21878 21:39:35.627423 sendto(7, "\0\0\2\1\4\0\0\1\31\210\276\0\0\2\1\4\0\0\1\33\210\276\0\0\2\1\4\0\0\1\35\210"..., 480, MSG_NOSIGNAL, NULL, 0) = 480 <0.000037>
21878 21:39:35.627552 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000026>
21878 21:39:35.627709 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 59999, NULL, 8) = 1 <0.000020>
21878 21:39:35.627827 epoll_pwait(5, [{EPOLLIN, {u32=524917088, u64=140141012818272}}], 128, 59999, NULL, 8) = 1 <0.000267>
21878 21:39:35.628176 recvfrom(7, "\0\0\n\1\4\0\0\1A\204\206\277\203\276\17\r\00222\0\0\n\1\4\0\0\1C\204\206\277\203"..., 8192, 0, NULL, NULL) = 1000 <0.000016>
21878 21:39:35.628395 recvfrom(7, 0x7f751f4860d8, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000016>
21878 21:39:35.628475 epoll_pwait(5, [], 128, 0, NULL, 8) = 0 <0.000015>
21878 21:39:35.628657 epoll_pwait(5, [{EPOLLIN, {u32=524915368, u64=140141012816552}}], 128, 0, NULL, 8) = 1 <0.000019>
21878 21:39:35.628862 sendto(7, "\0\0\2\1\4\0\0\1A\210\276\0\0\2\1\4\0\0\1C\210\276\0\0\2\1\4\0\0\1E\210"..., 480, MSG_NOSIGNAL, NULL, 0) = 480 <0.000030>
So, what is going on that it takes so long to get new work? This explains why the worker threads are mostly idle and why it makes no difference whether I add them to the server gist or not. I am not sure whether there is something configurable in the boost asio reactors (and this may need to be raised with the tatsuhiro-t wrapper), but I went through its source code and got lost. In the nghttp2 source code, the important part seems to be done here (this other issue could be related to the possible bottleneck).
The boost thread running the epoll loop is not saturating the CPU, and the client side has no problem sending enough traffic (indeed, if you run h2load with many clients (-cN), the server CPU usage stays the same and seems to be wasted in the loop). There are time gaps where I don't know what is going on. Why can h2load send data that the server can't process?
Thank you in advance.
EDIT
As this is confirmed to reproduce only on a VirtualBox machine (I use vagrant to launch it on a linux host), does anyone have an idea of what is causing this performance impact?
I tested the binaries on the host and they behave correctly, and also in a docker image (as expected, because it is the same binary executed on the same kernel).
To troubleshoot further, I added printouts to the boost sources (boost/asio/detail/impl/epoll_reactor.ipp) and also to the tatsuhiro-t lambda mentioned above (see the instrumentation sketch after the trace):
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373265, bytes transferred: 8192
USECS BEFORE deadline_.expires_from_now: 1657370015373274
USECS AFTER deadline_.expires_from_now: 1657370015373278
USECS BEFORE reactor_op::status status = op->perform(): 1657370015373279
USECS ONCE EXECUTED reactor_op::status status = op->perform(): 1657370015373282
USECS AFTER scheduler_.post_immediate_completion(op, is_continuation);: 1657370015373285
USECS EXITING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373287
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373288
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373290
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373295, bytes transferred: 941
USECS BEFORE deadline_.expires_from_now: 1657370015373297
USECS AFTER deadline_.expires_from_now: 1657370015373299
USECS BEFORE reactor_op::status status = op->perform(): 1657370015373300
USECS EXITING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373302
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373305
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373307
USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373312
USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): 1657370015373666
USECS ENTERING socket_.async_read_some LAMBDA ARGUMENT: 1657370015373676, bytes transferred: 8192
...
As you can see, when the bytes transferred are not a full 8 KiB, there are epoll gaps of about 300 us, and that could be related to the other issue referenced above.
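For completeness, the printouts above are nothing fancy: just epoch-microsecond timestamps printed around the calls of interest, roughly like this sketch (the labels in the output describe where each print sits in boost's epoll_reactor.ipp and in the wrapper's lambda):

#include <chrono>
#include <cstdio>

// Epoch time in microseconds, matching the USECS values printed above.
static long long now_usecs() {
  using namespace std::chrono;
  return duration_cast<microseconds>(
             system_clock::now().time_since_epoch())
      .count();
}

// Example placement inside boost/asio/detail/impl/epoll_reactor.ipp:
//   std::printf("USECS BEFORE epoll_wait(epoll_fd_, events, 128, timeout): %lld\n", now_usecs());
//   int num_events = epoll_wait(epoll_fd_, events, 128, timeout);
//   std::printf("USECS AFTER epoll_wait(epoll_fd_, events, 128, timeout): %lld\n", now_usecs());

int main() { std::printf("USECS NOW: %lld\n", now_usecs()); }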