
[edit] Seems my question was asked nearly 10 years ago here...

Emulating accept() for UDP (timing-issue in setting up demultiplexed UDP sockets)

...with no clean and scalable solution. I think this could be solved handily by supporting listen() and accept() for UDP, just as connect() is now. [/edit]

In a followup to this question...

Can you bind() and connect() both ends of a UDP connection

...is there any mechanism to simultaneously bind() and connect()?

The reason I ask is that a multi-threaded UDP server may wish to move a new "session" to its own descriptor for scalability purposes. The intent is to prevent the listener descriptor from becoming a bottleneck, similar to the rationale behind SO_REUSEPORT.

However, a bind() call with a new descriptor will take over the port from the listener descriptor until the connect() call is made. That leaves a brief window of opportunity for ingress datagrams to be delivered to the new descriptor's queue instead of the listener's.
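The race can be sketched with Python's socket module standing in for the C calls (the addresses, the hypothetical peer, and the SO_REUSEADDR duplicate-bind behavior are Linux-specific assumptions, not part of the question):

```python
import socket

# The long-lived listener socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
addr = listener.getsockname()

# A hypothetical client the server wants a dedicated "session" socket for.
peer = ("127.0.0.1", 9999)

# New per-session descriptor: bind() to the same local address...
session = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
session.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
session.bind(addr)
# ...window of opportunity: until connect() below completes, the kernel
# may deliver *any* peer's datagrams to `session` instead of `listener`.
session.connect(peer)
# After connect(), `session` only receives datagrams from `peer`;
# everything else goes back to `listener`.
```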

This window is also a problem for UDP servers wanting to employ DTLS. It's recoverable if the clients retry, but not having to would be preferable.

user1715587
  • `The intent is to prevent the listener descriptor from becoming a bottleneck` - Can you describe how you think this might happen? The descriptor will never be the bottleneck - unless you are doing processing on the same thread that has the socket bound, and you do not pull messages off of the OS's queue in time, even then, there is no (bottleneck) per-se, you are just going to throw away incoming data. The thread that binds to the socket should _only_ be listening for the incoming packets and then dispatching them to other worker threads. – Matt Clark Jan 12 '19 at 01:22
  • @MattClark Correct. A single thread servicing the listener descriptor is the potential bottleneck. Cleanly moving a "session" to its own descriptor allows the ingress workload to be distributed across multiple threads. SO_REUSEPORT does something similar, hashing datagrams across sockets bound to the same port, but hashing is also a weakness because it changes as the number of sockets goes up or down. – user1715587 Jan 12 '19 at 02:08
  • @MattClark Just to clarify, I'm taking about servers with dozens of CPUs that process several tens of thousands of datagrams per second. A single thread is going to be a bottleneck even if all it does is copy datagrams from kernel space to user space and hand them off to other threads for processing. – user1715587 Jan 12 '19 at 02:23
  • Thinking on it a bit... since connect() is available for UDP sockets, I wonder why listen() and/or accept() were not made available too. The accept() call could be used to complete a "connected" UDP session, and it could also move the datagram that triggered the accept() to the new descriptor. – user1715587 Jan 12 '19 at 02:48
  • Can you call connect before you call bind? – jschultz410 Nov 02 '19 at 06:02
  • Not a portable solution, but on Darwin, `connectx()` is available: "connectx() may be used as a substitute for cases when bind(2) and connect(2) are issued in succession" – Shayan Shahsiah Aug 23 '23 at 07:27

2 Answers


connect() on UDP does not provide connection demultiplexing.

connect() does two things:

  1. Sets a default address for transmit functions that don't accept a destination address (send(), write(), etc)

  2. Sets a filter on incoming datagrams.

It's important to note that the incoming filter simply discards datagrams that do not match. It does not forward them elsewhere. If there are multiple UDP sockets bound to the same address, some OSes will pick one (maybe random, maybe last created) for each datagram (demultiplexing is totally broken) and some will deliver all datagrams to all of them (demultiplexing succeeds but is incredibly inefficient). Both of these are "the wrong thing". Even an OS that lets you pick between the two behaviors via a socket option is still doing things differently from the way you wanted. The time between bind() and connect() is just the smallest piece of this puzzle of unwanted behavior.
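Both effects can be seen in a small sketch using Python's socket module over loopback (addresses and the timeout value are illustrative):

```python
import socket

recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
recv_sock.settimeout(0.5)

peer_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer_a.bind(("127.0.0.1", 0))
peer_a.settimeout(0.5)
peer_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer_b.bind(("127.0.0.1", 0))

# Effect 2: from now on, only datagrams from peer_a are delivered.
recv_sock.connect(peer_a.getsockname())

peer_a.sendto(b"from A", recv_sock.getsockname())
print(recv_sock.recv(64))       # b'from A' passes the filter

# Effect 1: send() needs no address; connect() set the default.
recv_sock.send(b"ping")
print(peer_a.recvfrom(64)[0])   # b'ping'

peer_b.sendto(b"from B", recv_sock.getsockname())
try:
    recv_sock.recv(64)          # peer_b's datagram is silently discarded
except socket.timeout:
    print("datagram from B was filtered out")
```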

To handle UDP with multiple peers, use a single socket in connectionless mode. To have multiple threads processing received packets in parallel, you can either

  • call recvfrom on multiple threads, each of which processes the data (this works because datagram sockets preserve message boundaries; you'd never do this with a stream socket such as TCP), or
  • call recvfrom on a single thread, which doesn't do any processing, just queues the message to the thread responsible for processing it.
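The first option can be sketched as follows (Python's socket module; the thread and datagram counts, and the 5-second timeout, are arbitrary illustration):

```python
import socket
import threading

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
sock.settimeout(5)

received = []
lock = threading.Lock()

def worker():
    # Each blocked recvfrom() is completed by exactly one datagram;
    # message boundaries make this safe on a datagram socket.
    try:
        data, peer = sock.recvfrom(2048)
    except socket.timeout:
        return
    with lock:
        received.append((data, peer))   # real code would process here

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for i in range(4):
    sender.sendto(b"msg %d" % i, sock.getsockname())

for t in threads:
    t.join()
print(len(received))   # 4 on loopback: one wakeup per datagram
```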

Even if you had an OS that gave you an option for dispatching incoming UDP based on designated peer addresses (connection emulation), doing that dispatching inside the OS is still not going to be any more efficient than doing it in the server application, and a user-space dispatcher tuned for your traffic patterns is probably going to perform substantially better than a one-size-fits-all dispatcher provided by the OS.

For example, a DNS (or DHCP) server transacts with a large number of different hosts, nearly all running on port 53 (67-68) at the remote end. Hashing based on the remote port would therefore be useless; you need to hash on the host. Conversely, a cache server supporting a web application server cluster transacts with a handful of hosts and a large number of different ports. Here hashing on the remote port will be better.
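A hypothetical user-space dispatcher choosing its hash key by traffic pattern might look like this (function names and worker count are invented for illustration):

```python
N_WORKERS = 8

def pick_worker_dns(peer):
    # DNS-style traffic: many hosts, remote port almost always 53,
    # so the port carries no information -- hash on the host only.
    host, port = peer
    return hash(host) % N_WORKERS

def pick_worker_cache(peer):
    # Cache-style traffic: few hosts, many ephemeral ports,
    # so include the port to spread load across workers.
    host, port = peer
    return hash((host, port)) % N_WORKERS
```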

Do the connection association yourself, don't use socket connection emulation.

Ben Voigt
    You are correct that under Linux, the last socket to bind() to the port receives the datagrams. However, once connect() is called for that socket, the filter reduces the datagrams to those from the peer only. At that point, datagrams from other peers are received by the original socket again. I have tested and confirmed this. Using multiple threads to call recvfrom() is a recipe for the "thundering herd" scenario, while using one thread to read and dispatch is not scalable. Your DNS example is specifically why Google has proposed the SO_REUSEPORT option, but its hash is problematic too. – user1715587 Jan 12 '19 at 14:26
  • @user1715587: So you've tested (while is a recipe for future breakage, because you rely on implementation details not contract) some "deliver to last datagram that doesn't filter it out" logic. Have you confirmed the complexity of this logic? Somebody has to read and dispatch the datagrams, and your application will do that more efficiently than the OS. You may have a scalability problem doing it from one thread, but **having the OS do it is even less scalable**. The OS's general purpose demultiplexing will never be as efficient as what a traffic-aware application can achieve. – Ben Voigt Jan 12 '19 at 17:03
  • Why is it a recipe for future breakage? The man page for connect() specifically states... "If the socket sockfd is of type SOCK_DGRAM, then addr is the address to which datagrams are sent by default, and the only address from which datagrams are received." With respect to scalability, the kernel is already efficiently distributing TCP packets based on the connection tuple, and will do so for UDP as well. I disagree that the application can do it better in a single thread. – user1715587 Jan 12 '19 at 18:42
  • @user1715587: TCP isn't comparable. TCP is always demultiplexed, so the OS optimizes for that case. TCP doesn't have to handle multicast or broadcast either. (And the OS doesn't do as good a job with TCP as an application tuned for specific traffic patterns could) – Ben Voigt Jan 12 '19 at 18:55
  • @user1715587: BTW, "thundering herd" doesn't apply to multiple threads calling `recvfrom`. Perhaps you are thinking of multiple threads calling `select()` or `poll()`? – Ben Voigt Jan 12 '19 at 19:00
  • Yes, I'm referring to an application that uses epoll() or an equivalent to service asynchronous sockets. In that (not uncommon) situation, the thundering herd applies. It can be avoided with flags like EPOLLONESHOT or EPOLLEXCLUSIVE, but then we're back to a single non-scalable thread. Although it's intended for multiple processes, the rationale is similar, so I would encourage a look at https://domsch.com/linux/lpc2010/Scaling_techniques_for_servers_with_high_connection%20rates.pdf – user1715587 Jan 12 '19 at 19:09
  • @user1715587: You can use `epoll()` on other threads to handle control ports or whatever, but that's beside the point. For the "tens of thousands of datagrams per second" to a single UDP port you claim you want to process, call `recvfrom()` on one socket on multiple threads. That will avoid the thundering herd. Each incoming datagram will complete one pending I/O operation, waking one thread. **Use your knowledge of the different usage of different ports to treat them differently. The OS can't do that.** – Ben Voigt Jan 12 '19 at 19:12
  • Dedicated recvfrom() threads are not always an option. There are different types of application threading models, and one of the more efficient is a single worker thread per core, kept busy performing event driven tasks (including i/o). – user1715587 Jan 12 '19 at 19:35
  • @user1715587: We're not talking about many different types of applications, we're talking about the one with "tens of thousands of datagrams arriving per second" on a single UDP port. Did you forget [you said that](https://stackoverflow.com/questions/54155900/udp-server-and-connected-sockets/54156768?noredirect=1#comment95143277_54155900)? And particularly, one where you've also said that a single thread can't keep up with `recvfrom()`. When one thread can't keep up with one socket, you absolutely should be *dedicating* multiple threads to that socket. – Ben Voigt Jan 12 '19 at 19:37
  • @user1715587: I also note that with `epoll()` you would have extra processing for each datagram, to determine that it came in on the high-volume port as opposed to one of the other fds being watched with `epoll()`. That hurts scalability; you would not do this in a high-traffic situation. Please stop posing strawman arguments. – Ben Voigt Jan 12 '19 at 19:51
  • This isn't productive anymore. You're reading things into my words that aren't there, and being critical of design approaches without knowing all the details. – user1715587 Jan 12 '19 at 19:56
  • @user1715587: You're thinking I've never heard of these design approaches before. I'm well aware of them. I'm also well aware that you've described a situation where a specifically-tailored solution is helpful. You're attacking that specifically tailored solution based on things that happen in other (different) design patterns. You haven't actually addressed my suggestion. That is a logic fallacy known as "straw man". "Isn't productive anymore" is inaccurate, because that implies it used to be. But you've been dodging my suggestion from the very first comment. – Ben Voigt Jan 12 '19 at 20:00

The issue you describe is one I encountered some time ago while implementing a TCP-like listen/accept mechanism for UDP.

In my case the solution (which turned out to be bad, as I will describe later) was to create one UDP socket to receive any incoming datagrams. When one arrived, that socket was connected to the sender (via recvfrom() with MSG_PEEK followed by connect()) and handed off to a new thread, while a new, unconnected UDP socket was created for the next incoming datagrams. From then on, the new thread (with its dedicated socket) called recv() and handled only that particular channel, while the main socket waited for new datagrams from other peers.
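The mechanism can be sketched like this with Python's socket module (loopback addresses; the duplicate bind via SO_REUSEADDR reflects Linux behavior and is an assumption of this sketch):

```python
import socket

main = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
main.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
main.bind(("127.0.0.1", 0))
port = main.getsockname()[1]

# A client sends the first datagram of a new "session".
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))
client.sendto(b"hello", main.getsockname())

# Peek: learn the sender's address without consuming the datagram.
data, peer = main.recvfrom(2048, socket.MSG_PEEK)

# Connect the receiving socket to that sender (it now serves one channel)...
main.connect(peer)

# ...and create a new, unconnected socket on the same port for other peers.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", port))

print(main.recv(2048))   # b'hello' is still queued on the connected socket
```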

Everything worked well until the incoming datagram rate grew. The problem was that while the main socket was transitioning to the connected state, it buffered not one but several more datagrams (coming from many peers), so the thread created to handle one particular sender ended up reading datagrams not intended for it.

I could not find a clean solution (e.g. creating a new connected socket, instead of connecting the main one, and transplanting the datagram already received on the main socket into the new socket's receive buffer for a later recv()). Eventually, I ended up with N threads, each with its own "listening" socket (using SO_REUSEPORT), with datagram scattering done at the OS level.
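The final approach can be sketched as follows (SO_REUSEPORT requires Linux >= 3.9 or similar; the socket count is arbitrary, and the option is absent on some platforms, hence the guard):

```python
import socket

def make_reuseport_sockets(n, host="127.0.0.1"):
    """Create n UDP sockets sharing one port; the kernel scatters
    incoming datagrams across them by flow hash."""
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT not supported on this platform")
    socks = []
    port = 0
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]   # later sockets reuse the first port
        socks.append(s)
    return socks

# Each returned socket would then be served by its own thread calling recv().
```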

barbsan
  • Exactly what I encountered in my search for a similar solution. And, like the original author, I clearly see that such a thing is possible at the OS level. Calling accept() on a UDP socket could perform the connect and bind ATOMICALLY, meaning that no already-received datagram would go to a newly created descriptor's queue. That's it. Everything else is ALREADY in place in the system, i.e. filtering based on a source address. – neoxic Dec 11 '19 at 02:00
  • Arguments that 'accept()' isn't for UDP are not constructive, since by that logic 'connect()' isn't for UDP either. But 'connect()' is already there as a convenient mechanism, and 'accept()' might easily follow suit. – neoxic Dec 11 '19 at 02:04