epoll is a Linux 2.6 readiness notification API for sockets, pipes, and special event-, signal-, and timer descriptors which can operate both in level- and edge-triggered mode, although presently only level-triggered behaviour is in accordance with the documentation. As opposed to poll or select, epoll scales O(1) in respect to the number of descriptors and O(N) in respect realized events.
The epoll API is built around 3 functions:
- epoll_create creates a new epoll instance and returns a file descriptor that refers to it. This descriptor can be operated on with the other epoll functions and can be added to a different epoll instance
- epoll_ctl allows file descriptors (sockets, pipes, eventfd, timerfd, signalfd, and epoll) being added and removed to an epoll's set of monitored descriptors, as well as flags of existing descriptors being modified
- epoll_wait will return up to maxevents queued events. If no events are available, it will return zero. If a timeout is provided and no events are available, epoll_wait will block for the duration of the timeout (a value of -1 means forever).
The conceptual idea behind the API is that applications usually have a certain set of descriptors that changes rarely if ever, but which needs to be observed for readiness many times. Also, typically a lot fewer descriptors are ready than open. epoll therefore separates copying the list of descriptors to watch from the actual watching and notifies registered listeners instead of iterating a list of descriptors.
The operation of level-triggered mode (default) is easy, since it is identical of how poll/select works. As long as the resource is ready (e.g. as long as there remains data to be read), every call to epoll_wait will return an event.
The operation of edge-triggered mode (EPOLLET flag) is more complicated, more error-prone, inconsistenly documented, and inconsistently implemented. In epoll(7)
, it is explained in terms of reading partial data causing the next call to epoll_wait to block until new data arrives, but not while some data remains in the buffers. It is therefore recommended to use non-blocking descriptors and reading until EAGAIN is received.
According to The Linux Programming Interface, edge-triggered mode only reports events that happened since the last call to epoll_wait.
In reality, it does a mixture of both (i.e. both reads and epoll_wait reset the status to "not ready"), and it does not work as indicated in respect of several epoll instances listening to the same socket or several threads waiting on the same epoll instance (observed under kernel 2.6.38 with timerfd and eventfd). Although epoll is supposed to signal all waiters upon arrival of an event, in edge-triggered mode it only ever signals a single waiter.