
I wrote a multithreaded asynchronous HTTP server in Rust using mio. When I run a load test (using siege), the server works fine during the first run, but once the load test finishes, every subsequent request to the server fails.

With some logging I noticed that every new connection returned by accept() immediately receives a hangup event. The listening socket itself never gets any error or hangup events.

I am running Rust 1.12.0 with mio 0.6 on OS X 10.11 (El Capitan).

Here's the main event loop of my server:

pub fn run(self) {
    let poll = Poll::new().unwrap();
    let server = TcpListener::bind(&SocketAddr::from_str(&self.host).unwrap()).unwrap();
    poll.register(&server, SERVER, Ready::readable(), PollOpt::edge()).unwrap();
    let mut events = Events::with_capacity(1024);
    let mut next_conn: usize = 1;
    let mut workers = Vec::new();
    // Create worker threads.
    for _ in 0..self.num_workers {
        let (tx, rx) = channel();
        let worker_handler = self.event_handler.duplicate();
        thread::spawn(move || {
            Self::process_events(rx, worker_handler);
        });
        workers.push(tx);
    }
    loop {
        println!("Polling...");
        match poll.poll(&mut events, None) {
            Err(e) => panic!("Error during poll(): {}", e),
            Ok(_) => {}
        }
        for event in events.iter() {
            match event.token() {
                SERVER => {
                    println!("Accepting..");
                    match server.accept() {
                        Ok((stream, _)) => {
                            println!("Registering new connection...");
                            match poll.register(&stream,
                                                Token(next_conn),
                                                Ready::readable(),
                                                PollOpt::edge()) {
                                Err(e) => panic!("Error during register(): {}", e),
                                Ok(_) => {
                                    println!("New connection on worker {} ",
                                             next_conn % self.num_workers);
                                    workers[next_conn % self.num_workers]
                                        .send(Msg::NewConn(next_conn, stream))
                                        .unwrap();
                                    next_conn += 1;
                                }
                            }
                        }
                        Err(e) => panic!("Error during accept() : {}", e),
                    }
                }
                Token(id) => {
                    println!("Sending event on conn {} to worker {}",
                             id,
                             id % self.num_workers);
                    workers[id % self.num_workers]
                        .send(Msg::ConnEvent(id, event.kind()))
                        .unwrap();
                }
            }
        }
    }
}

fn process_events(channel: Receiver<Msg>, mut event_handler: Box<EventHandler>) {
    loop {
        let msg = channel.recv().unwrap();
        match msg {
            Msg::NewConn(id, conn) => {
                event_handler.new_conn(id, conn);
            }
            Msg::ConnEvent(id, event) => {
                event_handler.conn_event(id, event);
            }
        }
    }
}

Full code with the example webapp I am using is available on GitHub.

Load test command:

siege -b -c10 -d10 -t20S http://localhost:8080

1 Answer


I don't know why load-testing apps don't document this better. I ran into what is possibly the same problem a few months ago. It sounds like you've hit the ephemeral port limit. Here are some quotes from an article on the subject that summarize the idea:

Whenever a connection is made between a client and server, the system binds that connection to an ephemeral port – a set of ports specified at the high end of the valid port range.

The total number of ephemeral ports available on OS X is 16,383.

Note that this limitation does not affect real-world requests to a live server because each TCP connection is defined by the tuple of source IP, source port, destination IP and destination port – so the ephemeral port limit only applies to a single client / server pair.

In other words, it's happening because you're running the load test from localhost to localhost, and you run out of ephemeral ports after roughly 16,383 connections.
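
As a rough illustration, here is a minimal client sketch (assuming your server is listening on 127.0.0.1:8080, as in your siege command) that opens and immediately drops connections in a loop. Each dropped socket leaves its local port stuck in TIME_WAIT for a while, so the loop eventually fails once the ephemeral range is used up, and the count it prints should land near the limit described above.

use std::net::TcpStream;

fn main() {
    let mut opened: u64 = 0;
    loop {
        // Each connect() takes a fresh ephemeral port on the client side.
        match TcpStream::connect("127.0.0.1:8080") {
            Ok(_stream) => {
                // Dropping the stream closes it, but the local port lingers
                // in TIME_WAIT and can't be reused right away.
                opened += 1;
            }
            Err(e) => {
                // Once the ephemeral range is exhausted, connect() starts
                // failing (typically EADDRNOTAVAIL on OS X).
                println!("connect() failed after {} connections: {}", opened, e);
                break;
            }
        }
    }
}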

There are a couple things you can do to test whether this is the problem:

  1. Have your load tester report the number of connections made. If it's around 16,000, then this is a likely culprit.

  2. Increase the ephemeral port limit and run your load tests again. If you get a higher number of connections, then this is probably the issue. But remember, if this is the problem, it won't be a problem in the wild.

You can see your ephemeral port range using this command:

$ sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
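
On a stock El Capitan install this typically reports the defaults, which is where the 16,383 figure quoted above comes from:

net.inet.ip.portrange.first: 49152
net.inet.ip.portrange.last: 65535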

And you can widen it (lowering the start of the range makes more ports available) using this command:

$ sysctl -w net.inet.ip.portrange.first=32768

After running your tests, you should probably set the port range back to what it was before, since the widened range is non-standard.
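
For example, if the range started at the stock default, restoring it would look like this:

$ sysctl -w net.inet.ip.portrange.first=49152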

  • Two remarks: (1) Linux used to have a bug where the port was unique across all interfaces, but now two interfaces can use the same port number; it may be worth binding to other interfaces than localhost to get more ports; (2) it is possible to configure your system to allocate ephemeral port numbers from a larger pool. Obviously, this does not remove the limit, just pushes it further. – Matthieu M. Oct 18 '16 at 07:56
  • Interesting but my load test is actually pretty small (20 secs, 10 users). I doubt it can reach 16k connections with these settings. – ElefEnt Oct 19 '16 at 02:13
  • 1
    @ElefEnt this isn't a case where you have to *doubt* anything — prove it! 800 connections a second * 20 seconds = 16000. That would be 1 connection every millisecond, which is a very reasonable timeframe. Your load testing tool should **tell you** how many connections were made. You can also consider running `strace` or equivalent to see what errors are reported by the OS during the connection. – Shepmaster Oct 19 '16 at 21:27
  • When I've run load tests, it doesn't take long to get to 16000 connections. And like @Shepmaster commented, you should be able to ask siege to report the number of connections made. – Cully Oct 19 '16 at 22:00
  • @ElefEnt I added some comments about how you can test if this is actually the problem. – Cully Oct 23 '16 at 21:57