
With the client code below (and a listening web server on port 8088 on this box), I am rarely able to get more than 23,000 hits before this error pops up from the `client.Get()`:

panic: Get http://localhost:8088/: dial tcp 127.0.0.1:8088: can't assign requested address

Oddly, if I decrease the timer delay (e.g. from a millisecond to a microsecond), it takes far more hits to trigger the error: 170,000 or even more.

Looking at the network traffic, each client connection is used only a handful of times before it disconnects (i.e. the client side sends a FIN). So it's clearly making many TCP connections and overflowing the socket table. Given that the Go HTTP docs say keepalives are enabled by default, I wouldn't expect this. A kernel trace shows no errors being emitted by the underlying socket before the close (other than EAGAIN, which is expected and doesn't always precede a socket close).

This is with Go 1.4.2 on OS X (Darwin 14.4.0). Why are the client connections not being reused the whole time?

package main

import (
    "io/ioutil"
    "net/http"
    "runtime"
    "sync"
    "time"
)

var reqnum = 0

// hit issues one GET and reads the whole body so the connection can
// be returned to the keepalive pool for reuse.
func hit(client *http.Client) {
    resp, err := client.Get("http://localhost:8088/")
    if err != nil {
        println(reqnum)
        panic(err)
    }
    defer resp.Body.Close()
    _, err = ioutil.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    reqnum++ // not thread safe, but shouldn't cause errors.
}

func main() {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU())
    client := &http.Client{} // one shared client (and connection pool) for all ten workers
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            ticker := time.NewTicker(time.Microsecond * 1000) // 1000µs = 1ms between requests
            for j := 0; j < 120000; j++ {
                <-ticker.C
                hit(client)
            }
            ticker.Stop()
        }()
    }
    wg.Wait()
}
Allen Luce
  • Oh, good question. I can't explain that *exact* behavior, but something very much like this can happen because of TCP TIME_WAIT. (Ports remain reserved briefly after use, and if you run out of ports, you're hosed.) Will try and find a resource about it. – twotwotwo Aug 04 '15 at 04:37
  • Hmf, I'm less sure the more I look. First, here's background on TCP TIME_WAIT, which can't hurt you regardless: http://stackoverflow.com/questions/337115/setting-time-wait-tcp – twotwotwo Aug 04 '15 at 05:04
  • In theory, I'd think Go's support for keepalives would keep the number of connections small. If the server only allowed, say, 10 requests on a connection (or something on the Go side imposed that limit), you could end up making 12k connections/worker, and you couldn't reuse their ports for a bit because of the left-over "ghosts" of connections in TIME_WAIT state. But that doesn't line up with a higher rate making the run succeed sometimes; a longer delay would *help* you then, since it would allow time for TIME_WAIT to expire on old connections. – twotwotwo Aug 04 '15 at 05:10
  • If shortening the delay somehow caused Go to be able to reuse its connections more often, that could explain how it apparently helped to lower it. But also note: since TIME_WAIT state lasts for a while (possibly 120 seconds, seemingly the default [MSL](http://stackoverflow.com/questions/1216267/ab-program-freezes-after-lots-of-requests-why) on Mac OS X), when you run two tests the leftovers from test 1 could affect test 2, which could be adding noise to your experiments. – twotwotwo Aug 04 '15 at 05:16
  • Finally, as an extra data point (from the [Q linked above](http://stackoverflow.com/questions/1216267/ab-program-freezes-after-lots-of-requests-why)), you have a total of 16384 ports for connections. If they're all in use or in TIME_WAIT state, you're stuck. You could see if lowering your MSL magically cures the problem (which at least gives you a diagnosis), and/or either dig into how keepalive works on the server (or in the Go stdlib), or try to observe when connections are actually being made (would be strace on Linux, not sure what you do on OS X; see the sketch after these comments). – twotwotwo Aug 04 '15 at 05:21
  • Why wouldn't this program make only 10 TCP connections during the entire run time? I'd expect connections to be used continuously [except in certain cases](https://github.com/golang/go/blob/release-branch.go1.4/src/net/http/transport.go#L891) that I don't expect to hit with this program. It should never get to the point where the number of active sockets is a worry. I'm not going to worry about system resource limitations until I have a handle on why there are more than 10 connections being made. – Allen Luce Aug 04 '15 at 07:07
  • 1
    Hrm, eyeballing that source file, one possibility is `MaxIdleConnsPerHost`. That limit is per-`Client`, and you're sharing one among the ten workers. It would also be consistent with a very fast ticker sometimes completing the test: you'll less often have 3+ conns sitting idle if you put your connections back to work 1000x more quickly. Straightforward test is whether either an artificially increased max or a client per worker changes the connection behavior. – twotwotwo Aug 04 '15 at 07:53
  • (`MaxIdleConns` is enforced in `putIdleConn` which is called on L913 as part of the expression sent to `waitForBodyRead`, which then goes into `alive` on L929. Hopeful about the `MaxIdleConns` idea, but if it doesn't pan out, the best I've got is to, ugh, hack a copy of `net/http` to start logging the various bools that go into the close decision until you find which one is to blame.) – twotwotwo Aug 04 '15 at 08:04
  • Currently, OS X puts released loopback ports into (as far as I can find) an unmarked TIME_WAIT queue for 15 seconds. You won't see them with netstat, but you won't be able to make more than 16384 connections in 15 seconds without port allocation errors. – JimB Aug 04 '15 at 13:17
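Following up on the suggestion above about observing when connections are actually made: instead of a kernel trace, you can wrap the transport's `Dial` function and log each new TCP connection. A minimal sketch (the `Dial` hook is the Go 1.4-era API; the `dials` counter is just for illustration):

package main

import (
    "log"
    "net"
    "net/http"
    "sync/atomic"
)

func main() {
    var dials int64 // incremented once per new TCP connection
    client := &http.Client{
        Transport: &http.Transport{
            Dial: func(network, addr string) (net.Conn, error) {
                log.Printf("dial #%d to %s", atomic.AddInt64(&dials, 1), addr)
                return net.Dial(network, addr)
            },
        },
    }
    resp, err := client.Get("http://localhost:8088/")
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()
}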

1 Answer


The error `can't assign requested address` during a `Dial` is caused by running out of local ephemeral ports to use for the client connection. The reason you're running out of ports is simply that you're making too many connections, too fast. What happens when you speed up the connection rate is that you start to catch the idle connections going back into the pool before they are closed. There's a code path that catches these newly idle connections during a `Dial` to return a connection more quickly, but there's no way to deterministically catch these connections every time.

Since you're connecting to only one host (as discussed in the comments), what you need to do is set `Transport.MaxIdleConnsPerHost` a lot higher. You'll need to see where it balances out between having too many open connections and recycling them too quickly.
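For example, a minimal sketch (the value 10 matches the number of workers in the program above; the default, `http.DefaultMaxIdleConnsPerHost`, is only 2, so with ten workers most idle connections get closed instead of pooled):

// A client whose transport keeps up to 10 idle connections per host,
// so every worker's connection can return to the pool for reuse.
client := &http.Client{
    Transport: &http.Transport{
        MaxIdleConnsPerHost: 10,
    },
}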

It may even be advantageous to put a semaphore on the client side to prevent too many simultaneous connections, which would again cause the connections to recycle too quickly.
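One way to sketch that is a counting semaphore built from a buffered channel around the question's `hit` function (the limit of 10 is illustrative, one slot per worker):

// sem allows at most 10 requests in flight at once.
var sem = make(chan struct{}, 10)

func hit(client *http.Client) {
    sem <- struct{}{}        // acquire a slot before issuing the request
    defer func() { <-sem }() // release the slot when the request finishes

    resp, err := client.Get("http://localhost:8088/")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    if _, err := ioutil.ReadAll(resp.Body); err != nil {
        panic(err)
    }
}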

JimB
  • That's totally it! Setting `MaxIdleConnsPerHost` to 10 in the transport makes it so only 10 connections are ever made during the lifetime of that program. Thanks! – Allen Luce Aug 04 '15 at 17:36