I'm building an application that will download roughly 5000 CSV files concurrently, using goroutines and plain old HTTP GET requests so that the files are fetched in parallel.

I'm currently running into open file limits imposed by OS X.

The CSV files are served over HTTP. Are there any other network protocols I can use to batch the requests into one? I don't have access to the server, so I can't zip them. I'd also prefer not to change the ulimit because, once in production, I probably won't have access to that configuration.

ZiggidyCreative

1 Answer

You probably want to limit the number of active concurrent requests to something more sensible than 5000. One option is to spin up 10 to 20 workers and send individual file URLs to them over a channel.

The HTTP client should reuse connections across requests, provided you always read the entire response body and close it.

Something like this:

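package main

import (
    "io"
    "log"
    "net/http"
    "sync"
)

func main() {
    // The default is 2 idle connections per host; raise it so that
    // every worker can keep its connection for reuse.
    http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 100
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go worker()
    }
    var csvs = []string{"http://example.com/a.csv", "http://example.com/b.csv"}
    for _, u := range csvs {
        ch <- u
    }
    close(ch) // no more work: lets the workers' range loops finish
    wg.Wait()
}

var ch = make(chan string)
var wg sync.WaitGroup

// worker pulls URLs off the channel until it is closed.
func worker() {
    defer wg.Done()
    for u := range ch {
        get(u)
    }
}

func get(u string) {
    resp, err := http.Get(u)
    if err != nil {
        // resp is nil on error, so bail out before touching resp.Body
        log.Printf("get %s: %v", u, err)
        return
    }

    // make sure we always read rest of body, and close.
    // Deferred calls run last-in-first-out, so the drain happens before Close.
    defer resp.Body.Close()
    defer io.Copy(io.Discard, resp.Body) // ioutil.Discard before Go 1.16

    //read and decode / handle it. Make sure to read all of body.
}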
captncraig
  • note that unless you set `MaxIdleConnsPerHost` on the transport, or change `http.DefaultMaxIdleConnsPerHost`, the transport isn't going to be reusing most of the connections. – JimB Aug 01 '17 at 17:41
  • @JimB according to [godoc](https://golang.org/pkg/net/http/#RoundTripper), it looks like http.DefaultTransport has a default value of 100. – captncraig Aug 01 '17 at 17:49
  • nvm, I read wrong. Editing post slightly. `http.DefaultMaxIdleConnsPerHost` is a const. Gotta set it on `http.DefaultTransport`. – captncraig Aug 01 '17 at 17:50
  • You were entirely correct. It will default to 2 per host if you don't override it. At high concurrency, though, I would expect only a small pool of idle connections at any given time. Each goroutine will return one and immediately take it. – captncraig Aug 01 '17 at 17:52
  • Oh yes, forgot that's a const. Also, `http.DefaultTransport` is an interface, you need a type assertion. – JimB Aug 01 '17 at 17:53
  • At high concurrency is exactly where this falls over. The pool of idle connections doesn't have any elasticity, so as soon as you have more than 2 per host in the pool you can end up cycling through new connections very quickly, slowing things down, leaving more connections unserviced, and can easily exhaust the ephemeral port range. – JimB Aug 01 '17 at 17:58
  • Yes, that's why I recommend 10-20 workers max. The number of open connections shouldn't really exceed the number of workers, should it? – captncraig Aug 01 '17 at 18:05
  • That shouldn't be a problem in general; as soon as a goroutine releases a connection, it will take it back up for the next request in the queue, so the number of *idle* connections should never be very high. It shouldn't be cycling through new connections at all. – Adrian Aug 01 '17 at 18:08
  • Yes. Each concurrent Get request will have its own connection, and if they hit the idle pool concurrently all extras get closed, causing that worker to Dial again for the next request. The rule of thumb is to set `MaxIdleConnsPerHost` equal to your expected concurrency. – JimB Aug 01 '17 at 18:08
  • @Adrian: that would be true in a perfect world, but in practice multiple connections hit the idle pool concurrently quite frequently, and once you have to start dialing new connections things slow down even further, causing more connections to hit the idle pool. This isn't some rare issue; it comes up all the time for busy HTTP clients, and it has been referenced here on SO many times as well as on the mailing list. – JimB Aug 01 '17 at 18:15
  • Thanks all. I'm going to work on understanding everything above more deeply. Will implement this and see how it works for me. – ZiggidyCreative Aug 02 '17 at 06:21