
I'm trying to rewrite a web crawler in Go (originally written in Python with gevent), but I've hit a wall: no matter what I do, memory consumption climbs fast. For example, the following simple code:

package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
    "strings"
    "time"
)

func readLine(in *bufio.Reader, domains chan<- string) {
    for conc := 0; conc < 500; conc++ {
        input, err := in.ReadString('\n')
        if err == io.EOF {
            break
        }
        if err != nil {
            fmt.Fprintf(os.Stderr, "read(stdin): %s\n", err)
            os.Exit(1)
        }

        input = strings.TrimSpace(input)
        if input == "" {
            continue
        }

        domain := input
        domains <- domain
    }
}

func get(domains <-chan string) {
    url := <-domains
    URLresp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
    }
    if err == nil {
        fmt.Println(url, " OK")
        URLresp.Body.Close()
    }
}

func main() {
    domains := make(chan string, 500)

    inFile, _ := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")
    in := bufio.NewReader(inFile)

    for {
        go readLine(in, domains)
        for i := 0; i < 500; i++ { go get(domains) }
        time.Sleep(100000000)
    }
}

I've tried pprof, but it reports only about 50 MB of heap in use, while the memory consumption shown by the OS's resource monitor keeps skyrocketing.

I also tried creating a custom http Transport with keep-alives disabled, since I found out net/http keeps connections around for reuse, but no luck with that either.
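
Roughly something along these lines, i.e. a client whose Transport has DisableKeepAlives set (a sketch, not the exact code):

// Sketch of the keep-alive experiment: a client whose transport
// does not keep idle connections around for reuse.
tr := &http.Transport{DisableKeepAlives: true}
client := &http.Client{Transport: tr}
URLresp, err := client.Get(url)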

  • You are "endlessly" spawning goroutines (sleep is only 0.1 seconds). You are also using `in` concurrently without synchronization, and all your `readLine()` functions read from the same file, unsynchronized. And you don't check the error opening the file. Nothing good can come out of this. – icza Mar 07 '18 at 12:57
  • Your code is full of bugs. The most disastrous one is the effective forkbomb. You're starting an infinite number of goroutines that read in your domains. – Jonathan Hall Mar 07 '18 at 12:57
  • If you want to reuse connections, you need to read the response bodies. If you don't need to reuse connections, you should close them to free up resources sooner. – JimB Mar 07 '18 at 13:01
  • I can spawn goroutines slower, and the only thing that changes is that memory usage grows slower. It's still growing. Where is the leak? – geraldog Mar 07 '18 at 13:27
  • You need to manage the number of goroutines you spawn - see here: https://gobyexample.com/worker-pools – SwiftD Mar 07 '18 at 13:39
  • For ideas, check [Is this an idiomatic worker thread pool in Go?](https://stackoverflow.com/questions/38170852/is-this-an-idiomatic-worker-thread-pool-in-go/38172204#38172204) – icza Mar 07 '18 at 13:45

2 Answers


Let's consider what's wrong with your code, focusing on your main() function.

func main() {
    domains := make(chan string, 500)

This is good. You create a buffered channel to handle the domain list input. No problem.

    inFile, _ := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")

You open the input file. You should never ignore errors, but we'll ignore that for now.

    in := bufio.NewReader(inFile)

    for {

Here you start an infinite loop. Why?

        go readLine(in, domains)

Here you read up to the next 500 lines from the in reader, sending them to the domains channel, but you do it in the background, which means the next line executes before readLine has had a chance to finish.

        for i := 0; i < 500; i++ { go get(domains) }

Here you call get(domains) 500 times, in parallel. But as explained above, you do this before readLine has completed, so (at least on the first time through the outer loop) most of those goroutines will simply block, because the domains channel is likely still empty. The get() function doesn't handle this case properly, but I'll leave that for you to consider.

        time.Sleep(100000000)

Then you sleep for 0.1 seconds before starting the next iteration of the infinite loop.

    }
}

The infinite loop will then attempt, again, to read the next 500 items from your file, again, in the background. If the first call to readLine takes more than 0.1 seconds to complete, you'll have two copies of readLine simultaneously reading from the same bufio.Reader without synchronization, which is a data race that can corrupt your input or crash the program.

Assuming this is behaving as you expect (although it most certainly and obviously is not), after reading in all the URLs in the file, the program will continue, forever, to spawn an additional 501 goroutines every 0.1 seconds. One goroutine attempts to read more lines from the file, finds that there are none, and exits immediately. The other 500 goroutines will wait, forever, to read a non-existent result from the domains channel. This is your memory "leak".
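
To make that concrete, here is a minimal sketch of a bounded alternative, in the spirit of the worker-pool examples linked in the comments above. It is one way to structure this, not the only way; the worker count, timeout, and file name are placeholders:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "os"
    "strings"
    "sync"
    "time"
)

func main() {
    inFile, err := os.Open("urls.txt") // placeholder path; check the error instead of ignoring it
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer inFile.Close()

    domains := make(chan string)
    client := &http.Client{Timeout: 30 * time.Second} // bound how long any single request can take

    // A fixed pool of workers; each one drains the channel until it is closed.
    var wg sync.WaitGroup
    for i := 0; i < 50; i++ { // 50 workers is arbitrary; tune for your workload
        wg.Add(1)
        go func() {
            defer wg.Done()
            for url := range domains {
                resp, err := client.Get(url)
                if err != nil {
                    fmt.Println(err)
                    continue
                }
                resp.Body.Close()
                fmt.Println(url, "OK")
            }
        }()
    }

    // A single goroutine (here, main) reads the file and feeds the channel,
    // then closes it so the workers can exit.
    scanner := bufio.NewScanner(inFile)
    for scanner.Scan() {
        if line := strings.TrimSpace(scanner.Text()); line != "" {
            domains <- line
        }
    }
    close(domains)
    wg.Wait()
}

With a fixed number of workers, memory use is bounded by the pool size rather than growing with every tick of the outer loop.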

– Jonathan Hall
  • I'm complaining about memory usage, you haven't addressed that one problem I seem to have. Like I said, even if I spawn slow, memory usage will grow a bit slower, but there's still a leak somewhere. – geraldog Mar 07 '18 at 13:34
  • @geraldog, this _is_ your memory leak! Goroutines are not free, and you're running an ever increasing number of them. – JimB Mar 07 '18 at 13:37
  • But they should exit, shouldn't they? Unless the goroutines aren't exiting for some reason? The point is that code shouldn't load in memory if I'm spawning slowly, but it does load in memory, so what gives? Why aren't the goroutines exiting? – geraldog Mar 07 '18 at 13:44
  • I assume all of the above is not exactly as intended. How can you have expectations of memory usage if your program is behaving unpredictably? When you understand what your program is doing (and it makes sense), you can begin to make assumptions about memory consumption. – SwiftD Mar 07 '18 at 13:49
  • It's not behaving unpredictably; it works as intended despite @Flimzy's good notes. It just loads in memory terribly. – geraldog Mar 07 '18 at 13:52
  • @geraldog: what do you mean by "that code shouldn't load in memory"? Each goroutine has a 2k stack, and you're spawning 5000 per second. Even if the code worked correctly, there's no way you're crawling websites that quickly. – JimB Mar 07 '18 at 13:52
  • Spawning goroutines faster than closing them will consume all your memory - this is expected behaviour, but it should not be your intended behaviour. – SwiftD Mar 07 '18 at 13:55
  • @JimB read that again: shouldn't load in memory IF I'M SPAWNING SLOWLY, but it does load in memory etc. – geraldog Mar 07 '18 at 13:55
  • Spawn 1 a month and report back after 5 iterations; I suspect memory usage will stabilise. – SwiftD Mar 07 '18 at 13:56
  • @geraldog: I'm not sure if you understand what the stack is, but each call to "spawn" a goroutine requires an additional 2k. Spawning slower means it grows slower of course, but that still requires more memory. – JimB Mar 07 '18 at 13:56
  • @WebweaverD the problem persists if I spawn slowly, it just grows slowly. – geraldog Mar 07 '18 at 13:56
  • I was joking - the point is I would address some of the issues highlighted and retest. There is too much going on right now to properly evaluate; properly managed, this process won't eat memory. – SwiftD Mar 07 '18 at 13:58
  • @geraldog: If you want to see where the stack memory is being used, and where the goroutines are waiting, look at a stack trace. The fundamental issue here is that your concurrency model doesn't make sense. Instead of claiming that it should work, take a look at more reasonable patterns, like those suggested here: https://stackoverflow.com/questions/38170852/is-this-an-idiomatic-worker-thread-pool-in-go/38172204#38172204 – JimB Mar 07 '18 at 14:01
  • It doesn't make sense to spawn worker processes at a fixed interval if you have no idea how long they will take to run - that's the fundamental flaw in your approach. You need to limit the number of processes that are crawling; once you have that, you can look at other sources of memory use. – SwiftD Mar 07 '18 at 14:04
  • @WebweaverD That sounds reasonable. – geraldog Mar 07 '18 at 14:14
  • The problem was the lack of timeout in net Dial however. It wouldn't let the goroutines die. – geraldog Mar 07 '18 at 14:21
  • I have updated my explanation to spell out the exact nature of the memory consumption problem. – Jonathan Hall Mar 07 '18 at 14:42
  • @Flimzy that's not what was happening, but thanks for the help. – geraldog Mar 07 '18 at 15:07
  • @geraldog: Well, that's what the code you pasted will do. If that's not what's happening, then you're executing different code. If your list of URLs is long enough, it's possible that your program is crashing (or being killed) for other reasons, before it reaches that state. – Jonathan Hall Mar 07 '18 at 15:09

The problem was the lack of a default timeout in Go's net Dial. It would hog resources by never letting the goroutines die. The following works:

c := &http.Client{
    Transport: &http.Transport{
        DisableKeepAlives: true,
        Dial: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).Dial,
        TLSHandshakeTimeout:   10 * time.Second,
        ResponseHeaderTimeout: 10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
    },
}

URLresp, err := c.Get(url)
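
For completeness, a sketch of how get could use such a client. Note that the client (and its Transport) is meant to be created once and shared by all goroutines, rather than rebuilt on every request:

// Hypothetical rewrite of get using a shared *http.Client built as above.
func get(c *http.Client, domains <-chan string) {
    url := <-domains
    URLresp, err := c.Get(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(url, " OK")
    URLresp.Body.Close()
}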
– geraldog
  • I doubt that is really the problem, as the [`DefaultTransport`](https://golang.org/pkg/net/http/#DefaultTransport) already has Dial timeouts set. This also hints that this code is in your `get` goroutine, but you should _never_ discard an http.Transport, as that is also going to cause memory leaks. – JimB Mar 07 '18 at 14:38
  • https://medium.com/@nate510/don-t-use-go-s-default-http-client-4804cb19f779 – geraldog Mar 07 '18 at 14:43
  • Yes, I've read that too, except the point of the article was to set the `Client.Timeout`, which you're not doing. `ResponseHeaderTimeout` partially covers the situation, but still doesn't abort slow/stalled responses. Setting the client timeout or using requests with a cancellation context will prevent any request situation from hanging indefinitely. – JimB Mar 07 '18 at 14:49
  • Setting a timeout won't fix the problem of your goroutines waiting forever for more input on the `domains` channel after the file has been completely read. `url := <-domains` is where your program will be hung, for all of eternity. – Jonathan Hall Mar 07 '18 at 14:50
  • Right, but see https://stackoverflow.com/questions/19696448/default-http-dial-timeout-value-in-golang – geraldog Mar 07 '18 at 14:53
  • @Flimzy that was a quick and dirty example just to demonstrate the problem, but thanks. I have pretty big files though... – geraldog Mar 07 '18 at 14:54
  • @geraldog: The HTTP timeout may well be a bug, and adjusting it may help, but it won't solve your biggest problem. – Jonathan Hall Mar 07 '18 at 14:57
  • @geraldog: I'm not sure why you're referencing that old SO question, but I did link to the current `DefaultTransport` documentation, which shows exactly what timeouts are set and used by default. – JimB Mar 07 '18 at 14:59
  • Right, @JimB. Not sure why this works, then... – geraldog Mar 07 '18 at 15:10