
I'm building a crawler that takes a URL, extracts links from it, and visits each of them up to a certain depth, building a tree of paths on a specific site.

The way I implemented parallelism in this crawler is to visit each newly found URL as soon as it's discovered, like this:

func main() {
    link := "https://example.com"

    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    for link := range ch {
        // seen is a global variable that holds all seen URLs
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // handle the link and create a variable "links" containing the links found inside the page
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

This works fine for relatively small sites, but when I run it on a large one with a lot of links everywhere, I start getting one of these two errors on some requests: `socket: too many open files` and `no such host` (the host is indeed there).

What's the best way to handle this? Should I check for these errors and pause execution for a while when I get them, until the other requests are finished? Or should I specify a maximum number of requests that can be in flight at a time? (That makes more sense to me, but I'm not sure how to code it up exactly.)

D. Anderson
  • You are facing a problem related to the limit on open files per user, which is controlled by the operating system. If you are using Linux/Unix you can probably increase the limit with the ulimit -n 4096 command. This command has a threshold and can't set an arbitrarily high number of open files, so if you want to push it further you need to modify the /etc/security/limits.conf file and set a hard and a soft limit. – Radosław Załuska Oct 18 '17 at 21:07
  • Also, you are starting a goroutine for each link you encounter, and if there are too many of them it at some point defeats the purpose of goroutines and actually takes longer to do the task. You should try having a fixed number of goroutines do the processing and read from a channel instead of starting a new one for each link (see the sketch after these comments). Take a look at https://blog.golang.org/pipelines – Topo Oct 18 '17 at 21:14
  • Or maybe a pattern like https://gobyexample.com/worker-pools? (BTW, your `WaitGroup` usage is quite odd. Add 1 for each goroutine, and defer `Done` from within each goroutine. Anything else is asking for bugs.) – JimB Oct 18 '17 at 21:30
  • The best way to handle it depends on what you want to do. This is a design decision, not a technical problem. – Jonathan Hall Oct 19 '17 at 20:30
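
The comments above suggest a fixed pool of worker goroutines instead of one goroutine per link. A minimal sketch of that idea follows; it is not code from the question or the answer below, and numWorkers, pending, found and fetchLinks are made-up names, with fetchLinks standing in for the real scraping logic:

package main

import (
    "sync"
)

// fetchLinks is a placeholder for the real HTTP fetch and link extraction.
func fetchLinks(url string) []string {
    return nil
}

func main() {
    const numWorkers = 20 // fixed number of goroutines doing the HTTP work

    pending := make(chan string) // URLs handed out to the workers
    found := make(chan string)   // URLs discovered inside pages
    var inFlight sync.WaitGroup  // one count per URL sent into found

    // Fixed pool of workers: only these goroutines open sockets.
    for i := 0; i < numWorkers; i++ {
        go func() {
            for url := range pending {
                for _, l := range fetchLinks(url) {
                    inFlight.Add(1) // the child is now outstanding
                    found <- l
                }
                inFlight.Done() // this URL is fully processed
            }
        }()
    }

    // A single goroutine owns the seen set and an in-memory queue, so no
    // mutex is needed and the workers never block each other.
    go func() {
        seen := make(map[string]bool)
        var queue []string
        for {
            // Only offer a send to the workers when something is queued;
            // sending on a nil channel blocks, so that case is disabled.
            var out chan string
            var next string
            if len(queue) > 0 {
                out = pending
                next = queue[0]
            }
            select {
            case url := <-found:
                if seen[url] {
                    inFlight.Done() // duplicate, nothing more to do
                    continue
                }
                seen[url] = true
                queue = append(queue, url)
            case out <- next:
                queue = queue[1:]
            }
        }
    }()

    inFlight.Add(1)
    found <- "https://example.com"

    inFlight.Wait() // every discovered URL has been crawled or rejected
}

Keeping the deduplication and the queue in one goroutine avoids both a mutex around seen and the deadlock that could occur if the workers and the deduplicator blocked on each other's unbuffered channels.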

1 Answer


The files referred to in the error `socket: too many open files` include threads and sockets (the HTTP requests made to load the web pages being scraped). See this question.

The DNS query also most likely fails because it cannot create a file descriptor; however, the error that is reported is `no such host`.

The problem can be fixed in two ways:

1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls

1) is the simplest solution, but it might not be ideal as it only postpones the problem until you find a website that has more links than the new limit. On Linux you can set this limit with `ulimit -n`.
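
If you go this route, the process can also raise its own soft limit up to the hard limit at startup. This is only a sketch (Linux, using the syscall package; raiseFileLimit is an illustrative name), and it still cannot exceed the hard limit configured with ulimit -Hn or in /etc/security/limits.conf:

package main

import (
    "fmt"
    "syscall"
)

// raiseFileLimit lifts this process's soft open-file limit up to the hard limit.
func raiseFileLimit() error {
    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
        return err
    }
    rl.Cur = rl.Max // the soft limit can be raised at most to the hard limit
    return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}

func main() {
    if err := raiseFileLimit(); err != nil {
        fmt.Println("could not raise the open-file limit:", err)
    }

    var rl syscall.Rlimit
    if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err == nil {
        fmt.Printf("open-file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)
    }
}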

2) is more of a design change. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a little. The most important change is `maxGoRoutines`: with every scraping call that starts, a value is inserted into the channel. Once the channel is full, the next call blocks until a value is removed from the channel. A value is removed from the channel every time a scraping call finishes.

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    link := "https://example.com"

    wg := new(sync.WaitGroup)
    wg.Add(1)

    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    fmt.Println("waiting")
    wg.Wait()
}

//This is the maximum number of concurrent scraping calls running
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)

func deduplicate(ch chan string, wg *sync.WaitGroup) {
    seen := make(map[string]bool)
    for link := range ch {
        // seen holds every URL that has already been queued for crawling
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
    //This allows us to know when all the requests are done, so that we can quit
    defer wg.Done()

    links := doCrawl(link)

    // every link pushed into the queue adds one to the WaitGroup; the
    // matching Done happens in deduplicate (duplicates) or in crawl (new links)
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}

func doCrawl(link string) []string {
    //This limits the maximum number of concurrent scraping requests
    maxGoRoutines <- struct{}{}
    defer func() { <-maxGoRoutines }()

    // handle the link and return the links found inside the page; the sleep
    // and the fabricated return values stand in for the real scraping work
    time.Sleep(time.Second)
    return []string{link + "a", link + "b"}
}
Hein Oldewage