I'm building a crawler that takes a URL, extracts the links from that page, and visits each of them up to a certain depth, building a tree of paths for a specific site.
I implemented parallelism by visiting each newly found URL as soon as it is discovered, like this:
func main() {
    link := "https://example.com"
    wg := new(sync.WaitGroup)
    wg.Add(1)
    q := make(chan string)
    go deduplicate(q, wg)
    q <- link
    wg.Wait()
}
func deduplicate(ch chan string, wg *sync.WaitGroup) {
    for link := range ch {
        // seen is a global variable that holds all seen URLs
        if seen[link] {
            wg.Done()
            continue
        }
        seen[link] = true
        go crawl(link, ch, wg)
    }
}
func crawl(link string, q chan string, wg *sync.WaitGroup) {
    // balance the wg.Add made for this link, otherwise wg.Wait() in main never returns
    defer wg.Done()
    // handle the link and create a variable "links" containing the links found inside the page
    wg.Add(len(links))
    for _, l := range links {
        q <- l
    }
}
This works fine for relatively small sites, but when I run it on a large one with lots of links on every page, some requests start failing with one of these two errors: "socket: too many open files" and "no such host" (the host does exist).
What's the best way to handle this? Should I check for these errors and, when they occur, pause for a while until the other requests have finished? Or should I cap the number of requests that can be in flight at any one time? The second option makes more sense to me, but I'm not sure exactly how to code it.