I have been trying to access Stack Overflow at a rate of 30 requests per second, but it is not working: I get blocked after a few seconds, even though the Stack Overflow documentation says the maximum rate limit for Stack Exchange is 30 requests per second.

The library I am using is gocolly. Here is my code:

package main

import (
    "fmt"
    "log"
    "strconv"

    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/debug"
)

func finish() {
    fmt.Println("Finish")
}

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("stackoverflow.com"),
        colly.MaxDepth(1),
        colly.Async(true),
        colly.Debugger(&debug.LogDebugger{}),
    )

    c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 10, Delay: 1 * time.Second})
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })
    c.OnHTML("#questions", func(e *colly.HTMLElement) {
        e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
            link := el.ChildAttr("a[href]", "href")
            e.Request.Visit("https://stackoverflow.com" + link)
        })
    })

    for i := 0; i <= 1000; i++ {
        var link = "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
        c.Visit(link)
        c.Wait()
    }

    finish()
}

I hope someone can help me.

1 Answer

Unfortunately, I was not able to reproduce your issue on my machine. Still, I'll point out a few things that should improve your solution. First, let me share my working solution:

package main

import (
    "fmt"
    "log"
    "strconv"
    "time"

    "github.com/gocolly/colly/v2"
)

func finish() {
    fmt.Println("Finish")
}

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("stackoverflow.com"),
        colly.MaxDepth(1),
        colly.Async(true),
        // colly.Debugger(&debug.LogDebugger{}),
    )

    c.Limit(&colly.LimitRule{DomainGlob: "*stackoverflow.*", Parallelism: 8, Delay: 1 * time.Second})
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36")
    })

    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })
    c.OnHTML("#questions", func(e *colly.HTMLElement) {
        e.ForEach(".s-post-summary.js-post-summary", func(i int, el *colly.HTMLElement) {
            link := el.ChildAttr("a[href]", "href")
            e.Request.Visit("https://stackoverflow.com" + link)
        })
    })

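    // Queue all the listing pages first; with Async(true), Visit does not block.
    for i := 0; i <= 29; i++ {
        link := "https://stackoverflow.com/questions?tab=votes&page=" + strconv.Itoa(i)
        c.Visit(link)
    }

    // Block once, after everything has been queued, until all requests finish.
    c.Wait()
    finish()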
}

The changes I made are:

  1. Decreased the number of potential concurrent threads (Parallelism) from 10 to 8.
  2. Used my own User-Agent value.
  3. Moved the c.Wait call outside the for loop.

The last change is the most important, because you misunderstood how c.Wait is used. Basically, it blocks until all the threads (goroutines) that were started before it have finished (e.g. with the settings above you might have up to 8 of them working on your crawl requests at once). If you put this statement inside the loop, every iteration waits for the request it has just started, so the crawl effectively runs synchronously.

You can easily verify this with a couple of runs. If you leave c.Wait inside the for loop, you will notice that the pages are visited in order. If you move it outside the for loop, the pages are visited out of order.

Let me know whether your solution also works with these changes, thanks!
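The same effect can be reproduced without colly or any network access. The sketch below is only an analogy: it uses a plain sync.WaitGroup in place of the collector, and fetch, waitInside and waitOutside are hypothetical names I made up for the illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch stands in for an asynchronous page visit: it records the page
// number once the simulated request "completes".
func fetch(page int, wg *sync.WaitGroup, mu *sync.Mutex, done *[]int) {
	defer wg.Done()
	mu.Lock()
	*done = append(*done, page)
	mu.Unlock()
}

// waitInside waits after every launch, so only one request is ever in flight.
func waitInside(n int) []int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []int
	)
	for page := 0; page < n; page++ {
		wg.Add(1)
		go fetch(page, &wg, &mu, &done)
		wg.Wait() // blocks until the request just started has finished
	}
	return done
}

// waitOutside launches everything first and waits once at the end.
func waitOutside(n int) []int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []int
	)
	for page := 0; page < n; page++ {
		wg.Add(1)
		go fetch(page, &wg, &mu, &done)
	}
	wg.Wait() // blocks until all launched requests have finished
	return done
}

func main() {
	fmt.Println("wait inside loop: ", waitInside(5))  // completion order matches launch order
	fmt.Println("wait outside loop:", waitOutside(5)) // completion order may differ between runs
}
```

With the wait inside the loop the completion order always matches the launch order, because each goroutine has to finish before the next one starts. With a single wait after the loop, the goroutines run concurrently and the order is up to the scheduler.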

ossan