3

I am planning to write my scraper in V, and I need to send an estimated ~2500 requests per second, but I can't figure out what I'm doing wrong. It should be sending the requests concurrently, yet it is dead slow right now. It feels like I'm doing something really wrong, but I can't figure it out.

import net.http
import sync
import time

fn send_request(mut wg sync.WaitGroup) ?string {
    start := time.ticks()
    data := http.get('https://google.com')?
    finish := time.ticks()
    println('Finish getting time ${finish - start} ms')
    wg.done()
    return data.text
}



fn main() {
    mut wg := sync.new_waitgroup()
    for i := 0; i < 50; i++ {
        wg.add(1)
        go send_request(mut wg)
    }
    wg.wait()
}

Output:

...
Finish getting time 2157 ms
Finish getting time 2173 ms
Finish getting time 2174 ms
Finish getting time 2200 ms
Finish getting time 2225 ms
Finish getting time 2380 ms
Finish getting time 2678 ms
Finish getting time 2770 ms

V Version: 0.1.29

System: Ubuntu 20.04

Yagiz Degirmenci
  • I think the OS has limitations on how many connections can be made at once, too. I've run into this on Linux and Mac. That being said, you may want to implement a queue to do, say, 100 at a time. – chovy Jan 01 '21 at 00:13
  • You might need a proxy rotator; your IP is probably being throttled. Does `net.http` in `v` support proxies? – chovy Jan 21 '21 at 07:28

3 Answers

2

You're not doing anything wrong. I'm getting similar results in multiple languages and in multiple ways. Many sites run rate-limiting software that prevents repeated reads like this, and that's what you're running up against.

You could try using channels now that they're in the language, but you'll still run up against the rate limiter.
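
For reference, here is a minimal sketch of that channel-based approach (my own illustration, not taken from the V examples). It assumes a V version with channels, where the response body is in `.body` (older releases used `.text`), and it uses the `go` keyword as in the question (newer V spells this `spawn`):

import net.http

fn fetch(url string, results chan int) {
    resp := http.get(url) or {
        eprintln('request failed: ${err}')
        results <- 0 // still send, so the receiver does not block forever
        return
    }
    results <- resp.body.len
}

fn main() {
    urls := ['https://example.com', 'https://example.org']
    // buffered channel so the workers never block on send
    results := chan int{cap: urls.len}
    for url in urls {
        go fetch(url, results)
    }
    // collect exactly one result per spawned request
    for _ in 0 .. urls.len {
        n := <-results
        println('received ${n} bytes')
    }
}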

Major
  • I also suspected it could be a network bottleneck, but I was getting better results even with an interpreted language like Python, and it was way slower than my Go version's performance. There shouldn't be much difference, since it's heavily network operations and a few socket API calls, but channels sound worth trying, thank you – Yagiz Degirmenci Nov 28 '20 at 11:03
1

Having worked a lot with V's concurrency in recent weeks, the best way I've found to do this is to use a pool processor.

A snippet from v/examples:

import net.http
import json
import sync.pool

// The fields of a Hacker News story item that we decode.
struct Story {
    title string
    url   string
}

fn worker_fetch(mut p pool.PoolProcessor, cursor int, worker_id int) voidptr {
    id := p.get_item[int](cursor)
    resp := http.get('https://hacker-news.firebaseio.com/v0/item/${id}.json') or {
        println('failed to fetch data from /v0/item/${id}.json')
        return pool.no_result
    }
    story := json.decode(Story, resp.body) or {
        println('failed to decode a story')
        return pool.no_result
    }
    println('# ${cursor}) ${story.title} | ${story.url}')
    return pool.no_result
}

// Fetches top HN stories in parallel, depending on how many cores you have
fn main() {
    resp := http.get('https://hacker-news.firebaseio.com/v0/topstories.json') or {
        println('failed to fetch data from /v0/topstories.json')
        return
    }
    ids := json.decode([]int, resp.body) or {
        println('failed to decode topstories.json')
        return
    }#[0..10] // `#[..]` is a gated slice: take the first 10 ids without panicking on out-of-bounds
    mut fetcher_pool := pool.new_pool_processor(
        callback: worker_fetch
    )
    // Note: if you do not call set_max_jobs, the pool will try to use an optimal
    // number of threads, one per each core in your system, which in most
    // cases is what you want anyway... You can override the automatic choice
    // by setting the VJOBS environment variable too.
    // fetcher_pool.set_max_jobs( 4 )
    fetcher_pool.work_on_items(ids)
}

src: https://github.com/vlang/v/blob/master/examples/news_fetcher.v

docs: https://modules.vosca.dev/standard_library/sync/pool.html

tenxsoydev
-1

The best way to send that many GET requests is to use what is called a HEAD request. It relies on the status code rather than a response body, since it doesn't return one, which is what makes the HTTP requests faster.
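
For illustration, a minimal sketch of that idea in V, assuming your version of `net.http` exposes a `head` helper and a `status_code` field on the response (check your stdlib; otherwise use `http.fetch` with the method set to HEAD):

import net.http

fn main() {
    // A HEAD request returns only the status line and headers, no body,
    // which is what makes it cheaper than a full GET.
    resp := http.head('https://example.com') or {
        eprintln('request failed: ${err}')
        return
    }
    println('status: ${resp.status_code}')
}

Keep in mind this only helps when the status code is all you need, since no page content comes back.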

x3-