
I'm using Rust to download huge amounts of stock market data, around 50,000 GET requests per cycle. To make the process go significantly faster, I've been able to use multithreading. My code so far looks like this:

// Instantiate a channel so threads can send data to main thread
let (s, r) = channel();

// Vector to store all threads created
let mut threads = Vec::new();

// Iterate through every security in the universe
for security in universe {

    // Clone the sender
    let thread_send = s.clone();
            
    // Create a thread with a closure that makes 5 get requests for the current security
    let t = thread::spawn(move || {

        // Download the 5 price vectors and send everything in a tuple to the main thread
        let price_vectors = download_security(&security);
        let tuple = (security, price_vectors.0, price_vectors.1, price_vectors.2, price_vectors.3, price_vectors.4);
        thread_send.send(tuple).unwrap();

    });

    // PAUSE THE MAIN THREAD BECAUSE OF THE ERROR I'M GETTING
    thread::sleep(Duration::from_millis(20));

    // Add the new thread to the threads vector
    threads.push(t);

}

drop(s);

// Join all the threads together so the main thread waits for their completion
for t in threads {
    let _ = t.join();
}
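
(For completeness, the main thread would then just drain the channel along these lines — a sketch only, with the actual processing of the five price vectors omitted:)

// Receive everything the worker threads sent back; real code would store
// the five price vectors instead of just printing the ticker.
for (security, _minute, _hour, _day, _week, _month) in r {
    println!("finished {}", security);
}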

The download_security() function that each thread calls simply makes 5 GET requests to download price data (minutely, hourly, daily, weekly, monthly data). I'm using the ureq crate to make these requests. The download_security() function looks like this:

// Call minutely data and let thread sleep for arbitrary amount of time
let minute_text = ureq::get(&minute_link).call().unwrap().into_string().unwrap();
thread::sleep(Duration::from_millis(1000));

// Call hourly data and let thread sleep for arbitrary amount of time
let hour_text = ureq::get(&hour_link).call().unwrap().into_string().unwrap();
thread::sleep(Duration::from_millis(1000));

// Call daily data and let thread sleep for arbitrary amount of time
let day_text = ureq::get(&day_link).call().unwrap().into_string().unwrap();
thread::sleep(Duration::from_millis(1000));

// Call weekly data and let thread sleep for arbitrary amount of time
let week_text = ureq::get(&week_link).call().unwrap().into_string().unwrap();
thread::sleep(Duration::from_millis(1000));

// Call monthly data and let thread sleep for arbitrary amount of time
let month_text = ureq::get(&month_link).call().unwrap().into_string().unwrap();
thread::sleep(Duration::from_millis(1000));

Now, the reason I'm putting my threads to sleep throughout this code is that whenever I make too many HTTP requests too quickly, I get this strange error:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Transport(Transport { kind: Dns, message: None, url: Some(Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("api.polygon.io")), port: None, path: "/v2/aggs/ticker/SPHB/range/1/minute/2021-05-22/2021-10-22", query: Some("adjusted=true&sort=asc&limit=200&apiKey=wo_oZg8qxLYzwo3owc6mQ1EIOp7yCr0g"), fragment: None }), source: Some(Custom { kind: Uncategorized, error: "failed to lookup address information: nodename nor servname provided, or not known" }) })', src/main.rs:243:54

When I increase the amount of time that my main thread sleeps after creating a new subthread, OR the amount of time that my subthreads sleep after making each of the 5 GET requests, the number of these errors goes down. When the sleeps are too short, I'll see this error printed out for 90%+ of the securities I try to download. When the sleeps are longer, everything works perfectly, except that the process takes WAY too long. This is frustrating because I need this process to be as fast as possible, preferably <1 minute for all 10,000 securities.

I'm running macOS Big Sur on an M1 Mac Mini. Is there some kind of fundamental limit on my OS as to how many GET requests I can make per second?

Any help would be greatly appreciated.

Thank you!

JBL
  • Strange indeed. Looking up the relevant error message yields [this](https://github.com/denoland/deno/issues/8155) and [this](https://stackoverflow.com/a/57617585/2189130) from a variety of languages, indicating your problem is systemic. And even though it's a DNS error, some suggest it's a red herring and that the problem is actually the program hitting a resource limit. Maybe [increasing the open file limit](https://superuser.com/q/1634286/756751) can help – kmdreko Oct 27 '21 at 04:27
    IIUC you're spawning ten thousand threads, so you're trying to make up to 10000 simultaneous requests. Try using a thread pool instead to limit the number of simultaneous operations. – Jmb Oct 27 '21 at 06:37
  • I think the best solution here would actually be trying to reduce the number of requests. One idea is to see if you can batch multiple requests together; e.g. see if there's an API to get minute/hour/day/week/month data for a given security all in one request, or even get data for multiple securities at once. – Coder-256 Oct 28 '21 at 06:38
  • Better yet: probably the easiest/best change you can make is to use an [`Agent`](https://docs.rs/ureq/2.3.0/ureq/struct.Agent.html). I am not that familiar with `ureq`, but I think calling `ureq::get` will create a new agent, perform a new DNS lookup, and create a new HTTP connection for every single request; if you use a single `Agent` instead, it should be possible to reuse connections (which would almost certainly be a major improvement). – Coder-256 Oct 28 '21 at 06:42
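
(A rough sketch of what the last two comments suggest: a fixed number of worker threads, each handling a slice of the universe and reusing one shared `ureq::Agent`. The worker count, the chunking approach, and the extra `&Agent` parameter on `download_security` are assumptions for illustration, not part of the original code:)

// Sketch only: caps concurrency at NUM_WORKERS threads and reuses one
// Agent (and its connection pool) across every request.
const NUM_WORKERS: usize = 50;

let agent = ureq::AgentBuilder::new().build();
let (s, r) = channel();
let mut threads = Vec::new();

// Split the universe into NUM_WORKERS roughly equal chunks
let chunk_size = (universe.len() + NUM_WORKERS - 1) / NUM_WORKERS;
for chunk in universe.chunks(chunk_size.max(1)) {
    let chunk = chunk.to_vec();
    let agent = agent.clone();      // clones share the same connection pool
    let thread_send = s.clone();

    threads.push(thread::spawn(move || {
        for security in chunk {
            // Assumes download_security is changed to take the shared Agent
            // and call agent.get(...) instead of ureq::get(...)
            let price_vectors = download_security(&agent, &security);
            thread_send.send((security, price_vectors)).unwrap();
        }
    }));
}
drop(s);

// Collect results as they arrive, then wait for the workers to finish
for (security, _price_vectors) in r {
    println!("finished {}", security);
}
for t in threads {
    let _ = t.join();
}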

0 Answers