I have code that needs to make a number of HTTP requests (let's say 1000). So far I have approached it in 3 ways, each with 50 HTTP requests. The results and code are below.

The fastest is the threaded approach; the issue is that I lose some data (from what I understood, due to the GIL). My questions are the following:

  1. My understanding is that the correct approach in this case is to use multiprocessing. Is there any way I can improve the speed of that approach? Matching the threading time would be great.

  2. I would guess that the more links I have, the more time the serial and threaded approaches will take, while the multiprocessing approach's time would increase much more slowly. Do you have any source that would allow me to estimate the time it would take to run the code with n links?

Serial - Time To Run around 10 seconds

import queue
import requests

def get_data(link, **kwargs):
    # Fetch one link; if a Queue was passed in, push the response onto it,
    # otherwise return the response object directly
    data = requests.get(link)
    if "queue" in kwargs and isinstance(kwargs["queue"], queue.Queue):
        kwargs["queue"].put(data)
    else:
        return data

links = [link_1, link_2, ..., link_n]
matrix = []
for link in links:
    matrix.append(get_data(link))

Threads - Time To Run around 0.8 seconds

import threading

def get_data_thread(links):
    q = queue.Queue()
    for link in links:
        data = threading.Thread(target = get_data, args = (link, ), kwargs = {"queue" : q})
        data.start()
    data.join()
    return q

matrix = []
q = get_data_thread(links)
while not q.empty():
    matrix.append(q.get())

Multiprocessing - Time To Run around 5 seconds

import multiprocessing as mp

def get_data_pool(links):
    # Distribute the links across a pool of worker processes
    with mp.Pool() as p:
        data = p.map(get_data, links)
    return data

if __name__ == "__main__":
    matrix = get_data_pool(links)
Lorenzo
  • I'm sort of skeptical of these tests. The "threads" version really just runs a new thread for _each_ link to completion before spawning the next, so it's really a serial test with the overhead of starting/joining a thread. I have a hard time believing that it's more than an order of magnitude faster than the serial version. – bnaecker Oct 28 '20 at 18:41
  • @bnaecker, I don't think it is, because in the time between sending the request and receiving the response another thread can run. The issue is that (from what I read online) multiple threads can try to access the queue at the same time, causing loss of data. – Lorenzo Oct 28 '20 at 19:05
  • No, another thread _cannot_ run. You call `data.start()` and then `data.join()` immediately after. That second call _blocks_ until the referenced thread completes (i.e., `get_data` returns). You are running at most one thread at a time. And no, multiple threads cannot access the queue simultaneously. The standard library `queue.Queue` object is thread-safe by design. If you are losing data it is due to some other bug in your code, such as not correctly handling errors. – bnaecker Oct 28 '20 at 19:09
  • Sorry, I edited the code; I pasted it here incorrectly, the `.join()` is outside the loop. To me the issue is the same as in this question: https://stackoverflow.com/questions/11464750/python-multithreading-missing-data. Moreover, if you look at this tutorial at minute 8:00, it seems to me the same implementation as mine: https://www.youtube.com/watch?v=cdPZ1pJACMI. I tried to run the code many times. Serial takes 10-12 seconds, threaded takes 0.8-1.2 seconds but misses 10%-20% of the values. Not sure why at this point, if you say my interpretation is incorrect. – Lorenzo Oct 28 '20 at 19:21
  • In your threading example you're overwriting `data` with a new value every time round your for loop, so each earlier thread is orphaned, leaving you with a reference to only the last one. Append them to a list in the for loop; then you can later iterate the list, `join`ing them. – DisappointedByUnaccountableMod Oct 28 '20 at 19:30
  • _That_ makes more sense :). But you still have a problem, because you do not join on _all_ threads. You'd need to collect all thread handles into a list, then join on all of them. You could miss values later if one thread takes a long time to collect its results. The queue would appear empty because some thread has not yet pushed its results. (A sketch of this fix follows these comments.) – bnaecker Oct 28 '20 at 19:30
  • There are loads of examples of web scraping using python and e.g. asyncio and threads and multiprocessing - you’ll find them if you do even a little bit of searching – DisappointedByUnaccountableMod Oct 28 '20 at 19:36
  • Thanks guys, understood the issue better now. I'll see if I can find a solution to keep it quick. I'll look into async as well. – Lorenzo Oct 28 '20 at 19:45
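
A minimal sketch of the fix described in the comments above (keep a handle to every thread, join them all, and only then drain the queue), reusing `get_data` from the question:

import queue
import threading

def get_data_thread(links):
    q = queue.Queue()
    threads = []
    for link in links:
        t = threading.Thread(target = get_data, args = (link, ), kwargs = {"queue" : q})
        t.start()
        threads.append(t)
    # Join every thread, not just the last one, so all results are in the queue
    for t in threads:
        t.join()
    return q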

1 Answer


If I were to suggest anything, I would go with AIOHTTP. A sketch of the code:

import aiohttp
import asyncio

links = [link_1, link_2, ..., link_n]

async def get_link(session, link):
    # Both the GET and the body read yield to the event loop,
    # so many requests can be in flight at once
    async with session.get(link) as resp:
        return await resp.text()

async def main(links):
    # One shared session; gather() runs all the requests concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(get_link(session, link) for link in links))

if __name__ == "__main__":
    matrix = asyncio.run(main(links))
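
Note that awaiting the event loop once per link would still run the requests one after another; `asyncio.gather` is what lets the event loop interleave all of them over the one shared `ClientSession`. Here `resp.text()` returns the body as a string; use `resp.read()` if you need the raw bytes.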
sophros
  • Didn't know this approach, which is probably better. For whoever needs it, here are the advantages of async over threads: https://stackoverflow.com/questions/4024056/threads-vs-async – Lorenzo Oct 28 '20 at 21:40
  • 1
    @Lorenzo thanks for the links, it really helped me undertand the differences. – Jeff C May 17 '21 at 21:28