
I'd like to search the text of many websites at once. As far as I understand (and I'm not sure I understand correctly), this code shouldn't work well.

from twisted.python.threadpool import ThreadPool
from txrequests import Session

pages = ['www.url1.com', 'www.url2.com']
with Session(pool=ThreadPool(maxthreads=10)) as sesh:
   for pagelink in pages:
       newresponse = sesh.get(pagelink)
       npages, text = text_from_html(newresponse.content)

My relevant functions are below (from this post). I don't think their exact contents (text extraction) are important, but I'll list them just in case.

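from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text inside tags that are never rendered on the page.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    # Return all links on the page plus the visible text joined by spaces.
    return soup.find_all('a'), " ".join(t.strip() for t in visible_texts)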

If I were calling sesh.get() without the extra functions, my understanding is this: a request is sent, and it may take a while for the response to come back. While waiting for that response, other requests are sent, and some may be answered before earlier ones.

If I were only making requests inside my for loop, this would happen as planned. But if I call functions inside the for loop, am I stopping the requests from being asynchronous? Wouldn't the functions have to wait for the response before they can run? How can I execute something like this smoothly?

I'd also be open to suggestions of different modules. I have no particular attachment to this one: I think it does what I need, but I realize it's not necessarily the only module that can.

Estif
  • Just brief overview of the module, it looks like it'll offload the requests to the threadpool, so it will be concurrent, but threads will be managed by the OS, so when the switching happens, you won't know. – NinjaKitty Jul 23 '19 at 22:56
  • I'm sorry I don't understand the vocab of this sphere of coding very well. I get that the module will manage the requests and their threading, but what of the function if it's included in the loop? I just find it confusing because python is supposed to be very linear. – Estif Jul 23 '19 at 23:10
  • Python is linear because it uses 1 thread to execute code. However, because txrequests uses the ThreadPool you created from twisted, you now have a pool of other threads that txrequests will offload the requests to. So your for loop will probably run on the main thread, while your get requests will happen on other threads via the ThreadPool that you have. This is my best guess/explanation. I don't have personal experience with these libraries, but that's what it looks like to me at first glance [cont] – NinjaKitty Jul 24 '19 at 00:03
  • Taking another look, it might actually not do what I initially stated. It looks like you need to use twisted's `defer.inlineCallbacks` decorator to make it asynchronous like I was describing before. There are multiple ways to get asynchronous behavior. I personally use asyncio + aiohttp to do asynchronous web requests, but there's also threading + requests, and a lot of other combinations. threading + requests might be the easiest to understand and reason about. On a different note, it looks like txrequests was forked from requests-futures, and txrequests isn't updated anymore. – NinjaKitty Jul 24 '19 at 00:09
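
To make the "threads + blocking requests" idea from the comments concrete, here is a minimal sketch using only the standard library. It is not the txrequests API and not necessarily what the commenter had in mind. The names `fetch`, `fetch_and_process`, and `process_all` are hypothetical, and `urllib.request.urlopen` stands in for whatever HTTP client you prefer. The point it tries to show: if the processing function runs inside the same worker as the request, a slow page only holds up its own worker, and the loop that submits the work never waits on a response.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes (a blocking call)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_and_process(url, fetcher, processor):
    """Runs inside a worker thread: fetch one page, then process it."""
    return processor(fetcher(url))


def process_all(urls, fetcher=fetch, processor=len, max_workers=10):
    """Fetch and process every URL concurrently; return {url: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first, so no request waits on another page's processing.
        futures = {
            pool.submit(fetch_and_process, url, fetcher, processor): url
            for url in urls
        }
        # Collect results in completion order, not submission order.
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed page should not stop the rest
                results[url] = exc
    return results
```

With the question's helpers, this would presumably be called as `process_all(pages, processor=text_from_html)`, where each entry in `pages` is a full URL including the scheme (for example `http://...`). Note that threads mainly overlap the time spent waiting on the network; the HTML parsing itself still takes turns on the interpreter.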
