I'd like to search text on a lot of websites at once. From what I understand (and I'm not sure I understand correctly), this code shouldn't work well:
from twisted.python.threadpool import ThreadPool
from txrequests import Session

pages = ['http://www.url1.com', 'http://www.url2.com']

with Session(pool=ThreadPool(maxthreads=10)) as sesh:
    for pagelink in pages:
        newresponse = sesh.get(pagelink)
        npages, text = text_from_html(newresponse.content)
My relevant functions are below (taken from this post). I don't think their exact contents (text extraction) are important, but I'll list them just in case.
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # skip text that sits inside tags the browser never renders
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # skip HTML comments
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    # return the page's links and its visible text
    return soup.find_all('a'), u" ".join(t.strip() for t in visible_texts)
If I was executing this sesh.get() without the extra functions, then from what I understand: a request is sent, and it might take a while for a response to come back. In the time it takes for that response to arrive, the other requests are sent, and some of them may even be answered before earlier requests are.
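To make that concrete, this is what I mean by only making requests - assuming I'm reading the txrequests API correctly, i.e. that sesh.get() hands the work to the thread pool and immediately returns a Twisted Deferred rather than a finished response:

from twisted.python.threadpool import ThreadPool
from txrequests import Session

pages = ['http://www.url1.com', 'http://www.url2.com']

with Session(pool=ThreadPool(maxthreads=10)) as sesh:
    # each call returns immediately; the HTTP work happens in the
    # thread pool, so every request is in flight at the same time
    deferreds = [sesh.get(pagelink) for pagelink in pages]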
If I was only making requests within my for loop, as in the sketch above, this would happen as planned. But if I put functions within the for loop, am I stopping the requests from being asynchronous? Wouldn't the functions have to wait for the response in order to be executed? How can I smoothly execute something like this?
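The only approach I can think of is attaching the parsing as a callback on each Deferred, so each page is parsed whenever its own response happens to arrive. Below is a sketch of what I have in mind, reusing text_from_html() from above; parse and scrape are my own names, and the DeferredList/reactor plumbing is modeled on the inlineCallbacks example in the txrequests README, so I may well be misusing it:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList, inlineCallbacks
from twisted.python.threadpool import ThreadPool
from txrequests import Session

pages = ['http://www.url1.com', 'http://www.url2.com']

def parse(response):
    # runs whenever this particular response arrives, while the
    # remaining requests are still being handled by the thread pool
    return text_from_html(response.content)

@inlineCallbacks
def scrape():
    with Session(pool=ThreadPool(maxthreads=10)) as sesh:
        # start every request up front; each Deferred fires with the
        # parsed result because parse() is chained onto it
        deferreds = [sesh.get(pagelink).addCallback(parse) for pagelink in pages]
        # wait until every page has been fetched and parsed
        results = yield DeferredList(deferreds)
    # DeferredList fires with a list of (success, value) pairs
    for success, value in results:
        if success:
            npages, text = value
            # do something with npages and text here
    reactor.stop()

reactor.callWhenRunning(scrape)
reactor.run()

The txrequests README also shows a background_callback keyword argument to get(), which seems to run processing in a worker thread instead of the reactor thread; maybe that's the cleaner route, but I'm not sure.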
I'd also be open to suggestions of different modules; I have no particular attachment to this one. I think it does what I need it to, but I realize it's not necessarily the only module that can.