As far as I can tell, my code works absolutely fine, though it probably looks a bit rudimentary and crude to more experienced eyes.
Objective:
Create a 'filter' that loops through a (large) range of possible ID numbers. Each ID should be tried as a log-in at the url website. If the ID is valid, it should be saved to hit_list.
Issue:
In large loops, the programme 'hangs' for indefinite periods of time. Although I have no evidence (no exception is thrown), I suspect this is a timeout issue (or rather, it would be if a timeout were specified).
Question:
I want to add a timeout, and then handle the timeout exception, so that my programme stops hanging. If this theory is wrong, I would also like to hear what my issue might be.
How to add a timeout has been asked before (here and here), but after spending all weekend working on this, I'm still at a loss. Put bluntly, I don't understand those answers.
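From my reading of the requests docs, a per-request timeout looks roughly like this (a minimal sketch; the 10-second value is arbitrary):

import requests

try:
    # timeout=10 caps the connect and read phases at 10 seconds each;
    # without it, requests can wait on a dead connection indefinitely
    r = requests.get('https://ndber.seai.ie/pass/ber/search.aspx', timeout=10)
except requests.exceptions.Timeout:
    # raised if the server doesn't respond within the limit
    print('request timed out')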
What I've tried:
- Created a try & except block in the id_filter function. The try is at r = s.get(url) and the except is at the end of the function. I've read the requests docs in detail, here and here. This didn't work.
- The more I read about futures, the more I'm convinced that excepting errors has to be done in futures, rather than in requests (as I did above). So I tried inserting a timeout in the brackets after boss.map, but as far as I could tell this had no effect; it seems too simple anyway (see the sketch after this list).
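If I now understand the docs correctly, the timeout passed to boss.map only takes effect when the returned results iterator is consumed, and any exception raised inside id_filter also only surfaces at that point; since my code never iterates the results, both are silently discarded. A sketch of what consuming them might look like (the 60-second value is arbitrary):

import concurrent.futures as cf

with cf.ThreadPoolExecutor(max_workers=workers) as boss:
    results = boss.map(id_filter, jobs, timeout=60)
    try:
        # exceptions raised inside id_filter, and TimeoutError, only
        # surface here, while the results generator is being consumed
        for _ in results:
            pass
    except cf.TimeoutError:
        # the 60 seconds are measured from the call to map(), not per chunk
        print('results not ready within 60 seconds')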
So, to reiterate:
For large loops (50,000+), my programme tends to hang for an indefinite period of time (there is no exact point at which this starts, though it's usually after 90% of the loop has been processed). I don't know why, but I suspect that adding a timeout would throw an exception, which I could then handle. This theory may, however, be wrong. I have tried to add a timeout and handle other errors in the requests part, but to no effect.
- Python 3.5
My code:
import concurrent.futures as cf
import time

import requests
from bs4 import BeautifulSoup

hit_list = []
processed_list = []
startrange = 100050000
end_range = 100150000
loop_size = range(startrange, end_range)
workers = 70
chunks = 300
url = 'https://ndber.seai.ie/pass/ber/search.aspx'

def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'FOR MORE INFORMATION ABOUT THIS DATA COLLECTION PLEASE CONTACT ########'
        })
        # fetch the search page once per chunk to pick up the ASP.NET
        # hidden form fields needed for the POSTs below
        r = s.get(url)
        time.sleep(.1)
        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate = soup.find('input', {'name': '__VIEWSTATE'}).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation = soup.find('input', {'name': '__EVENTVALIDATION'}).get('value')
        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': viewstategen,
                '__EVENTVALIDATION': validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                pass  # print('Invalid ID', ber)
            else:
                # list.append is atomic in CPython, so appending from
                # several worker threads is safe here
                hit_list.append(ber)
                print('Valid ID', ber)

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        boss.map(id_filter, jobs)
        # record data below
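And, in case my timeout theory is right, what I think the inner loop of id_filter would become (a sketch; the 5-second value is arbitrary, and skipping a failed ID rather than retrying is just one possible choice):

for ber in _range:
    data = {
        'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
        'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': viewstategen,
        '__EVENTVALIDATION': validation,
    }
    try:
        # timeout=5 caps the connect and read phases at 5 seconds each,
        # so a dead connection raises instead of hanging the thread
        y = s.post(url, data=data, timeout=5)
    except requests.exceptions.RequestException as e:
        # Timeout is a subclass of RequestException, so this catches
        # timeouts as well as connection errors
        print('Request failed for ID', ber, e)
        continue  # give up on this ID and move on
    if 'No results found' not in y.text:
        hit_list.append(ber)
        print('Valid ID', ber)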