
As far as I can tell, my code works absolutely fine, though it probably looks a bit rudimentary and crude to more experienced eyes.

Objective:

Create a 'filter' that loops through a (large) range of possible ID numbers. Each ID should attempt to log in at the website at url. If the ID is valid, it should be saved to hit_list.

Issue:

In large loops, the programme 'hangs' for indefinite periods of time. Although I have no evidence (no exception is thrown), I suspect this is a timeout issue (or rather, would be if a timeout were specified).

Question:

I want to add a timeout, and then handle the timeout exception so that my programme will stop hanging. If this theory is wrong, I would also like to hear what my issue might be.

How to add a timeout is a question that has been asked before: here and here, but after spending all weekend working on this, I'm still at a loss. Put bluntly, I don't understand those answers.

What I've tried:

  1. Creating a try & except block in the id_filter function. The try is at r = s.get(url) and the except is at the end of the function. I've read the requests docs in detail, here and here. This didn't work.
  2. The more I read about futures, the more I'm convinced that catching errors has to be done in the futures layer, rather than in requests (as I did above). So I tried inserting a timeout in the brackets after boss.map, but as far as I could tell this had no effect; it seems too simple anyway.
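For concreteness, here is my current understanding of what a requests-level timeout should look like. This is only a sketch; fetch_with_timeout is a made-up helper name, and the URL in the usage note is a placeholder:

```python
import requests

def fetch_with_timeout(session, url, seconds=10):
    """Return the response, or None if the request times out or fails.

    Note: `timeout` bounds the connect and read phases of a single
    request; it is NOT a cap on the total download time.
    """
    try:
        # requests also accepts a (connect_timeout, read_timeout) tuple here
        return session.get(url, timeout=seconds)
    except requests.exceptions.Timeout:
        print('request timed out:', url)
        return None
    except requests.exceptions.RequestException as e:
        print('request failed:', e)
        return None
```

As I understand it, without the `timeout` argument requests can wait on the server forever, which would match the hanging I'm seeing.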

So, to reiterate:

For large loops (50,000+) my programme tends to hang for an indefinite period of time (there is no exact point at which this starts, though it's usually after about 90% of the loop has been processed). I don't know why, but I suspect that adding a timeout would throw an exception, which I could then catch. This theory may, however, be wrong. I have tried to add a timeout and handle other errors in the requests part, but to no effect.
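One thing I have since read that might explain the silence: Executor.map stores a worker's exception on its future and only re-raises it when that result is consumed, so a crashed worker can look exactly like a hang. A sketch of the submit / as_completed pattern that surfaces such errors; job and run_all are stand-ins for my real id_filter and chunking, not working code for the site:

```python
import concurrent.futures as cf
import time

def job(n):
    # Stand-in for id_filter: deliberately fails on one input
    # to show where the exception surfaces.
    if n == 3:
        raise ValueError('bad chunk %d' % n)
    time.sleep(0.01)
    return n * 2

def run_all(inputs, workers=4, per_job_timeout=5):
    results, errors = [], []
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        futures = {boss.submit(job, n): n for n in inputs}
        # as_completed(..., timeout=...) raises concurrent.futures.TimeoutError
        # if the whole batch overruns the limit, instead of waiting forever.
        for fut in cf.as_completed(futures, timeout=per_job_timeout * len(futures)):
            n = futures[fut]
            try:
                results.append(fut.result())  # re-raises the worker's exception
            except Exception as e:
                errors.append((n, e))  # surfaces here instead of hanging silently
    return results, errors
```

With boss.map as in my code below, the exceptions would stay buried because I never iterate over the returned results.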

Python 3.5

My code:

import concurrent.futures as cf
import time
import requests
from bs4 import BeautifulSoup

hit_list = []
processed_list = []
start_range = 100050000
end_range = 100150000
loop_size = range(start_range, end_range)
workers = 70
chunks = 300

url = 'https://ndber.seai.ie/pass/ber/search.aspx'

def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'FOR MORE INFORMATION ABOUT THIS DATA COLLECTION PLEASE CONTACT ########'
        })

        r = s.get(url)
        time.sleep(.1)

        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate    = soup.find('input', {'name': '__VIEWSTATE'         }).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation   = soup.find('input', {'name': '__EVENTVALIDATION'   }).get('value')

        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE'                                         : viewstate,
                '__VIEWSTATEGENERATOR'                                : viewstategen,
                '__EVENTVALIDATION'                                   : validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                pass  # invalid ID
            else:
                hit_list.append(ber)
                print('Valid ID', ber)


if __name__ == '__main__':

    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        boss.map(id_filter, jobs)

#record data below

  • you shouldn't say: *"my code works absolutely fine"* unless it is your goal that the code *"hang for an indefinite period of time"*. There are too many moving parts in your code. Try to create a minimal code example (remove `requests`, `bs4`, `xlsxwriter` code and fake `id_filter` (raise an exception in it, or run an infinite loop, or just call `time.sleep()` and see what happens). [mcve] – jfs May 31 '16 at 08:23
  • ok, thanks for the pointer. This stuff boggles my mind, but I've pared my code down to a minimum and am trying to replicate the problem. (annoyingly it'll likely take a while as the problem usually only occurs at the end of very large loops) – SeánMcK May 31 '16 at 10:33

0 Answers