I am building an email scraper, and the pseudo-system is as follows:

Stage 1. Fetch all links from the URL
Stage 2. Scrape emails
Stage 3. Scrape links
Stage 4. If all links are processed, go to end_scene (which just asks me where I want to save them, etc.)
Stage 4.1. If an interruption has happened, go to end_scene

The main action is in stage 2. Inside while len(unprocessed_urls) I have the logic that builds the URLs, plus a try/except around the request for each URL. This is where the magic happens: here I can simply add an except KeyboardInterrupt and send it to my function.

Now the problem comes at stage 3, where I am scraping the emails. This part isn't in any try/except block, so I can't really implement an interrupt handler there, or at least I'm not sure how to without an abrupt stop.

The core problem is that there is a certain moment where, if I press Ctrl+C, the default KeyboardInterrupt error is thrown and my code is never run.

Here is the logic:

# process urls one by one from the unprocessed_urls queue until the queue is empty
while len(unprocessed_urls):

    # ...URL processing...

    try:
        # ...here the request is made...
        response = requests.get(url, timeout=3)
        done = True
    except requests.exceptions.ConnectionError as e:
        print("\n[ERROR]Connection Error:")
        print(e)
        continue
    except requests.Timeout as e:
        print("\n[ERROR]Connection Timeout:")
        print(e)
        continue
    except requests.HTTPError as e:
        print("\n[ERROR]HTTP Error:")
        print(e)
        continue
    except requests.RequestException as e:
        print("\n[ERROR]General Error:")
        print(e)
        continue
    # Check for CTRL+C interruption (...this works...)
    except KeyboardInterrupt:
        end_scene()

    # extract all email addresses and add them into the resulting set
    # ...email extraction logic...

    if len(new_emails) == 0:
        ...  # print no emails
    else:
        ...  # print emails found

    # create a Beautiful Soup object for the html document
    soup = BeautifulSoup(response.text, 'lxml')

    # once this document is parsed and processed, find and process all the anchors, i.e. linked urls, in this document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs.get("href", "")
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link

        # add the new url to the queue if it is in neither the unprocessed list nor the processed list yet
        if link not in unprocessed_urls and link not in processed_urls:
            unprocessed_urls.append(link)

So the question is: how can I structure my code so that whenever a KeyboardInterrupt occurs, I can be sure my own code runs?

EricTalv
  • wrap your code in a big try/except KeyboardInterrupt. – Jean-François Fabre Feb 15 '19 at 17:39
  • @Jean-FrançoisFabre That sounds strange. Can't the use of [signal](https://stackoverflow.com/questions/1112343/how-do-i-capture-sigint-in-python) be a valid option instead? – Torxed Feb 15 '19 at 17:42
  • I tried the big try/except before, but the problem with that is that any error inside will mess everything up, so it isn't reliable. @Torxed your suggestion worked flawlessly; signal is the way to go, at least it seems so right now – EricTalv Feb 15 '19 at 18:43
  • @Torxed not sure if this signal stuff is portable under Windows. Why not use exceptions? – Jean-François Fabre Feb 15 '19 at 20:09
  • @Jean-FrançoisFabre: It is: https://i.imgur.com/KFV2k6I.png (signal.pause() doesn't exist there though, so replace it with a while loop) – Torxed Feb 15 '19 at 20:19
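
For reference, the "big try/except" idea from the first comment could look roughly like the sketch below. It is only an illustration: run_scraper and process_one are hypothetical names, process_one stands for the request plus the email and link extraction of one URL, and OSError is used as a stand-in for the requests exceptions handled in the question. The outer block catches only KeyboardInterrupt, so other errors are not swallowed by it.

```python
def run_scraper(unprocessed_urls, process_one, end_scene):
    """Process every queued URL and always finish with end_scene()."""
    try:
        while unprocessed_urls:
            url = unprocessed_urls.pop(0)
            try:
                process_one(url)  # hypothetical: request + email/link extraction
            except OSError as e:
                # Per-URL failures are reported and skipped, as in the question.
                print("\n[ERROR]", e)
                continue
    except KeyboardInterrupt:
        # Ctrl+C anywhere inside the loop lands here, not only during the request.
        print("\n[INFO] Interrupted, stopping early")
    finally:
        # Runs after a normal finish, after Ctrl+C, and after an unexpected error.
        end_scene()
```

Because end_scene() sits in the finally clause, it also runs when an unexpected exception escapes; move it out of finally if that is not wanted.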

2 Answers


I feel like this might not be the right way, but you could try using a context manager:

import time
from contextlib import contextmanager

# build your keyboard interrupt listener
@contextmanager
def kb_listener(func):
    print('Hey want to listen to KeyboardInterrupt?')
    try:
        yield func
    except KeyboardInterrupt:
        print("Who's there?")
        interrupt()      # <--- what you actually want to do when KeyboardInterrupt

    # This might not be necessary for your code
    finally:             
        print('Keyboa^C')

# sample KeyboardInterrupt event
def interrupt():         
    print("KeyboardInterrupt.")

# sample layered function
def do_thing():          
    while True:
        print('Knock Knock')
        time.sleep(1)

with kb_listener(do_thing) as f:
    f()

Test output:

Hey want to listen to KeyboardInterrupt?
Knock Knock
Knock Knock
Knock Knock
Who's there?
KeyboardInterrupt.
Keyboa^C

At least this way you don't need to wrap your entire function in a try... except block.
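
Applied to a scraping loop, the same idea might look like the sketch below. The names on_interrupt, process_queue and process_one are made up for illustration, and process_one is assumed to do the request and the email/link extraction for a single URL. The with-block covers the whole loop, so Ctrl+C during any stage is caught, and end_scene() runs afterwards in both cases.

```python
from contextlib import contextmanager

@contextmanager
def on_interrupt(handler):
    """Call handler() when a KeyboardInterrupt escapes the with-block."""
    try:
        yield
    except KeyboardInterrupt:
        # Swallow the interrupt and run the handler instead of a traceback.
        handler()

def process_queue(unprocessed_urls, process_one, end_scene):
    """Drain the queue, then run end_scene() whether or not it was interrupted."""
    processed = []
    with on_interrupt(lambda: print("\nInterrupted, finishing up...")):
        while unprocessed_urls:
            url = unprocessed_urls.pop(0)
            process_one(url)       # hypothetical: request + email/link extraction
            processed.append(url)  # only reached if the URL was fully processed
    end_scene()
    return processed
```

Note that the URL being processed when Ctrl+C arrives has already been popped from the queue and is not retried in this sketch.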

r.ook
#!/usr/bin/env python
import signal
import sys

def signal_handler(sig, frame):
    print('You pressed Ctrl+C!')
    sys.exit(0)

# register the handler for SIGINT (Ctrl+C)
signal.signal(signal.SIGINT, signal_handler)
print('Press Ctrl+C')
signal.pause()  # wait for a signal (Unix only)

Taken from [How do I capture SIGINT in Python?](https://stackoverflow.com/questions/1112343/how-do-i-capture-sigint-in-python)

I would suggest you use and register the appropriate signal handler, at least if your main goal is to simply catch any user-/system-interrupts.

It's a nice way to clean up on any exit or interrupt.
It can also be used to handle shutdown events and the like if you're running your application as a service.
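
If you want to avoid an abrupt stop, one possible variation is to have the handler only set a flag and let the loop check it between URLs. The sketch below assumes that shape; drain_queue, request_stop and process_one are hypothetical names, and process_one stands for the request plus the email/link extraction of one URL. It does not use signal.pause(), and signal.signal() has to be called from the main thread.

```python
import signal

stop_requested = False  # set by the handler, read by the loop

def request_stop(sig, frame):
    """SIGINT handler: only record the request, the loop decides when to stop."""
    global stop_requested
    stop_requested = True

def drain_queue(unprocessed_urls, process_one, end_scene):
    """Process URLs until the queue is empty or Ctrl+C was pressed."""
    previous = signal.signal(signal.SIGINT, request_stop)
    try:
        while unprocessed_urls and not stop_requested:
            url = unprocessed_urls.pop(0)
            process_one(url)  # hypothetical: request + email/link extraction
    finally:
        # Put back whatever handler was installed before.
        signal.signal(signal.SIGINT, previous)
    end_scene()
```

The trade-off is that the loop only stops once the current URL is finished, so a slow request can delay the exit.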

Torxed