I am making an email scraper, and the overall flow is as follows:

Stage 1. Fetch all links from the URL
Stage 2. Scrape emails
Stage 3. Scrape links
Stage 4. If all links are processed, go to end_scene (which just asks me where I want to save the results, etc.)
Stage 4.1. If an interruption has happened, go to end_scene
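Roughly, in sketch form (the function names here are just placeholders, not my real code):

def run():
    # Stages 1-3: loop until the URL queue is empty
    while unprocessed_urls:
        url = unprocessed_urls.pop(0)
        response = fetch(url)        # Stage 1: fetch the page (placeholder)
        scrape_emails(response)      # Stage 2 (placeholder)
        scrape_links(response)       # Stage 3 (placeholder)
    end_scene()                      # Stage 4 (and 4.1 on interruption)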
The main action happens in Stage 2, inside the while len(unprocessed_urls) loop. There I have my logic for building the URLs, plus a try/except around the request for each URL's response; this is where the magic happens. Inside that block I can simply put an except KeyboardInterrupt and send control to my end_scene function.
Now the problem comes at Stage 3, where I scrape the emails and links: this part isn't inside any try/except block, so I can't really hook in an interrupt handler there, or at least I'm not sure how to without an abrupt stop.
The core problem is that there are moments where, if I press Ctrl+C, Python raises the default KeyboardInterrupt with a traceback and my cleanup code never runs.
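To make it concrete, here is a tiny standalone repro of the behavior (not from my scraper):

import time

for n in range(10):
    time.sleep(1)    # pressing Ctrl+C during the sleep...

print("cleanup")     # ...kills the program with a KeyboardInterrupt traceback,
                     # so this line never runs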
Here is the logic:
# process urls one by one from the unprocessed_urls queue until the queue is empty
while len(unprocessed_urls):
    ...  # URL processing
    try:
        # here is where the request is made
        response = requests.get(url, timeout=3)
        done = True
    except requests.exceptions.ConnectionError as e:
        print("\n[ERROR] Connection Error:")
        print(e)
        continue
    except requests.Timeout as e:
        print("\n[ERROR] Connection Timeout:")
        print(e)
        continue
    except requests.HTTPError as e:
        print("\n[ERROR] HTTP Error:")
        print(e)
        continue
    except requests.RequestException as e:
        print("\n[ERROR] General Error:")
        print(e)
        continue
    # check for CTRL+C interruption -- this works
    except KeyboardInterrupt:
        end_scene()

    # extract all email addresses and add them to the resulting set
    ...  # email extraction logic
    if len(new_emails) == 0:
        ...  # print no emails
    else:
        ...  # print emails found

    # create a BeautifulSoup object for the HTML document
    soup = BeautifulSoup(response.text, 'lxml')

    # once this document is parsed and processed, find and process all the anchors, i.e. linked URLs, in it
    for anchor in soup.find_all("a"):
        # extract the link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it is in neither the unprocessed nor the processed list
        if link not in unprocessed_urls and link not in processed_urls:
            unprocessed_urls.append(link)
So the question is: how can I structure my code so that, whenever a KeyboardInterrupt is raised, I can rest assured my cleanup code runs?
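What I am imagining is something like the sketch below, wrapping the entire processing loop in one outer handler so Ctrl+C is caught no matter where it happens (end_scene, unprocessed_urls, and the requests call are from my code above; the rest is placeholder), but I'm not sure if it's the right approach:

try:
    while len(unprocessed_urls):
        url = unprocessed_urls.pop(0)
        try:
            response = requests.get(url, timeout=3)
        except requests.RequestException as e:
            # covers ConnectionError, Timeout, and HTTPError too
            print("\n[ERROR]", e)
            continue
        ...  # email extraction and link scraping, now also covered by the outer handler
except KeyboardInterrupt:
    # a Ctrl+C anywhere inside the loop lands here instead of printing a traceback,
    # since KeyboardInterrupt is not caught by the requests except clauses
    pass
end_scene()  # runs both on normal completion and after an interruption

Or would registering a handler with signal.signal(signal.SIGINT, ...) be the cleaner way to do this?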