Protect Python web scraping code against crashes

Question

I developed a web scraper that goes through the profiles of a Facebook-like website (Lang-8) and saves the required data. However, I do not know how to make the code resume from the last profile it scanned if the PC crashes.

    import requests
    from bs4 import BeautifulSoup


    profile = 1
    while profile <= max_profiles:
        url = "http://lang-8.com/" + str(profile)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, features="html.parser")
        for lang in soup.findAll('dd', {'class':'studying_lang_name'}):
            lang1 = str(lang.string)
            if lang1 == "\n\nPolish\n":
                journal = str(url) + "/journals"
                open_article(journal)
        profile += 1

    def open_article(url2):
    in_page = 1
    while in_page < 5:
        source_code = requests.get(url2 + "?page=" + str(in_page))
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, features="html.parser")
        for link in soup.findAll('h3', {'class':'journal_title'}):
            href1 = str(link.find('a').get("href"))
            file_create(href1)
        in_page += 1

    def file_create(linked):
    source_code = requests.get(linked)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features="html.parser")
    for text in soup.findAll('li', {'class':'corrections_num'}):
        corrections = text.text
    for content in soup.findAll('div', {'id':'body_show_ori'}):
        text1 = content.text
    fout = open(linked[-1] + linked[-2] + linked[-3] + "_" + corrections +
                "_.txt", 'w', encoding='utf-8')
    fout.write(text1)
    fout.close()

Possible duplicate of [pause/resume a python script in middle](https://stackoverflow.com/questions/7180914/pause-resume-a-python-script-in-middle) — tripleee, Nov 08 '18 at 08:32

score 0 · Accepted Answer · answered Nov 07 '18 at 21:02


I would create and update a progress file as you complete a profile scrape.

Inside the while loop, right after the profile += 1 line, add something like:

    # Record the next profile to scan; the with block closes the file for us
    with open("progress.txt", "w") as fprogress:
        fprogress.write("%d" % profile)

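Then, at startup, replace the line that sets profile to 1 with:

    import os  # belongs with the other imports at the top of the script

    # Resume from the saved profile number if a progress file exists
    if os.path.isfile("progress.txt"):
        with open("progress.txt", "r") as fprogress:
            profile = int(fprogress.read())
    else:
        profile = 1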
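
One gap in a plain progress file is that a crash in the middle of the write itself can leave it empty or half-written, and int() would then fail on the next start. The sketch below is one possible way to guard against that, not part of the original answer: it writes the number to a temporary file and swaps it into place with os.replace, and it falls back to profile 1 if the file is missing or unreadable. The helper names load_progress, save_progress and run, and the scrape_profile callback, are hypothetical and exist only for illustration.

```python
import os

PROGRESS_FILE = "progress.txt"  # same file name as in the answer above


def load_progress(path=PROGRESS_FILE, default=1):
    """Return the next profile number to scrape, or `default` if no usable checkpoint exists."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # Missing, empty, or corrupted file: start from the beginning.
        return default


def save_progress(profile, path=PROGRESS_FILE):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(str(profile))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to disk before the rename
    os.replace(tmp_path, path)  # replaces the old checkpoint in a single step


def run(scrape_profile, max_profiles, path=PROGRESS_FILE):
    """Scrape profiles from the saved checkpoint up to max_profiles.

    scrape_profile is a hypothetical callback standing in for the body of
    the while loop in the question (fetch the profile, follow the journals).
    """
    profile = load_progress(path)
    while profile <= max_profiles:
        scrape_profile(profile)
        profile += 1
        save_progress(profile, path)  # stored value is the NEXT profile to scan
```

Because the checkpoint is saved only after a profile finishes, a profile that was interrupted halfway is scraped again on restart. With the question's file_create that should be harmless, since the output files are opened in 'w' mode and are simply overwritten.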

answered Nov 07 '18 at 21:02

Tom Chmielarz


Thank you so much. This seems to solve my problem. – Sarthak Rungta, Nov 09 '18 at 03:42
