
I have created a web bot that iterates over a website, e.g. example.com/?id=int where int is some integer. The function fetches the result as raw HTML using the requests library, then hands it to parseAndWrite to extract a div and save its value in an SQLite database:

import requests

def archive(initial_index, final_index):
    while True:
        try:
            for i in range(initial_index, final_index):
                res = requests.get('https://www.example.com/?id='+str(i))
                parseAndWrite(res.text)
                print(i, ' archived')

        except requests.exceptions.ConnectionError:
            print("[-] Connection lost. ")
            continue
        except: 
            exit(1)
        break 

archive(1, 10000)

My problem is that, after some time, the loop doesn't continue up to 10000 but restarts from a seemingly random value, resulting in many duplicate records in the database. What is causing this inconsistency?

ahmelq
  • At any connection error you restart from 1. – Daniel Feb 19 '20 at 19:12
  • @Daniel, ach! How couldn't I notice this stupid problem... – ahmelq Feb 19 '20 at 19:15
  • Put the `try/except` around the body of the `for` loop rather than around the whole loop. Then you can get rid of the `while True` loop. – Barmar Feb 19 '20 at 19:15
  • Unrelated, but you should probably just let the other exceptions propagate out of your function rather than exiting immediately. Let whoever called `archive` decide what to do with an unexpected error. – chepner Feb 19 '20 at 19:22

3 Answers


I think your two loops are nested in the wrong order. The outer while loop is supposed to retry any URL that causes a connection error, but you've put it outside the for loop that iterates over the URL numbers. That means you start over from the initial index whenever an error happens.

Try swapping the loops, and you'll only repeat one URL until it works:

import requests

def archive(initial_index, final_index):
    for i in range(initial_index, final_index):
        while True:
            try:
                res = requests.get('https://www.example.com/?id='+str(i))
                parseAndWrite(res.text)
                print(i, ' archived')

            except requests.exceptions.ConnectionError:
                print("[-] Connection lost. ")
                continue
            except: 
                exit(1)
            break 

archive(1, 10000)
Blckknght

A general rule for a try statement is to execute as little code as possible inside it: put only the code you expect to raise the error you want to catch in the try block; all other code goes before or after the statement.

Don't catch errors you don't know what to do with. Exiting the program is rarely the right thing to do; that will happen anyway if no one else catches the exception, so give your caller the chance to handle it.
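For example, the caller might do something like this (a sketch; whether to exit, retry, or log is a decision that belongs at the call site):

import sys

try:
    archive(1, 10000)
except Exception as exc:
    # Whatever error archive() couldn't handle ends up here
    print(f"[-] Archiving failed: {exc}", file=sys.stderr)
    sys.exit(1)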

And finally, don't build URLs yourself; let the requests library do that for you. The base URL is https://www.example.com/; the id parameter and its value can be passed via a dict to requests.get.

Your outer loop will iterate over the various parameters used to construct the URL; the inner loop will try the request until it succeeds. Once the inner loop terminates, then you can use the response to call parseAndWrite.

import requests

def archive(initial_index, final_index):
    base_url = 'https://www.example.com/'
    for i in range(initial_index, final_index + 1):
        while True:
            try:
                res = requests.get(base_url, params={'id': i})
            except requests.exceptions.ConnectionError:
                print("[-] Connection lost, trying again")
                continue
            else:
                break
        parseAndWrite(res.text)
        print('{} archived'.format(i))

archive(1, 10000)

You might also consider letting requests handle the retries for you. See "Can I set max_retries for requests.request?" for a start.
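A minimal sketch of that approach, using urllib3's Retry mounted on a requests.Session via an HTTPAdapter (the retry count and backoff factor here are illustrative, not tuned, and parseAndWrite is the function from your question):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def archive(initial_index, final_index):
    session = requests.Session()
    # Retry failed requests up to 5 times, backing off between attempts
    retries = Retry(total=5, backoff_factor=0.5)
    session.mount('https://', HTTPAdapter(max_retries=retries))

    for i in range(initial_index, final_index + 1):
        res = session.get('https://www.example.com/', params={'id': i})
        parseAndWrite(res.text)
        print('{} archived'.format(i))

With this, the manual while True/try/except loop disappears entirely; if the retries are exhausted, requests raises the exception for your caller to handle.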

chepner

If any connection error occurs, you restart at initial_index. Instead, you could retry the current index again and again, until the connection succeeds:

import requests

def archive(initial_index, final_index):
    for i in range(initial_index, final_index):
        while True:
            try:
                response = requests.get(f'https://www.example.com/?id={i}')
                parseAndWrite(response.text)
                print(f'{i} archived')
            except requests.exceptions.ConnectionError:
                print("[-] Connection lost. ")
            else:
                break 

archive(1, 10000)
Daniel