Python Crawler - need help with my algorithm

Question

** Added a summary of the problem at the end of the post **

I've written a crawler that fetches and parses URLs.

in the first version, in order to get the next valid page I was increasing the URL ID and saved non-valid IDs to a file, the valid URLs were moved to a parser that parsed the content I need, after a while I noticed that most valid IDs has a returning subtrahend.

I made some statistics and got to that list of subtrahends - [8,18,7,17,6,16,5,15], ordered by the most repeating to the least.

so I changed my code to -

def checkNextID(ID):
    numOfRuns = 0
    global curRes, lastResult
    while ID < lastResult:
        try:
            numOfRuns += 1
            if numOfRuns % 10 == 0:
                time.sleep(7) # sleep every 10 iterations
                numOfRuns = 0
            if isValid(ID + 8): 
                parseHTML(curRes, ID)
                ID = ID + 8
            elif isValid(ID + 18):
                parseHTML(curRes, ID)
                ID = ID + 18
            elif isValid(ID + 7):
                parseHTML(curRes, ID)
                ID = ID + 7
            elif isValid(ID + 17):
                parseHTML(curRes, ID)
                ID = ID + 17
            elif isValid(ID+6):
                parseHTML(curRes, ID)
                ID = ID + 6
            elif isValid(ID + 16):
                parseHTML(curRes, ID)
                ID = ID + 16
            elif isValid(ID + 5):
                parseHTML(curRes, ID)
                ID = ID + 5
            elif isValid(ID + 15):
                parseHTML(curRes, ID)
                ID = ID + 15
            else:
                if isValid(ID + 1):
                    parseHTML(curRes, ID)
                ID = ID + 1
        except Exception, e:
            print "something went wrong: " + str(e)

isValid() is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need and saves a soup object of the url to a global variable named - 'curRes', it returns False if the url doesn't contain the data I need and saves the id to 'badIdFile'.

parseHTML is a function that gets the soup object (curRes), parses the data I need and then saves the data to a csv, then returns True.

in a perfect world that piece of code would be everything I need in order to run on all valid IDs (there are about 400K in a range of 5M), it gave me better results in less time (x50 faster).

BUT, when getting to ranges that doesn't contain any valid URL, my code is very inefficient, it turns that I'm crawling mostly the same URLs over and over in each iteration, this is because I'm increasing the ID by one in order to keep progressing until I'll find the next valid URL and then I'm checking the ID + 8, then 18, 17 etc' which sometimes gives me the same URLs I checked in the previous iteration.

so I went and changed the code so it will keep a a set of non-valid URLs which I'll avoid checking again and I can't get it to work, I'm breaking my head for a few hours now, it's not working like it should.

this is my new function -

def checkNextID(ID):
    runs = 0
    badTries = set()
    diff = [8,18,7,17,6,16,5,15,1]
    global curRes, lastResult
    while ID < lastResult:
        if len(badTries) == 100: #every 100 non-valid IDs I can reset badTries(I think), if I've increased the ID 100 times it means I won't be checking previous IDs and I can start saving the non-valid IDs again.
            badTries = set()
        try:
            for i in diff:
                tryID = ID + i #tryID will be the ID + one of the subtrahends.
                if tryID not in badTries: #if we haven't already tried that ID
                    if runs % 10 == 0:
                        time.sleep(6)  #sleep every 10 requests
                    if isValid(tryID):
                        runs += 1
                        parseHTML(curRes, ID)
                        ID = tryID
                        badTries = set() #I can reset the badTries now, I'm moving forward.
                        break #will move to the next id.
                    else:
                        badTries.add(ID) #save the non-valid ID to the set
                        if i == 1: #it's the last iteration of the for loop, if the subtrahend is 1 and it's not valid I must increase the ID by 1 in order to go forward.
                            ID += 1
                else:
                    ID += 1 #if the ID is not a valid ID and I've already checked it I must increase the ID by one or I'll get into an infinite loop.
        except Exception, e:
            print "something went wrong: " + str(e) + ' ID - ' + str(ID)

I'm saving each non-valid ID to a set, and before every call to isValid() I'm checking if I've already tried that ID, if I didn't, I'm calling isValid(), else, ID is increased by one.

that's how the bad ID file looks like -

as you can see, it's not working, I know I have a flaw in the whole design but I can't find it, I would really appreciate your help.

a summary of the problem -

I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the next id which is - id + diff[i] (i=0) if (id + diff[i]) isn't a valid id I'm checking the next id which is (id + diff[i+1]).

if there isn't a valid id on that iteration (id + diff[i..n]) I'm increasing id by 1, and check if id+1 is valid id, if not I'm checking again with id + diff[i..n], until I'll find the next valid id.

in each iteration I'm checking ids I've already checked in the previous iteration (which costs a lot of time and is inefficient), I need to avoid checking the IDs I've already checked and keep increasing the ID until I'll find the next valid ID.

as for now, if the id = 1 (and it's a valid id) and diff = [8,18,7,17,6,16,5,15].

first iteration will look like (I'm marking with bold the id's which I could avoid checking) - first - id = 1

9, 19, 8, 18, 7, 17, 6, 16, 2

second - id = 2

10, 20, 9, 19, 8, 18, 7, 17, 3

third - id = 3

11, 21, 10, 20, 9, 19, 8, 18, 4

fourth - id = 4

12, 22 - BINGO, next valid ID is 22!

that cost me 29 requests, instead of - 17, and that's a small example, I have ranges that are 300-600 ids from the last valid id.

I can't get my code to avoid checking previously checked ids with a smart and efficient way.

Thanks!

eyquem · Answer 1 · 2011-08-01T10:44:53.137

Stating an identifier as being global is done when one wants to provoke a modification concerning it when it is outside the function from which the modification is performed.

Hence it is an aberration to make lastResult and curRes global:

the first, lastResult, because it is a constant in your complete code. The best is to define a parameter lastResult of the function checkNextID() with the value of lastResult as default argument.
the second, curRes, because there is no modification concerning this identifier in checkNextID()

Now, defining curRes as global in the function isValid() is also a bad practice, though not an aberration: 1) a new value for curRes is sent from the inside of isValid() to the outside of it; 2) then, the program goes outside the function checkNextID() to search for the value of curRes. That's a weird and useless detour, you could let curRes be a free variable (see doc ) in the function checkNextID() and this one will automatically go outside to resolve this name and to obtain its value.

.

Personally , I prefer to reorganise the general algorithm. In my following code, curRes is defined as a local object, taking directly its value from the return of the function isValid() . That requires to redefine isValid(): in my code isValid() returns the object soup or False

I hope I understood your need. Say me what's wrong in my approach, please.

def checkNextID(ID, lastResult = lastResult, diff = [0,1,5,6,7,8,15,16,17,18]):
    runs = 0
    maxdiff = max(diff)
    diff.extend(x for x in xrange(maxdiff) if x not in diff)
    while True:
        for i in diff:
            if ID+i==lastResult:  break
            runs += 1
            if runs % 10 == 0:  time.sleep(6)
            curRes = isValid(ID+i):
            if cuRes:
                parseHTML(curRes, ID+i)
                ID = ID + i
                break
        else:
            runs += 1
            ID += maxdiff + 1
            if ID==lastResult:  break






def isValid(ID, urlhead = urlPath):
    # this function return either False OR a BeautifulSoup instance
    try:
        page = getPAGE(urlhead + str(ID))
        if page == False:  return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) +' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) +' address: ' + address
            return False
        try:
            companyID = soup.find('span',id='lblCompanyNumber').string
            if (companyID == None): #if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                return soup #we have the data we need, save the soup obj to a global variable
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

.

Also, to speed up your program:

you should use regex tool (module re) instead of BeautifulSoup that is roughly 10 times slower than the use of a regex
you shouldn't define and use all these functions in checkNextID (saveToCSV, parseHTML, isValid) : each call to a function takes an additional amount of time comparatively to a direct code

.

Final Edit

To conclude this long study of your problem, I did a benchmark. Here after follow the codes and the results that show that my intuition is borne out: my code #2 takes at least 20 % less time to run than your code #1 . Your code #1:

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    SEEN = set()
    li = []

    while True:
        if ID>lastResult:
            break
        if ID in SEEN:
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        ID += 1
                        break
            else:
                ID += 1

    return li


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)

My code #2

from time import clock

lastResult = 200

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]

    li = []

    while True:
        if ID>lastResult:
            break
        else:
            curRes = isValid(ID)
            if curRes:
                li.append(ID)
                while True:
                    for i in diff:
                        curRes = isValid(ID+i)
                        if curRes:
                            li.append(ID+i)
                            ID += i
                            break
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                ID += j
                                break
                            curRes = isValid(ID+j)
                            if curRes:
                                li.append(ID+j)
                                ID += j
                                break

                        if j==lastothers:
                            ID += maxdiff + 1
                            break
                        elif ID>lastResult:
                            break

            else:
                ID += 1

    return li




def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98,114,129,137,154,166,175,180,184,200)):
    return ID in valid_ones

te = clock()
for i in xrange(10000):
    checkNextID(0)
print clock()-te,'seconds'
print checkNextID(0)

Results:

your code
0.398804596674 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

my code
0.268061164198 seconds
[1, 9, 17, 25, 30, 50, 52, 60, 83, 98, 114, 129, 137, 154, 166, 184, 200]

0.268061164198 / 0.398804596674 = 67.3 %

I've tried also with lastResult = 100 , I got 72 % .
And with lastResult = 480, I got 80 %.

Hi, I can understand the approach of returning the soup object, its smarter! — YSY, Jul 31 '11 at 10:07
But I dont think that CheckNextID() is right, firstly theres no need to call IsValid() twice, its sending the same GET req' over, Second, I think that you missed my point, you're creating a whole list in diff and run on it? You shold set the next id to check by the output of IsValid(), first you'll try +8 which is the most returning diff and then you need to move on if you wont find any id, you'll increase the id by 1, in your code you always check that the next id isnt equal to the last one, but in my case you need to keep a record for all the last tries which were not valid so you'll skip — YSY, Jul 31 '11 at 10:16
the ids you've tried, There's a reason why i chose to sort the diffs like i did, if i have a valid id, the chanses that the next id is +8 are very high, if not it should be one of (+) 18, 7, 17, 6, 16, 5,1 if i still dont find the next id i need to add +1 and then check again with the diff list (now i'll have some ids which i already tried and would want to skip and increase by 1 and check again. In your solution i'm running on the ids from the smallest to the biggest, if i dont find an id i increase by 18, if its the last result i tried, break, but it can be a result i tried a few reqeusts — YSY, Jul 31 '11 at 10:26
Back.. and then i'l check it again.. Or did i misunderstand you? (writing and reading from iphone) — YSY, Jul 31 '11 at 10:27
Ok, I misunderstood you, you're checking if next id = lastResult, break, which means if we got to the end id, btw, I think you're solution will never end, you are breaking the for loop and not the while true, and you'll send a lot of unnecessary requests, by trying - 1,2,3,4,9,10,11,12,13,14, in my case I know that the diffrences between 2 ids will be [8,18,7,17,6,16,5,15] if not then its a "dead range" and I need to find the next valid id, so i'll add 1, and then try again with the list, if i javent found the next one i'll add 1 again and do the check, in most cased i'm not in a dead range.. — YSY, Jul 31 '11 at 10:58
In non dead ranges almost all diffs are +8 or +7 or + 18... and almost every reqeust is a valid one.. Untill i get to a dead range and that logic causes a lot of duplicate requests (adding the diff + id and inrease the id with 1 each time will give similar results slmost every time, like i showed in my example), Thanks! — YSY, Jul 31 '11 at 11:03
@YSY You're right: **isValid()** must be called only one time, it was indeed my intent, the second call is an error of writing. I corrected this point in my answer. - Concerning the algorithm, I'm not sure that you understood my code, and I'm not sure to have perfectly understood your intent. I'll try to write a test to clear the matter. By the way, it is useless to repeat again and again the same things in different comments, questions and threads; you must explain them differently — eyquem, Jul 31 '11 at 14:38
the algorithm should take care good and dead ranges, it should run on an id + the diff's items, if there isn't a valid id, it should increase the id by 1, in case the id wasn't valid I need to think if I want to rerun on the new id with the diff's list while avoiding checking previously checked IDs, or should I keep increasing the id and check if it is a valid ID, just if it's a valid ID i will return to check the next ID with the diffs list, what is your opinion? — YSY, Jul 31 '11 at 15:00
this is an example, valid IDs - 1,9,17,25. now lets say the next valid id is 50, if I'll increase by the items of the diff list ([8,18,7,17,6,16,5,15]) I will not get the next valid ID, so I would increase 25 by 1, and then I'll check if 26 is valid, is not, I'll keep increasing until I'll get a valid id, if 26 is a valid id I will check the next valid id with 26 and the items of the diff list, in the case that 50 is the next valid id I will keep increasing by 1 and check whether it's a valid id, until I'll get to 50 while avoid checking previously checked IDs, when I'll hit 50 I'll start with — YSY, Jul 31 '11 at 15:08
Thank you so much! I've used your 2# solution in the end, already been used in my program! I learned a lot from your solutions and thinking, I wish I could add you some more reputation :) Thanks Man! — YSY, Aug 01 '11 at 14:17
@YSY Thank you. Don't forget that if you use a regex, and if you condense code to eliminate some functions, you will gain additional speed. — eyquem, Aug 01 '11 at 14:43

eyquem · Accepted Answer · 2011-08-01T10:40:28.580

I think I've got it.

First, the code based on your idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '--------------------------\nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '--------------------------\nID==%s  not valid' % ID
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

--------------------------
ID==0  not valid
--------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
--------------------------
ID==26  not valid
--------------------------
ID==27  not valid
--------------------------
ID==28  not valid
--------------------------
ID==29  not valid
-----------------
ID==30  already seen, not examined
-----------------
ID==31  already seen, not examined
-----------------
ID==32  already seen, not examined
-----------------
ID==33  already seen, not examined
--------------------------
ID==34  not valid
--------------------------
ID==35  not valid
--------------------------
ID==36  not valid
--------------------------
ID==37  not valid
--------------------------
ID==38  not valid
--------------------------
ID==39  not valid
-----------------
ID==40  already seen, not examined
-----------------
ID==41  already seen, not examined
-----------------
ID==42  already seen, not examined
-----------------
ID==43  already seen, not examined
--------------------------
ID==44  not valid
--------------------------
ID==45  not valid
--------------------------
ID==46  not valid
--------------------------
ID==47  not valid
--------------------------
ID==48  not valid
--------------------------
ID==49  not valid
--------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
--------------------------
ID==51  not valid
--------------------------
ID==52  vaaaalid
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
--------------------------
ID==61  not valid
--------------------------
ID==62  not valid
--------------------------
ID==63  not valid
--------------------------
ID==64  not valid
-----------------
ID==65  already seen, not examined
-----------------
ID==66  already seen, not examined
-----------------
ID==67  already seen, not examined
-----------------
ID==68  already seen, not examined
--------------------------
ID==69  not valid
--------------------------
ID==70  not valid
--------------------------
ID==71  not valid
--------------------------
ID==72  not valid
--------------------------
ID==73  not valid
--------------------------
ID==74  not valid
-----------------
ID==75  already seen, not examined
-----------------
ID==76  already seen, not examined
-----------------
ID==77  already seen, not examined
-----------------
ID==78  already seen, not examined
--------------------------
ID==79  not valid
--------------------------
ID==80  not valid
--------------------------
ID==81  not valid
--------------------------
ID==82  not valid
--------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
-----------------
ID==99  already seen, not examined
-----------------
ID==100  already seen, not examined

==========================
ID==101
   ID>lastResult is True : program STOPS

.

And here's the code based on my idea:

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    maxdiff = max(diff)
    others = [x for x in xrange(1,maxdiff) if x not in diff]
    lastothers = others[-1]
    SEEN = set()

    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\nID=='+str(ID)+'  already seen, not examined'
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '------------------------------------\nID=='+str(ID)+'  vaaaalid'
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        for j in others:
                            if ID+j>lastResult:
                                print '\n   j==%2s   %s+%s==%s>lastResult==%s is %s' \
                                      % (j,ID,j,ID+j,lastResult,ID+j>lastResult)
                                ID += j
                                print '\n--------------------------\nnow ID==',ID
                                break
                            runs += 1
                            if runs % 10 == 0:  time.sleep(0.5)
                            curRes = isValid(ID+j)
                            SEEN.add(ID+j)
                            if curRes:
                                print '   j==%2s   ID+j==%s   valid' % (j,ID+j)
                                ID += j
                                print '--------------------------\nID=='+str(ID)
                                break
                            else:
                                print '   j==%2s   ID+j==%s   not valid' % (j,ID+j)

                        if j==lastothers:
                            ID += maxdiff + 1
                            print '   ID += %s + 1 ==> ID==%s' % (maxdiff,ID)
                            break
                        elif ID>lastResult:
                            print '   ID>lastResult==%s>%s is %s ==> WILL STOP' % (ID,lastResult,ID>lastResult)
                            break

            else:
                print '-------------------------\nID=='+str(ID)+'  not valid'
                ID += 1




def isValid(ID, valid_ones = (1,9,17,25,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

-------------------------
ID==0  not valid
------------------------------------
ID==1  vaaaalid
   i== 8   ID+i==9   valid
--------------------------
ID==9
   i== 8   ID+i==17   valid
--------------------------
ID==17
   i== 8   ID+i==25   valid
--------------------------
ID==25
   i== 8   ID+i==33   not valid
   i==18   ID+i==43   not valid
   i== 7   ID+i==32   not valid
   i==17   ID+i==42   not valid
   i== 6   ID+i==31   not valid
   i==16   ID+i==41   not valid
   i== 5   ID+i==30   not valid
   i==15   ID+i==40   not valid
   j== 1   ID+j==26   not valid
   j== 2   ID+j==27   not valid
   j== 3   ID+j==28   not valid
   j== 4   ID+j==29   not valid
   j== 9   ID+j==34   not valid
   j==10   ID+j==35   not valid
   j==11   ID+j==36   not valid
   j==12   ID+j==37   not valid
   j==13   ID+j==38   not valid
   j==14   ID+j==39   not valid
   ID += 18 + 1 ==> ID==44
-------------------------
ID==44  not valid
-------------------------
ID==45  not valid
-------------------------
ID==46  not valid
-------------------------
ID==47  not valid
-------------------------
ID==48  not valid
-------------------------
ID==49  not valid
------------------------------------
ID==50  vaaaalid
   i== 8   ID+i==58   not valid
   i==18   ID+i==68   not valid
   i== 7   ID+i==57   not valid
   i==17   ID+i==67   not valid
   i== 6   ID+i==56   not valid
   i==16   ID+i==66   not valid
   i== 5   ID+i==55   not valid
   i==15   ID+i==65   not valid
   j== 1   ID+j==51   not valid
   j== 2   ID+j==52   valid
--------------------------
ID==52
   i== 8   ID+i==60   valid
--------------------------
ID==60
   i== 8   ID+i==68   not valid
   i==18   ID+i==78   not valid
   i== 7   ID+i==67   not valid
   i==17   ID+i==77   not valid
   i== 6   ID+i==66   not valid
   i==16   ID+i==76   not valid
   i== 5   ID+i==65   not valid
   i==15   ID+i==75   not valid
   j== 1   ID+j==61   not valid
   j== 2   ID+j==62   not valid
   j== 3   ID+j==63   not valid
   j== 4   ID+j==64   not valid
   j== 9   ID+j==69   not valid
   j==10   ID+j==70   not valid
   j==11   ID+j==71   not valid
   j==12   ID+j==72   not valid
   j==13   ID+j==73   not valid
   j==14   ID+j==74   not valid
   ID += 18 + 1 ==> ID==79
-------------------------
ID==79  not valid
-------------------------
ID==80  not valid
-------------------------
ID==81  not valid
-------------------------
ID==82  not valid
------------------------------------
ID==83  vaaaalid
   i== 8   ID+i==91   not valid
   i==18   ID+i==101   not valid
   i== 7   ID+i==90   not valid
   i==17   ID+i==100   not valid
   i== 6   ID+i==89   not valid
   i==16   ID+i==99   not valid
   i== 5   ID+i==88   not valid
   i==15   ID+i==98   valid
--------------------------
ID==98
   i== 8   ID+i==106   not valid
   i==18   ID+i==116   not valid
   i== 7   ID+i==105   not valid
   i==17   ID+i==115   not valid
   i== 6   ID+i==104   not valid
   i==16   ID+i==114   not valid
   i== 5   ID+i==103   not valid
   i==15   ID+i==113   not valid
   j== 1   ID+j==99   not valid
   j== 2   ID+j==100   not valid

   j== 3   98+3==101>lastResult==100 is True

--------------------------
now ID== 101
   ID>lastResult==101>100 is True ==> WILL STOP

==========================
ID==101
   ID>lastResult is True : program STOPS

There is

    if ID in SEEN:
        print '-----------------\nID=='+str(ID)+'  already seen, not examined'
        ID += 1

in this code, but the message 'already seen' is never printed during the execution; however, the detection of valids ID has the same result; that means that the use of a set SEEN isn't necessary in my code.

.

Edit 1

The code #1 with instruction to regularly empty SEEN

import time

lastResult = 100

def checkNextID(ID, lastResult = lastResult, diff = [8,18,7,17,6,16,5,15]):
    runs = 0
    SEEN = set()
    while True:
        if ID>lastResult:
            print ('\n=========================='
                   '\nID==%s'
                   '\n   ID>lastResult is %s : program STOPS')\
                  % (ID,ID>lastResult,)
            break
        runs += 1
        if runs % 10 == 0:  time.sleep(0.5)
        if ID in SEEN:
            print '-----------------\n%s\nID==%s  already seen, not examined' % (SEEN,ID)
            ID += 1
        else:
            curRes = isValid(ID)
            if curRes:
                print '--------------------------\n%s\nID==%s  vaaaalid'  % (SEEN,ID)
                while True:
                    for i in diff:
                        runs += 1
                        if runs % 10 == 0:  time.sleep(0.5)
                        curRes = isValid(ID+i)
                        print '   '+str(SEEN)
                        if i==diff[0]:
                            SEEN = set([ID+i])
                        else:
                            SEEN.add(ID+i)
                        if curRes:
                            print '   i==%2s   ID+i==%s   valid' % (i,ID+i)
                            ID += i
                            print '--------------------------\nID==%s' % str(ID)
                            break
                        else:
                            print '   i==%2s   ID+i==%s   not valid' % (i,ID+i)
                    else:
                        ID += 1
                        break
            else:
                print '--------------------------\n%s\nID==%s  not vaaaaalid' % (SEEN,ID)
                ID += 1


def isValid(ID, valid_ones = (1,9,17,25,30,50,52,60,83,97,98)):
    return ID in valid_ones


checkNextID(0)

result

--------------------------
set([])
ID==0  not vaaaaalid
--------------------------
set([])
ID==1  vaaaalid
   set([])
   i== 8   ID+i==9   valid
--------------------------
ID==9
   set([9])
   i== 8   ID+i==17   valid
--------------------------
ID==17
   set([17])
   i== 8   ID+i==25   valid
--------------------------
ID==25
   set([25])
   i== 8   ID+i==33   not valid
   set([33])
   i==18   ID+i==43   not valid
   set([33, 43])
   i== 7   ID+i==32   not valid
   set([32, 33, 43])
   i==17   ID+i==42   not valid
   set([32, 33, 42, 43])
   i== 6   ID+i==31   not valid
   set([32, 33, 42, 43, 31])
   i==16   ID+i==41   not valid
   set([32, 33, 41, 42, 43, 31])
   i== 5   ID+i==30   valid
--------------------------
ID==30
   set([32, 33, 41, 42, 43, 30, 31])
   i== 8   ID+i==38   not valid
   set([38])
   i==18   ID+i==48   not valid
   set([48, 38])
   i== 7   ID+i==37   not valid
   set([48, 37, 38])
   i==17   ID+i==47   not valid
   set([48, 37, 38, 47])
   i== 6   ID+i==36   not valid
   set([48, 36, 37, 38, 47])
   i==16   ID+i==46   not valid
   set([36, 37, 38, 46, 47, 48])
   i== 5   ID+i==35   not valid
   set([35, 36, 37, 38, 46, 47, 48])
   i==15   ID+i==45   not valid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==31  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==32  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==33  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==34  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==35  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==36  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==37  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==38  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==39  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==40  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==41  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==42  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==43  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==44  not vaaaaalid
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==45  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==46  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==47  already seen, not examined
-----------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==48  already seen, not examined
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==49  not vaaaaalid
--------------------------
set([35, 36, 37, 38, 45, 46, 47, 48])
ID==50  vaaaalid
   set([35, 36, 37, 38, 45, 46, 47, 48])
   i== 8   ID+i==58   not valid
   set([58])
   i==18   ID+i==68   not valid
   set([58, 68])
   i== 7   ID+i==57   not valid
   set([57, 58, 68])
   i==17   ID+i==67   not valid
   set([57, 58, 67, 68])
   i== 6   ID+i==56   not valid
   set([56, 57, 58, 67, 68])
   i==16   ID+i==66   not valid
   set([66, 67, 68, 56, 57, 58])
   i== 5   ID+i==55   not valid
   set([66, 67, 68, 55, 56, 57, 58])
   i==15   ID+i==65   not valid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==51  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 55, 56, 57, 58])
ID==52  vaaaalid
   set([65, 66, 67, 68, 55, 56, 57, 58])
   i== 8   ID+i==60   valid
--------------------------
ID==60
   set([60])
   i== 8   ID+i==68   not valid
   set([68])
   i==18   ID+i==78   not valid
   set([68, 78])
   i== 7   ID+i==67   not valid
   set([67, 68, 78])
   i==17   ID+i==77   not valid
   set([67, 68, 77, 78])
   i== 6   ID+i==66   not valid
   set([66, 67, 68, 77, 78])
   i==16   ID+i==76   not valid
   set([66, 67, 68, 76, 77, 78])
   i== 5   ID+i==65   not valid
   set([65, 66, 67, 68, 76, 77, 78])
   i==15   ID+i==75   not valid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==61  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==62  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==63  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==64  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==65  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==66  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==67  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==68  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==69  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==70  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==71  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==72  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==73  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==74  not vaaaaalid
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==75  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==76  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==77  already seen, not examined
-----------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==78  already seen, not examined
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==79  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==80  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==81  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==82  not vaaaaalid
--------------------------
set([65, 66, 67, 68, 75, 76, 77, 78])
ID==83  vaaaalid
   set([65, 66, 67, 68, 75, 76, 77, 78])
   i== 8   ID+i==91   not valid
   set([91])
   i==18   ID+i==101   not valid
   set([91, 101])
   i== 7   ID+i==90   not valid
   set([90, 91, 101])
   i==17   ID+i==100   not valid
   set([90, 91, 100, 101])
   i== 6   ID+i==89   not valid
   set([89, 90, 91, 100, 101])
   i==16   ID+i==99   not valid
   set([99, 100, 101, 89, 90, 91])
   i== 5   ID+i==88   not valid
   set([99, 100, 101, 88, 89, 90, 91])
   i==15   ID+i==98   valid
--------------------------
ID==98
   set([98, 99, 100, 101, 88, 89, 90, 91])
   i== 8   ID+i==106   not valid
   set([106])
   i==18   ID+i==116   not valid
   set([106, 116])
   i== 7   ID+i==105   not valid
   set([105, 106, 116])
   i==17   ID+i==115   not valid
   set([105, 106, 115, 116])
   i== 6   ID+i==104   not valid
   set([104, 105, 106, 115, 116])
   i==16   ID+i==114   not valid
   set([104, 105, 106, 114, 115, 116])
   i== 5   ID+i==103   not valid
   set([103, 104, 105, 106, 114, 115, 116])
   i==15   ID+i==113   not valid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==99  not vaaaaalid
--------------------------
set([103, 104, 105, 106, 113, 114, 115, 116])
ID==100  not vaaaaalid

==========================
ID==101
   ID>lastResult is True : program STOPS

This above code shows the process concerning the recording and emptying the values of ID already seen. It is a code as good as it is conceivable, since the algorithm comprises the periodical emptying of SEEN, because it 's true that emptying is necessary, given the number of IDs to te be tested.

But from the beginning, my opinion was that this process concernin SEEN in this algorithm weighs on the performance, because recording and testing instructions concerning a SEEN are repeatedly executed at every step of the program.

That's why I imagined there should be another algorithm having not this drawback. I finally succeeded to write such an alternative code, and now we have two codes having two different algorithms.

Concerning your question "are you sure it's not necessary to use the SEEN logic in the second?"
I answer 'yes, I think I can be sure'. The intent of runing my code #2 with instructions managing SEEN was to make me sure after verifying what was a mental idea and a conceptual algorithm. If you want to become sure, you must do the same: - study conceptually and precisely the algorithm - do as many essays of execution of the two codes and comparison of their results as you need to be experimentally convinced, varying the values of lastResult, valid_ones and diff For me , this point is closed as long as there won't be a contradictory practical case proving that my conclusion is wrong.

I continue in the other answer, because the number of characters is limited in this one

First of all, Thanks you for your efforts to help me! in the first and second cone I think there's a need to change 'SEEN' to SeenInValid, cause a list of all requests (800K-900K at least) will consume a lot of memory, and the if ID in SEEN will take a lot of resources after a while, that's why I think it need to be emptied when ID is bigger then max(SEEN), what do you think? btw, I tried using your 2# try with a bigger list and played with the valid list and I couldn't make it print 'already seen' which is weird, isn't it? are you sure it's not necessary to use the SEEN logic in the second? — YSY, Aug 01 '11 at 05:04
@YSY You're right concerning the memory consumption by SEEN in your real case. It was implicit in my thoughts that I was posting about the algorithm, and that this problem with memory was to be treated by yourself. — eyquem, Aug 01 '11 at 07:49
@YSY Why **SeenInValid** ? In the execution of the code #1 (based on your idea), we can see: **ID==30 already seen, not examined** and **ID==31 already seen, not examined** ...and the same with 32, 33, 40, 41, 42, 43, 65, 66, 67, 68, etc. The numbers 31, 32, 33, 40, 41, 42, 43, 65, 66 .... are not valid values for ID, however they are the ones that need to be in SEEN in order to avoid the tests ``curRes = isValid(31)`` and ``curRes = isValid(32)`` etc to be performed. — eyquem, Aug 01 '11 at 07:57
you're right, I need to change the way I'm looking at problems. Thanks! — YSY, Aug 01 '11 at 14:17

Michael Dillon · Answer 3 · 2011-07-31T00:20:49.563

First of all, you should break up the job into two processes. One to determine valid ids, and the other to retrieve data.

The one that determines valid ids only needs to use http HEAD commands and can work faster than the one that retrieves pages.

For checking pages, after you check the id increments in diff, then add 18 to the id that caused you to start using diff. You can even record the ranges that were only partially checked by using diff, and come back later, at the end of the process, and check all of them as well.

If you can't skip any ids, then keep a cache of the last n ids that were checked where n is equal to len(diff). Use a ring buffer something like this:

nextelem = 0
...
# check before retrieving
if not id in ringbuff:
    #retrieve an id
    ringbuf[nextelem] = id
    nextelem += 1
    if nextelem > len[ringbuff]:
        nextelem = 0

...

On the surface of it, a simple loop like this should check all ids:

for id in xrange(1000000):
    checkpage(id)

This would check every possible page. But you want to read ahead when you get a hit, and also backtrack partially if I understand correctly. In any case, what you are fundamentally doing is changing the range of ids from the simple sequence returned by xrange, so I think you need to write a generator and do this instead:

for id in myrange(1000000):
    checkpage(id)

You might still want to use a ringbuffer, depending on what you do within that range of 18 possible additional hits. If you need to check all of the possibilities in diff and then go back to something less than the maximum element in diff, then the ring buffer would be useful in checkpage.

But the trick is to write myrange().

def myrange(maxnum):
    global hitfound
    global nextnum
    global diff
    curnum = 0
    while curnum < maxnum:
        yield curnum
        if hitfound:
            nextnum = curnum
            hitnum = curnum
            for e in diff:
                yield hitnum + e
            curnum = nextnum - 1
        curnum += 1

The three global variables let you influence the range of ids. If you set hitfound = True inside checkpage() whenever you get a good page, then you influence myrange to start applying the increments in diff. Then, you can set nextnum to influence where it starts incrementing after you start applying the diff increments. For instance you might decide to set it to 1 greater than the first (or last) hit that you find while checking diff increments. Or you could leave it alone, and use the ring buffer to ensure that you don't request any of the diff increment pages again.

I suggest that you extract the id incrementing logic and test it separately like my code above. Tweak the generator myrange() until it produces the right sequence, and then pop it into your web scraping program.

I wasn't clear enough, every request I make will return 200 OK, isValid() is checking if the page contains the data I need, I'm fetching the data in isValid() and updating a soup object if I find the data I need (there's only one request per ID to the server) about the diff part, I'm sorry but I can't see how it's done, I'm incrementing by 8 and checking, then by 18 and so on until I find the a URL which contain data, if there isn't a url that matches one of the diffs I'm increasing by 1 and then checking again, which results duplicate requests and inefficient progress — YSY, Jul 29 '11 at 22:18
I'm skipping the IDs in the diff until I can't, I've tried keeping a cache of the last IDs which were not valid but I still have a flaw in my design. if I'll use your offer I'll have a huge buffer to run on in the end, and I can't really see how this solves my problem? — YSY, Jul 29 '11 at 22:29
I think you need to closely read the code that I posted. A ring buffer is fixed size, i.e. it never grows. I think your problem is that you are doing a type of recursion, and if you cache the few most recent requests that will help you avoid retrieving an id twice. — Michael Dillon, Jul 29 '11 at 23:26
I'm not really using a recursion, I'm tried understanding how your code can help me and read about ring buffers, but I can't get it. I basically created a cache buffer for my requests (if tryID not in badTries), but I don't think I'm using it as I should. — YSY, Jul 29 '11 at 23:37
I've tried implementing your solution, where do you find the next id? lets say I'm starting with id = 1 and ringbuff is at the size of len(diff). if not id in ringbuff: getID(ID) - now I need to check and see if my id is valid or not, because I need to Increase the id by diff[i] each time if it's valid. I can't see how your solution answers the problem. this is the problem - I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the id + diff[i], if diff[i] isn't a valid id I'm checking the next id which is id+diff[i+1]... — YSY, Jul 30 '11 at 16:46
, if there isn't a valid id on that iteration i'm increasing id by 1 and then I check again with id+diff[i]... diff[i+1], not in each iteration I have double ids which I've already checked in the previous iteration, I need to avoid checking those IDs and keep increasing the ID until I'll find the next valid ID. — YSY, Jul 30 '11 at 16:47

Python Crawler - need help with my algorithm

3 Answers3

Final Edit

Edit 1

I continue in the other answer, because the number of characters is limited in this one