** Added a summary of the problem at the end of the post **
I've written a crawler that fetches and parses URLs.
in the first version, in order to get the next valid page I was increasing the URL ID and saved non-valid IDs to a file, the valid URLs were moved to a parser that parsed the content I need, after a while I noticed that most valid IDs has a returning subtrahend.
I made some statistics and got to that list of subtrahends - [8,18,7,17,6,16,5,15], ordered by the most repeating to the least.
so I changed my code to -
def checkNextID(ID):
numOfRuns = 0
global curRes, lastResult
while ID < lastResult:
try:
numOfRuns += 1
if numOfRuns % 10 == 0:
time.sleep(7) # sleep every 10 iterations
numOfRuns = 0
if isValid(ID + 8):
parseHTML(curRes, ID)
ID = ID + 8
elif isValid(ID + 18):
parseHTML(curRes, ID)
ID = ID + 18
elif isValid(ID + 7):
parseHTML(curRes, ID)
ID = ID + 7
elif isValid(ID + 17):
parseHTML(curRes, ID)
ID = ID + 17
elif isValid(ID+6):
parseHTML(curRes, ID)
ID = ID + 6
elif isValid(ID + 16):
parseHTML(curRes, ID)
ID = ID + 16
elif isValid(ID + 5):
parseHTML(curRes, ID)
ID = ID + 5
elif isValid(ID + 15):
parseHTML(curRes, ID)
ID = ID + 15
else:
if isValid(ID + 1):
parseHTML(curRes, ID)
ID = ID + 1
except Exception, e:
print "something went wrong: " + str(e)
isValid() is a function that gets an ID + one of the subtrahends and returns True if the url contains what I need and saves a soup object of the url to a global variable named - 'curRes', it returns False if the url doesn't contain the data I need and saves the id to 'badIdFile'.
parseHTML is a function that gets the soup object (curRes), parses the data I need and then saves the data to a csv, then returns True.
in a perfect world that piece of code would be everything I need in order to run on all valid IDs (there are about 400K in a range of 5M), it gave me better results in less time (x50 faster).
BUT, when getting to ranges that doesn't contain any valid URL, my code is very inefficient, it turns that I'm crawling mostly the same URLs over and over in each iteration, this is because I'm increasing the ID by one in order to keep progressing until I'll find the next valid URL and then I'm checking the ID + 8, then 18, 17 etc' which sometimes gives me the same URLs I checked in the previous iteration.
so I went and changed the code so it will keep a a set of non-valid URLs which I'll avoid checking again and I can't get it to work, I'm breaking my head for a few hours now, it's not working like it should.
this is my new function -
def checkNextID(ID):
runs = 0
badTries = set()
diff = [8,18,7,17,6,16,5,15,1]
global curRes, lastResult
while ID < lastResult:
if len(badTries) == 100: #every 100 non-valid IDs I can reset badTries(I think), if I've increased the ID 100 times it means I won't be checking previous IDs and I can start saving the non-valid IDs again.
badTries = set()
try:
for i in diff:
tryID = ID + i #tryID will be the ID + one of the subtrahends.
if tryID not in badTries: #if we haven't already tried that ID
if runs % 10 == 0:
time.sleep(6) #sleep every 10 requests
if isValid(tryID):
runs += 1
parseHTML(curRes, ID)
ID = tryID
badTries = set() #I can reset the badTries now, I'm moving forward.
break #will move to the next id.
else:
badTries.add(ID) #save the non-valid ID to the set
if i == 1: #it's the last iteration of the for loop, if the subtrahend is 1 and it's not valid I must increase the ID by 1 in order to go forward.
ID += 1
else:
ID += 1 #if the ID is not a valid ID and I've already checked it I must increase the ID by one or I'll get into an infinite loop.
except Exception, e:
print "something went wrong: " + str(e) + ' ID - ' + str(ID)
I'm saving each non-valid ID to a set, and before every call to isValid() I'm checking if I've already tried that ID, if I didn't, I'm calling isValid(), else, ID is increased by one.
that's how the bad ID file looks like -
513025328
513025317
513025327
513025316
513025326
513025312
513025320
513025330
513025319
513025329
513025318
513025328
513025317
513025327
513025313
513025321
513025331
513025320
513025330
513025319
513025329
513025318
513025328
513025314
513025322
513025332
513025321
513025331
513025320
513025330
513025319
513025329
513025315
513025323
513025333
513025322
513025332
513025321
513025331
513025320
513025330
513025316
513025324
513025334
513025323
513025333
513025322
513025332
513025321
513025331
513025317
513025325
513025335
513025324
513025334
513025323
513025333
513025322
513025332
513025318
513025326
513025336
513025325
513025335
513025324
513025334
513025323
513025333
513025319
513025327
513025337
513025326
513025336
513025325
513025335
513025324
513025334
513025320
513025328
513025338
513025327
513025337
513025326
513025336
513025325
513025335
513025321
513025329
513025339
513025328
513025338
513025327
513025337
513025326
513025336
513025322
513025330
513025340
513025329
513025339
513025328
513025338
513025327
513025337
513025323
513025331
513025341
513025330
513025340
513025329
513025339
513025328
513025338
513025324
513025332
513025342
513025331
513025341
513025330
513025340
513025329
513025339
513025325
513025333
513025343
513025332
513025342
513025331
513025341
513025330
513025340
513025326
513025334
513025344
513025333
513025343
513025332
513025342
513025331
513025341
513025327
513025335
513025345
513025334
513025344
513025333
513025343
513025332
513025342
513025328
513025336
513025346
513025335
513025345
513025334
513025344
513025333
513025343
513025329
513025337
513025347
513025336
513025346
513025335
513025345
513025334
513025344
513025330
513025338
513025348
513025337
513025347
513025336
513025346
513025335
513025345
513025331
513025339
513025349
513025338
513025348
513025337
513025347
513025336
513025346
513025332
513025340
513025350
513025339
513025349
513025338
513025348
513025337
513025347
513025333
513025341
513025351
513025340
513025350
513025339
513025349
513025338
513025348
513025334
513025342
513025352
513025341
513025351
513025340
513025350
513025339
513025349
513025335
513025343
513025353
513025342
513025352
513025341
513025351
513025340
513025350
513025336
513025344
513025354
513025343
513025353
513025342
513025352
513025341
513025351
513025337
513025345
513025355
513025344
513025354
513025343
513025353
513025342
513025352
513025338
513025346
513025356
513025345
513025355
513025344
513025354
513025343
513025353
513025339
513025347
513025357
513025346
513025356
513025345
513025355
513025344
513025354
513025340
513025348
513025358
513025347
513025357
513025346
513025356
513025345
513025355
513025341
513025349
513025359
513025348
513025358
513025347
513025357
513025346
513025356
513025342
513025350
513025360
513025349
513025359
513025348
513025358
513025347
513025357
513025343
513025351
513025361
513025350
513025360
513025349
513025359
513025348
513025358
513025344
513025352
513025362
513025351
513025361
513025350
513025360
513025349
513025359
513025345
513025353
513025363
513025352
513025362
513025351
513025361
513025350
513025360
513025346
513025354
513025364
513025353
513025363
513025352
513025362
513025351
513025361
513025347
513025355
513025365
513025354
513025364
513025353
513025363
513025352
513025362
513025348
513025356
513025366
513025355
513025365
513025354
513025364
513025353
513025363
513025349
513025357
513025367
513025356
513025366
513025355
513025365
513025354
513025364
513025350
513025358
513025368
513025357
513025367
513025356
513025366
513025355
513025365
513025351
513025359
513025369
513025358
513025368
513025357
513025367
513025356
513025366
513025352
513025360
513025370
513025359
513025369
513025358
513025368
513025357
513025367
513025353
513025361
513025371
513025360
513025370
513025359
513025369
513025358
513025368
513025354
513025362
513025372
513025361
513025371
513025360
513025370
513025359
513025369
513025355
513025363
513025373
513025362
513025372
513025361
513025371
513025360
513025370
513025356
513025364
513025374
513025363
513025373
513025362
513025372
513025361
513025371
513025357
513025365
513025375
513025364
513025374
513025363
513025373
513025362
513025372
513025358
513025366
513025376
513025365
513025375
513025364
513025374
513025363
513025373
513025359
513025367
513025377
513025366
513025376
513025365
513025375
513025364
513025374
513025360
513025368
513025378
513025367
513025377
513025366
513025376
513025365
513025375
513025361
513025369
513025379
513025368
513025378
513025367
513025377
513025366
513025376
513025362
513025370
513025380
513025369
513025379
513025368
513025378
513025367
513025377
513025363
513025371
513025381
513025370
513025380
513025369
513025379
513025368
513025378
513025364
513025372
513025382
513025371
513025381
513025370
513025380
513025369
513025379
513025365
513025373
513025383
513025372
513025382
513025371
513025381
513025370
513025380
513025366
513025374
513025384
513025373
513025383
513025372
513025382
513025371
513025381
513025367
513025375
513025385
513025374
513025384
513025373
513025383
513025372
513025382
513025368
513025376
513025386
513025375
513025385
513025374
513025384
513025373
513025383
513025369
513025377
513025387
513025376
513025386
513025375
513025385
513025374
513025384
513025370
513025378
513025388
513025377
513025387
513025376
513025386
513025375
513025385
513025371
513025379
513025389
513025378
513025388
513025377
513025387
513025376
513025386
513025372
513025380
513025390
513025379
513025389
513025378
513025388
513025377
513025387
513025373
513025381
513025391
513025380
513025390
513025379
as you can see, it's not working, I know I have a flaw in the whole design but I can't find it, I would really appreciate your help.
a summary of the problem -
I have a diff list [8,18,7,17,6,16,5,15] the program starts with getting an id, each time I need to check the next id which is - id + diff[i] (i=0) if (id + diff[i]) isn't a valid id I'm checking the next id which is (id + diff[i+1]).
if there isn't a valid id on that iteration (id + diff[i..n]) I'm increasing id by 1, and check if id+1 is valid id, if not I'm checking again with id + diff[i..n], until I'll find the next valid id.
in each iteration I'm checking ids I've already checked in the previous iteration (which costs a lot of time and is inefficient), I need to avoid checking the IDs I've already checked and keep increasing the ID until I'll find the next valid ID.
as for now, if the id = 1 (and it's a valid id) and diff = [8,18,7,17,6,16,5,15].
first iteration will look like (I'm marking with bold the id's which I could avoid checking) - first - id = 1
9, 19, 8, 18, 7, 17, 6, 16, 2
second - id = 2
10, 20, 9, 19, 8, 18, 7, 17, 3
third - id = 3
11, 21, 10, 20, 9, 19, 8, 18, 4
fourth - id = 4
12, 22 - BINGO, next valid ID is 22!
that cost me 29 requests, instead of - 17, and that's a small example, I have ranges that are 300-600 ids from the last valid id.
I can't get my code to avoid checking previously checked ids with a smart and efficient way.
Thanks!