So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with Beauitfulsoup. Two questions: How do I do this more dynamically then using nested while statements searching for links. I want to get all the links from this site. But I don't want to continue to put nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)
length = len(listOfLinks)
count = 0
while(count < length):
twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
twoListOfLinks = list(twoLevelLinks)
twoCount = 0
twoLength = len(twoListOfLinks)
for twoLinks in twoListOfLinks:
listOfLinks.append(twoLinks)
count = count + 1
while(twoCount < twoLength):
threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
threeListOfLinks = list(threeLevelLinks)
for threeLinks in threeListOfLinks:
listOfLinks.append(threeLinks)
twoCount = twoCount +1
print '--------------------------------------------------------------------------------------'
#remove all duplicates
finalList = list(set(listOfLinks))
print finalList
My second questions is there anyway to tell if I got all the links from the site. Please forgive me, I am somewhat new to python (year or so) and I know some of my processes and logic might be childish. But I have to learn somehow. Mainly I just want to do this more dynamic then using nested while loop. Thanks in advance for any insight.