
So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than using nested while statements to search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.

    topLevelLinks = self.getAllUniqueLinks(baseUrl)
    listOfLinks = list(topLevelLinks)       

    length = len(listOfLinks)
    count = 0       

    while(count < length):

        twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
        twoListOfLinks = list(twoLevelLinks)
        twoCount = 0
        twoLength = len(twoListOfLinks)

        for twoLinks in twoListOfLinks:
            listOfLinks.append(twoLinks)

        count = count + 1

        while(twoCount < twoLength):
            threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])  
            threeListOfLinks = list(threeLevelLinks)

            for threeLinks in threeListOfLinks:
                listOfLinks.append(threeLinks)

            twoCount = twoCount +1



    print '--------------------------------------------------------------------------------------'
    #remove all duplicates
    finalList = list(set(listOfLinks))  
    print finalList

My second question: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.

hackthisjay
  • Why don't you just use one array as an accumulator for all the links, and then queue them in as you find more on the site? – HRÓÐÓLFR Jul 20 '11 at 22:45
  • OK, great idea. How would I continue to find more? The above code only goes three levels down in the page tree. I want to make this more dynamic than nested while loops. – hackthisjay Jul 20 '11 at 22:51
    You don't need to nest. Run once through the HTML of the page and build an array of all the links, then go through the next link. Unless you want to do depth-first, in which case why not use a recursive function, though eventually it will overflow the stack... the web is big :O – HRÓÐÓLFR Jul 20 '11 at 22:57
  • Here is a solution with `lxml`: http://ms4py.org/2010/04/27/python-search-engine-crawler-part-1/ – schlamar Jul 21 '11 at 10:06

5 Answers


Spidering a web site and collecting all of its links is a common problem. If you search Google for "spider web site python" you can find libraries that will do this for you. Here's one I found:

http://pypi.python.org/pypi/spider.py/0.5

Even better, Google found this question already asked and answered here on Stack Overflow:

Anyone know of a good Python based web crawler that I could use?

steveha
  • I did originally look at that post on Stack Overflow and decided on using BeautifulSoup + urllib2. But the primary question is: how would I make these nested while loops more dynamic? I'll look at spider.py. Thanks for the information. – hackthisjay Jul 20 '11 at 22:55

If you are using BeautifulSoup, why not use the findAll() method? Basically, in my crawler I do:

    self.soup = BeautifulSoup(HTMLcode)
    for frm in self.soup.findAll('frame'):
        try:
            if not frm.has_key('src'):
                continue
            src = frm['src']
            # rest of URL processing here
        except Exception, e:
            print 'Parser <frame> tag error: ', str(e)

for the frame tag. The same goes for "img src" and "a href" tags. I like the topic though - maybe it's me who has something wrong here... Edit: there is of course a top-level instance which saves the URLs and fetches the HTML code for each link later.
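For instance, the same pattern applied to anchor tags might look roughly like this (a sketch only, kept in the same Python 2 / BeautifulSoup style as the snippet above):

    for a in self.soup.findAll('a'):
        try:
            if not a.has_key('href'):
                continue
            href = a['href']
            # rest of URL processing here
        except Exception, e:
            print 'Parser <a> tag error: ', str(e)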

ejndzel

To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and the two are similar enough for you to follow along easily):

#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link unless links.include? link
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'
count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end

Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, understand.
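If you would rather stay in Python, a rough Python 2 equivalent of the same loop might look like this (a sketch only, assuming BeautifulSoup 3 and urllib2 as mentioned in the question):

    import urllib2
    from urlparse import urljoin
    from BeautifulSoup import BeautifulSoup

    def get_hyperlinks(url):
        # return all the hyperlinks found on one page
        links = []
        try:
            soup = BeautifulSoup(urllib2.urlopen(url).read())
            for a in soup.findAll('a'):
                if not a.has_key('href'):
                    continue
                link = urljoin(url, a['href'])
                if link not in links:
                    links.append(link)
        except Exception:
            print 'Looks like we can\'t be here...'
        return links

    hyperlinks = [raw_input('Enter a start URL: ')]
    visited = []
    while hyperlinks:
        link = hyperlinks.pop(0)
        if link in visited:
            continue
        visited.append(link)
        print 'Connecting to %s...' % link
        links = get_hyperlinks(link)
        print 'Found %d links on %s...' % (len(links), link)
        # prepend the new links, as the Ruby version does
        hyperlinks = links + hyperlinks
        print 'Moving on with %d links left...' % len(hyperlinks)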

HRÓÐÓLFR

1) In Python, we do not count a container's elements and use the count to index into it; we just iterate over the elements, because that is what we actually want to do.

2) To handle multiple levels of links, we can use recursion.

    def followAllLinks(self, from_where):
        for link in self.getAllUniqueLinks(from_where):
            self.followAllLinks(link)

This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
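For example, a minimal sketch of that idea, still assuming the getAllUniqueLinks helper from the question:

    def followAllLinks(self, from_where, visited=None):
        # carry a set of already-visited URLs through the recursion
        if visited is None:
            visited = set()
        if from_where in visited:
            return visited
        visited.add(from_where)
        for link in self.getAllUniqueLinks(from_where):
            self.followAllLinks(link, visited)
        return visited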

Karl Knechtel
  • How would I build a set of already-visited links using this method you submitted? This is on the right track. Thanks so much. – hackthisjay Jul 20 '11 at 23:36

Use Scrapy:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
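As a rough illustration (not part of the quoted description, and assuming Scrapy 1.4 or later, where response.follow is available), a link-collecting spider can be as short as this:

    import scrapy

    class LinkSpider(scrapy.Spider):
        name = 'links'
        start_urls = ['http://www.example.com/']  # hypothetical start page

        def parse(self, response):
            # yield every link on the page, then follow it and repeat;
            # Scrapy's scheduler skips URLs it has already requested
            for href in response.xpath('//a/@href').extract():
                yield {'link': response.urljoin(href)}
                yield response.follow(href, callback=self.parse)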

warvariuc