
I am trying to implement a simple web crawler and I have already written some simple code to start off. There are two modules, fetcher.py and crawler.py. Here are the files:

fetcher.py :

    import urllib2
    import re

    def fetcher(s):
            "fetch a web page from a url"
            try:
                    req = urllib2.Request(s)
                    urlResponse = urllib2.urlopen(req).read()
            except urllib2.URLError as e:
                    print e.reason
                    return

            p,q = s.split("//")
            d = q.split("/")
            fdes = open(d[0],"w+")
            fdes.write(str(urlResponse))
            fdes.seek(0)
            return fdes


    if __name__ == "__main__":
            defaultSeed = "http://www.python.org"
            print fetcher(defaultSeed)

crawler.py :

    from bs4 import BeautifulSoup
    import re
    from fetchpage import fetcher

    usedLinks = open("Used","a+")
    newLinks = open("New","w+")

    newLinks.seek(0)

    def parse(fd,var=0):
            soup = BeautifulSoup(fd)
            for li in soup.find_all("a",href=re.compile("http")):
                    newLinks.seek(0,2)
                    newLinks.write(str(li.get("href")).strip("/"))
                    newLinks.write("\n")

            fd.close()
            newLinks.seek(var)
            link = newLinks.readline().strip("\n")

            return str(link)


    def crawler(seed,n):
            if n == 0:
                    usedLinks.close()
                    newLinks.close()
                    return
            else:
                    usedLinks.write(seed)
                    usedLinks.write("\n")
                    fdes = fetcher(seed)
                    newSeed = parse(fdes,newLinks.tell())
                    crawler(newSeed,n-1)

    if __name__ == "__main__":
            crawler("http://www.python.org/",7)

The problem is that when I run crawler.py it works fine for the first 4-5 links, then it hangs, and after a minute it gives me the following error:

    [Errno 110] Connection timed out
    Traceback (most recent call last):
      File "crawler.py", line 37, in <module>
        crawler("http://www.python.org/",7)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 34, in crawler
        crawler(newSeed,n-1)
      File "crawler.py", line 33, in crawler
        newSeed = parse(fdes,newLinks.tell())
      File "crawler.py", line 11, in parse
        soup = BeautifulSoup(fd)
      File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
        self.builder.prepare_markup(markup, from_encoding))
      File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
        dammit = UnicodeDammit(markup, try_encodings, is_html=True)
      File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
        self._detectEncoding(markup, is_html)
      File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
        xml_encoding_match = xml_encoding_re.match(xml_data)
    TypeError: expected string or buffer

Can anyone help me with this? I am very new to Python and I cannot figure out why it says the connection timed out after some time.

Deepankar Bajpeyi

2 Answers


A connection timeout is not specific to Python; it just means that you made a request to the server and the server did not respond within the amount of time your application was willing to wait.

One very possible reason this could occur is that python.org may have some mechanism to detect when it is getting multiple requests from a script, and it probably just stops serving pages completely after 4-5 requests. There is nothing you can really do to avoid this other than trying out your script on a different site.
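If you do still want to crawl the same site, a minimal sketch of a gentler fetch might look like the following (Python 2; the function name `polite_fetch` and the 10-second/2-second values are just placeholders I picked, not anything from your code):

    import socket
    import time
    import urllib2

    def polite_fetch(url, timeout_seconds=10, delay_seconds=2):
        "Fetch a page with an explicit timeout and a pause between requests."
        data = None
        try:
            # urlopen accepts a timeout in seconds; without one the call can
            # block until the OS-level connect timeout expires, which is the
            # long hang you see before [Errno 110] is finally raised.
            response = urllib2.urlopen(url, timeout=timeout_seconds)
            data = response.read()
        except (urllib2.URLError, socket.timeout) as e:
            print "failed to fetch %s: %s" % (url, e)
        # Pausing between requests makes the crawler less aggressive, which
        # may help if the server is throttling rapid-fire requests.
        time.sleep(delay_seconds)
        return data

This will not bypass a server that has decided to stop answering you, but it does keep one slow link from stalling the whole crawl.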

Gordon Bailey
  • Yes, and it gets stuck after the exact same number of links every time. I ran the same script on another website and it did not give me this problem. :) Thanks for the response. – Deepankar Bajpeyi Jan 24 '13 at 04:23

You could try using proxies to avoid being detected when making multiple requests, as stated above. You might want to check out this answer to get an idea of how to send urllib requests through proxies: How to open website with urllib via Proxy - Python
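As a rough illustration (not taken from that answer verbatim, and the proxy address below is only a placeholder, not a working proxy), routing urllib2 through a proxy looks roughly like this:

    import urllib2

    # Placeholder address -- substitute an HTTP proxy you actually have
    # access to; this one will not work as written.
    proxy_handler = urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)

    # After install_opener(), plain urlopen() calls -- including the one in
    # your fetcher() -- go through the proxy.
    page = urllib2.urlopen("http://www.python.org/").read()

Rotating between a few proxies would spread the requests across different source addresses, which is the point of the suggestion above.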

CounterFlame