I am trying to build a basic web crawler using Beautiful Soup in Python 2.7. Here is my code:

import re
import httplib
import urllib2
from urlparse import urlparse
from bs4 import BeautifulSoup

regex = re.compile(
        r'^(?:http|https)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    if regex.match(url) is not None:
        return True;
    return False

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        print 'Crawled:'+page
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup.findAll('a',href=True)        
        if page not in crawled:
            for l in links:
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)   
    return crawled

crawler('https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl')

I am getting the error:

Crawled:https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl
Traceback (most recent call last):
  File "web_crawler_python_2.py", line 38, in <module>
    crawler('https://www.google.co.in/?gfe_rd=cr&ei=SfWxVs65JK_v8we9zrj4AQ&gws_rd=ssl')
  File "web_crawler_python_2.py", line 29, in crawler
    soup=BeautifulSoup.BeautifulSoup(s)
AttributeError: type object 'BeautifulSoup' has no attribute 'BeautifulSoup'

I have tried a lot but can't seem to debug it. Can anyone point me to the problem? (As a side note, I know that many websites do not allow crawling, but I'm only doing this to learn.)

Thanks, any help would be appreciated.

Source I have used for the code: simple web crawler

Mahatma Gandhi

1 Answer

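The `BeautifulSoup` class has no attribute named `BeautifulSoup`. Your code already imports the class itself with `from bs4 import BeautifulSoup`, so call it directly, as in this example from the documentation: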

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

You need to replace:

BeautifulSoup.BeautifulSoup

with

BeautifulSoup
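
To illustrate the corrected call in the context of link extraction, here is a minimal sketch. It assumes the `bs4` package is installed; the `html_doc` string is made-up sample markup, not output from the crawler in the question, and `find_all` is the version 4 spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = (
    '<p>'
    '<a href="https://example.com/">one</a> '
    '<a name="anchor-only">no href</a> '
    '<a href="/relative">two</a>'
    '</p>'
)

# Call the imported class directly; 'html.parser' is the parser bundled with Python.
soup = BeautifulSoup(html_doc, 'html.parser')

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)
```

Note that relative links such as `/relative` are returned as written, so a URL check like `isValidUrl` in the question would reject them unless they are first joined with the page URL.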
JRazor
  • Thank you! It worked! Can you please tell me what the problem was? – Mahatma Gandhi Feb 03 '16 at 12:57
  • @MahatmaGandhi: the usage `soup=BeautifulSoup.BeautifulSoup(s)` is good for version 3; for version 4, the usage is: `soup=bs4.BeautifulSoup(s)` – Quinn Feb 03 '16 at 15:07
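
The difference described in the comment above comes down to whether the name `BeautifulSoup` refers to a module or to a class. The following sketch reproduces the error without installing either library version: the `BeautifulSoup` class and the `v3_module` object here are hypothetical stand-ins built with the standard library, not the real packages.

```python
import types


class BeautifulSoup(object):
    """Hypothetical stand-in for the real parser class (illustration only)."""

    def __init__(self, markup):
        self.markup = markup


# Version 3 layout: `import BeautifulSoup` binds a MODULE that contains the class,
# so module.Class is a valid lookup. A fake module object mimics that here.
v3_module = types.ModuleType('BeautifulSoup')
v3_module.BeautifulSoup = BeautifulSoup
soup = v3_module.BeautifulSoup('<a href="/x">x</a>')

# Version 4 layout: `from bs4 import BeautifulSoup` binds the CLASS itself,
# so looking up .BeautifulSoup on it fails with the error from the question.
try:
    BeautifulSoup.BeautifulSoup('<a href="/x">x</a>')
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

With the version 4 import used in the question, either call the class directly or use `import bs4` and write `bs4.BeautifulSoup(s)`.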