
I am trying to create a web page scraper and I want to use BeautifulSoup to do so. I installed BeautifulSoup 4.3.2, as the website said it was compatible with Python 3.x. I used

pip install beautifulsoup4

to install it. But when I run

from bs4 import BeautifulSoup
import requests

url = input("Enter a URL (start with www): ")
link = "http://" + url
data = requests.get(link).content
soup = BeautifulSoup(data)

for link in soup.find_all('a'):
    print(link.get('href'))

I get an error that says

Traceback (most recent call last):
  File "/Users/user/Desktop/project.py", line 1, in <module>
    from bs4 import BeautifulSoup
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/bs4/builder/__init__.py", line 308, in <module>
    from .. import _htmlparser
ImportError: cannot import name _htmlparser
kvorobiev
heyItsTy1992
  • Can you please try changing `data = link.text` to `data = requests.get(link).content` instead and try rerunning the code? The way I see it, you just straight up went to `link.text` without even invoking `Requests` to get the content of the URL. – WGS Mar 04 '14 at 21:47
  • I made the changes you stated above but still couldn't get past the BeautifulSoup import error. I understand it works in Python 2.7, but I'm required to use Python 3 and I don't understand why it's not importing properly. – heyItsTy1992 Mar 04 '14 at 21:52
  • See [this](http://bugs.python.org/issue14538) for a related discussion on this bug. Have you tried on the *latest* Python 3.x release? I use 2.7, where this issue is non-existent so I can't really comment. Will check with Python 3.x though... – WGS Mar 04 '14 at 21:56

3 Answers


I think there might be an error in the source file, specifically here:

  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/bs4/builder/__init__.py", line 308, in <module>
    from .. import _htmlparser

In my installation, line 308 of bs4/builder/__init__.py reads

  from . import _htmlparser

You could probably just fix it there and see whether bs4 then imports successfully. I'm not sure which version of bs4 you have installed, but mine is 4.3.2, and _htmlparser.py is also in bs4/builder.
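
To compare your installed copy against the line quoted above without importing bs4 (which would just re-raise the error), something like the following might help. It is only a sketch: the helper names `find_htmlparser_imports` and `locate_builder_init` are made up for illustration, and it assumes Python 3.4 or later, since it relies on `importlib.util.find_spec`.

```python
import importlib.util
import os
import re


def find_htmlparser_imports(source):
    """Return (line_number, text) pairs for relative imports of _htmlparser."""
    pattern = re.compile(r"^\s*from\s+\.+\s+import\s+.*\b_htmlparser\b")
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.match(line)
    ]


def locate_builder_init():
    """Return the path of bs4/builder/__init__.py, or None if it is not found."""
    # A top-level find_spec only locates the package; it does not import it.
    spec = importlib.util.find_spec("bs4")
    if spec is None or not spec.submodule_search_locations:
        return None
    package_dir = list(spec.submodule_search_locations)[0]
    path = os.path.join(package_dir, "builder", "__init__.py")
    return path if os.path.exists(path) else None


if __name__ == "__main__":
    path = locate_builder_init()
    if path is None:
        print("bs4 was not found for this interpreter")
    else:
        with open(path, encoding="utf-8") as handle:
            for number, line in find_htmlparser_imports(handle.read()):
                print(path, number, line)
```

If the printed line uses two dots rather than one, that matches the traceback in the question and is the line to edit.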

metatoaster

Just installed Python 3.x on my end and tested the latest download of BS4. Didn't work. However, a fix can be found here: https://github.com/il-vladislav/BeautifulSoup4 (credits to GitHub user Il Vladislav, whoever you are).

Download the ZIP, overwrite the bs4 folder inside your BeautifulSoup download, then reinstall it via python setup.py install. It now works on my end, as the screenshot below shows: the error appears first, and the script runs to completion after the fix.

Code:

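from bs4 import BeautifulSoup
import requests

# Build the full URL from user input such as "www.example.com".
url = input("Enter a URL (start with www): ")
link = "http://" + url

# Fetch the page and parse it; naming the parser avoids relying on a default.
data = requests.get(link).content
soup = BeautifulSoup(data, "html.parser")

# Print the href of every anchor tag (None if the tag has no href).
for anchor in soup.find_all('a'):
    print(anchor.get('href'))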

Screenshot: (image not available)

Relevant SO topic found here, showing that BS4 is not totally compatible with Python 3.x yet (even after 2 years).

WGS
  • I appreciate your help so much!! I'm just not seeing the zip file you've mentioned. I see bs4, .gitattributes and .gitignore; is it one of those? If you don't mind pointing me in the right direction I would appreciate it very much! Edit: Never mind, I found it :) – heyItsTy1992 Mar 04 '14 at 22:42
  • On the lower right side of that exact link there's a `Download ZIP` option. Should give you a `BeautifulSoup4-master` ZIP file. Extract it, go inside, and look for the `bs4` folder. Copy it and overwrite the `bs4` folder inside your original `BeautifulSoup4` download. Then reinstall `BeautifulSoup` via the command line. – WGS Mar 04 '14 at 22:45
  • Thank you! I got rid of the BS4 error. Now I've just got to install requests, I suppose, because it says no module named requests exists. I thought I installed it with pip though. – heyItsTy1992 Mar 04 '14 at 22:54
  • `pip` is one of the easier ways to install stuff, but it's not foolproof. Unless my installations require a load of dependencies (I'm looking at *you*, `matplotlib`), I install the good ol'-fashioned way. Also, yes, reinstall `Requests`. Make sure the correct version for `Python 3.x` is installed. :) Good luck! – WGS Mar 04 '14 at 22:56
  • And it works! I can't thank you enough! I really appreciate it. Have a good day! :) – heyItsTy1992 Mar 04 '14 at 23:01
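
For anyone hitting the same "no module requests" message after a pip install, one likely cause is that pip installed the package for a different interpreter than the one running the script. A rough way to check, sketched below, is to ask the running interpreter which modules it can actually find. The helper name `module_available` is made up for illustration, and the sketch assumes Python 3.4 or later for `importlib.util.find_spec`.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if this interpreter can find a top-level module `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is running, then what it can see.
    print("Interpreter:", sys.executable)
    print("Version:", sys.version.split()[0])
    for name in ("bs4", "requests"):
        status = "found" if module_available(name) else "missing"
        print(name, status)
```

If a module shows as missing, running pip through that same interpreter (for example `python3 -m pip install requests`) should put it where the script will look.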

I just edited bs4/builder/_htmlparser.py so that:

A) HTMLParseError is no longer imported

from html.parser import HTMLParser

B) The HTMLParseError class is defined locally

class HTMLParseError(Exception):
    """Exception raised for all parse errors."""

    def __init__(self, msg, position=(None, None)):
        assert msg
        self.msg = msg
        self.lineno = position[0]
        self.offset = position[1]

    def __str__(self):
        result = self.msg
        if self.lineno is not None:
            result = result + ", at line %d" % self.lineno
        if self.offset is not None:
            result = result + ", column %d" % (self.offset + 1)
        return result

This probably isn't the best fix, since HTMLParseError is never going to be raised. But! Your exception would just go uncaught and unhandled anyway.
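
A slightly gentler variant of the same edit, sketched here as an assumption rather than as what the answer actually did, is to keep the import where it still exists and only fall back to a local class when it does not. HTMLParseError was removed from html.parser in Python 3.5, so the fallback branch is the one that runs on current interpreters.

```python
from html.parser import HTMLParser

try:
    # Still importable on older Python 3 releases (removed in Python 3.5).
    from html.parser import HTMLParseError
except ImportError:
    class HTMLParseError(Exception):
        """Stand-in for the exception that html.parser no longer provides."""

        def __init__(self, msg, position=(None, None)):
            assert msg
            self.msg = msg
            self.lineno = position[0]
            self.offset = position[1]

        def __str__(self):
            result = self.msg
            if self.lineno is not None:
                result = result + ", at line %d" % self.lineno
            if self.offset is not None:
                result = result + ", column %d" % (self.offset + 1)
            return result


if __name__ == "__main__":
    # Either branch yields a usable exception class with the same message format.
    print(HTMLParseError("bad tag", (3, 4)))
```

This way the same file works on interpreters with and without the original class, and any `except HTMLParseError` clauses elsewhere in the module keep referring to a defined name.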

ACEnglish