
I am trying to use BeautifulSoup 4 to parse a series of web pages written in XHTML. I am assuming that for best results I should pair it with an XML parser, and to my knowledge the only one supported by BeautifulSoup is lxml.

However, when I try to run the following, as per the BeautifulSoup documentation:

import requests
from bs4 import BeautifulSoup

r = requests.get('hereiswhereiputmyurl')
soup = BeautifulSoup(r.content, 'xml')

it results in the following error:

FeatureNotFound: Couldn't find a tree builder with the features you    
requested: xml. Do you need to install a parser library?

It's driving me crazy. I have found records of two other users who posted the same problem:

here: How to re-install lxml?

and here: bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

I used this post to reinstall and update lxml, and I also updated Beautiful Soup, but I am still getting the error: Installing lxml, libxml2, libxslt on Windows 8.1

Beautiful Soup is otherwise working: when I run the following line, it gives me the usual wall of markup.

soup = BeautifulSoup(r.content, 'html.parser')

Here are my specs: Windows 8.1, Python 3.5.2. I run my code in the Spyder IDE from Anaconda 3 (which, admittedly, I do not know much about).
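For reference, here is a quick check that can be run inside Spyder to see which interpreter is in use and where lxml and bs4 are being imported from (just a diagnostic sketch; paths and versions will differ per machine):

# Diagnostic sketch: confirm which Python is running and which installs it sees.
import sys
print(sys.executable)  # the interpreter Spyder is actually using

try:
    from lxml import etree
    print("lxml", etree.__version__, "from", etree.__file__)
except ImportError as exc:
    print("lxml is not importable from this interpreter:", exc)

import bs4
print("beautifulsoup4", bs4.__version__, "from", bs4.__file__)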

I'm sure it's the kind of mistake a beginner would make, because as I stated before, I have very little programming experience.

How can I resolve this issue? Or, if it is a known bug, would you recommend that I just use lxml by itself to scrape the data?

– Kevin
  • What happens when you `import lxml`? – DeepSpace Jul 28 '16 at 06:24
  • How about `soup = BeautifulSoup(r.content, 'lxml')`? – har07 Jul 28 '16 at 06:38
  • If you know XPath and/or CSS I would use lxml over bs4, but your issue is most likely that you have installed lxml for one version of Python and you are using another. – Padraic Cunningham Jul 28 '16 at 07:13
  • Thanks for the input so far; I can address all points thus far. DeepSpace, when I import lxml by itself it imports fine with no errors. har07, I have tried that one as well, but I still get the same result as in my original problem. Padraic Cunningham, would there be a way I could check this? I installed Beautiful Soup from pip and then installed lxml using the method in the link in my post. – Kevin Jul 28 '16 at 08:25
  • The lxml file that I downloaded from the link above was lxml-3.6.1-cp35-cp35m-win_amd64.whl. It was the only one that worked, and I am assuming that the cp refers to Python 3.5, but I could be wrong. It's just frustrating because there are other posts on here about this and nobody has been able to find a solution yet. Does that mean no one has been able to parse XHTML with bs4 for over two years now? Any more help would be greatly appreciated. Thanks so far, guys! – Kevin Jul 29 '16 at 05:48

3 Answers


This is a pretty old post, but I had this problem today and found the solution. You need to have lxml installed. Open the terminal and type

pip3 install lxml

Now restart the dev environment (VS Code, Jupyter notebook or whatever) and it should work.
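As a quick sanity check (my own sketch, not part of the original answer), the following should run without raising FeatureNotFound once lxml is visible to the interpreter:

# Smoke test: the "xml" feature of BeautifulSoup is only available when lxml is installed.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<root><item>ok</item></root>", "xml")
print(soup.item.text)  # prints: ok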

– Eeshaan

I think the problem is r.content. Normally it gives the raw content of the response, which is not necessarily an HTML page; it can be JSON, etc.
Try feeding r.text to soup instead.

soup = BeautifulSoup(r.text, 'lxml')

Better:

r.encoding = 'utf-8'

then

page = r.text

soup = BeautifulSoup(page, 'lxml')

If you are going to parse XML, you can use 'lxml-xml' as the parser.
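Putting the pieces together, a sketch of the flow described above (the URL is a placeholder, not taken from the original question):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://example.com/page.xhtml")  # placeholder URL
r.encoding = 'utf-8'                       # decode the body as UTF-8 before reading r.text
soup = BeautifulSoup(r.text, 'lxml-xml')   # XML-mode parsing backed by lxml
print(soup.prettify()[:500])               # show the first part of the parsed tree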

– Kaan E.

Just import lxml and then use it as the parser. In 2021, even if you install lxml with pip, for some reason PyCharm still needs to install it again for every new program you write.
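A minimal sketch of what that looks like (my reading of the answer, not code taken from it):

import lxml  # imported explicitly so the IDE registers the dependency; bs4 itself only needs it installed
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.text)  # prints: hello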