158

I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)

soup.findAll('td',attrs={'class':'empformbody'})

In the above code I can use findAll to get tags and the information related to them, but I want to use XPath instead. Is it possible to use XPath with BeautifulSoup? If so, please provide example code.

Shiva Krishna Bavandla

10 Answers

224

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it'll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
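
For illustration, a minimal sketch of that compatibility mode (the lxml.html.soupparser module, which requires BeautifulSoup to be installed):

from lxml.html import soupparser

# soupparser hands the actual parsing to BeautifulSoup, then builds an
# lxml tree, so .xpath() works on the result.
root = soupparser.fromstring('<p>Some <b>broken<html>markup')
print(root.xpath('//p//text()'))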

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)  # substitute your own XPath expression for xpathselector

There is also a dedicated lxml.html module with additional functionality.
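
A short sketch of some of those conveniences, assuming the same example URL (make_links_absolute() and text_content() are part of lxml.html's element API):

import lxml.html

# lxml.html.parse() accepts a URL, file object or filename.
doc = lxml.html.parse('http://www.example.com/servlet/av/ResultTemplate=AVResult.html')
root = doc.getroot()
root.make_links_absolute()    # resolve relative hrefs against the base URL
for td in root.xpath('//td[@class="empformbody"]'):
    print(td.text_content())  # cell text with markup stripped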

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
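
From there, the question's original lookup would be, roughly:

# XPath equivalent of the question's soup.findAll('td', attrs={'class': 'empformbody'})
cells = tree.xpath('//td[@class="empformbody"]')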

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text_content())

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())
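
CSS selects elements, not attributes, so attribute values are read off the matched tags afterwards; a small sketch (the selector is illustrative):

for a in soup.select('td.empformbody a[href]'):
    print(a['href'])  # read the attribute from the matched Tag
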
Martijn Pieters
  • Thanks very much Pieters, I got two pieces of information from your code: 1. confirmation that we can't use XPath with BS, and 2. a nice example of how to use lxml. Is it stated in a particular piece of documentation that "we can't implement XPath using BS"? We should show some proof to those who ask for clarification, right? – Shiva Krishna Bavandla Jul 13 '12 at 08:01
  • It's hard to prove a negative; the [BeautifulSoup 4 documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) has a search function and there are no hits for 'xpath'. – Martijn Pieters Jul 13 '12 at 08:06
  • I tried running your code above but got an error "name 'xpathselector' is not defined" – Zvi Jan 11 '21 at 14:29
  • @Zvi the code doesn’t define an Xpath selector; I meant it to be read as “use your own XPath expression *here*”. – Martijn Pieters Jan 12 '21 at 23:55
196

I can confirm that there is no XPath support within Beautiful Soup.

Leonard Richardson
62

As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get at elements from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.

wordsforthewise
  • One warning: I've noticed that if there is something outside the root (like a \n outside the outer tags), then referencing XPaths by the root will not work; you have to use relative XPaths. https://lxml.de/xpathxslt.html – wordsforthewise Sep 06 '18 at 13:38
  • *Martijn's code no longer works properly (it is 4+ years old by now...), the etree.parse() line prints to the console and doesn't assign the value to the tree variable.* That's quite a claim. I certainly can't reproduce that, and it would *not make any sense*. Are you sure you are using Python 2 to test my code with, or have translated the `urllib2` library use to Python 3 `urllib.request`? – Martijn Pieters Sep 29 '19 at 14:47
  • Yeah, it may be that I used Python 3 when writing that and it didn't work as expected. Just tested: yours works with Python 2, but Python 3 is much preferred, as 2 is being sunset (no longer officially supported) in 2020. – wordsforthewise Sep 30 '19 at 18:59
  • absolutely agree, but the question here *uses Python 2*. – Martijn Pieters Sep 30 '19 at 20:22
24

BeautifulSoup has a method named findNext that searches the elements that follow the current one in the document, so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a') 

The code above roughly imitates this XPath (findNext searches forward in document order, so this is an approximation rather than an exact equivalent):

//div[@class="class_value"]//div[@id="id_value"]//a
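
A self-contained sketch of that chaining, using made-up HTML (the class and id values are placeholders):

from bs4 import BeautifulSoup

# Hypothetical markup matching the pattern above.
html = '''<div class="class_value">
  <div id="id_value"><a href="/first">one</a><a href="/second">two</a></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
father = soup.find('div', {'class': 'class_value'})
links = father.findNext('div', {'id': 'id_value'}).findAll('a')
print([a['href'] for a in links])  # ['/first', '/second']
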
user3820561
22
from lxml import etree
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))

The above combines a Soup object with lxml, after which you can extract values using XPath.
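
Applied to the question's original td lookup, a rough sketch (the_page is the HTML fetched in the question's code):

from lxml import etree
from bs4 import BeautifulSoup

# the_page holds the HTML fetched in the question
soup = BeautifulSoup(the_page, 'html.parser')
dom = etree.HTML(str(soup))
cells = dom.xpath('//td[@class="empformbody"]')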

4

When you use lxml, it's all simple:

import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')

But when you use BeautifulSoup (BS4), it's all simple too:

  • first, remove "//" and "@"
  • second, add a star before "="

try this magic:

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')

As you can see, CSS cannot select an attribute itself, so I removed the "/@href" part.
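
To recover the attribute values anyway, read them off the matched tags in Python; a sketch:

hrefs = [a['href'] for a in soup.select('a[class*="shared-components"]')]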

  • `select()` is for CSS selectors, it's not XPath at all. _as you see, this does not support sub-tag_ While I'm not sure if that was true at the time, it certainly isn't now. – AMC May 11 '20 at 18:41
3

I've searched through their docs and it seems there is no XPath option.

Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.

Nikola
  • Yes, actually until now I have used Scrapy, which uses XPath to fetch the data inside tags. It's very handy and makes fetching data easy, but I now need to do the same with BeautifulSoup, so I'm looking into it. – Shiva Krishna Bavandla Jul 13 '12 at 07:46
1

Maybe you can try the following, without XPath:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<html>
<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
'''
# What XPath can do, so can it
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print(doc.body.div.h1.text)
print(doc.div.h1.text)
print(doc.h1.text)  # shorter paths are faster
print(doc.div.getChildren())
print(doc.div.getChildren('p'))
dabingsou
-1

This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.

Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content into a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.

from bs4 import BeautifulSoup

# rss_text holds the feed's text content, fetched earlier with requests
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
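
Dotted access matches only the first tag of each name; the same lookup spelled out with find() would be (a sketch):

title = rss_obj.find('rss').find('channel').find('title').get_text()
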
David A
-5

Use soup.find(class_='myclass').
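
For completeness, a minimal sketch of class-based lookups (myclass is a placeholder):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="myclass">hit</div><div>miss</div>', 'html.parser')
print(soup.find(class_='myclass'))      # first matching tag
print(soup.find_all(class_='myclass'))  # all matching tags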