
Here is an example web page I am trying to get data from. http://www.makospearguns.com/product-p/mcffgb.htm

The XPath was taken from Chrome developer tools, and FirePath in Firefox is also able to find it, but using lxml it just returns an empty list for 'text'.

from lxml import html
import requests

site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

page = requests.get(site_url)
tree = html.fromstring(page.text) 
text = tree.xpath(xpath)

Printing out the tree text with

print(tree.text_content().encode('utf-8'))

shows that the data is there, but it seems the XPath isn't finding it. Is there something I am missing? Most other sites I have tried work fine using lxml and the XPath taken from Chrome dev tools, but a few I have found give empty lists.

bltpyro
  • Please take care to include the few imports needed to run your code. – Jan Vlcinsky May 27 '14 at 23:09
  • the `tbody` that your browser dev tools show you is *implicit*, it exists in the DOM but not in the actual page source. see http://stackoverflow.com/questions/938083/why-do-browsers-insert-tbody-element-into-table-elements – roippi May 27 '14 at 23:49
  • @Jan You are right, I should have added imports. Especially with python since you can say import x as whatever. Done – bltpyro May 28 '14 at 01:50

3 Answers


1. Browsers frequently change the HTML

Browsers quite frequently change the HTML served to them to make it "valid". For example, if you serve a browser this invalid HTML:

<table>
  <p>bad paragraph</p>
  <tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>

To render it, the browser helpfully tries to make it valid HTML and may convert it to:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

The above is changed because <p>aragraphs cannot be inside <table>s, and <tbody>s are recommended. Exactly which changes are applied to the source can vary wildly by browser: some will put invalid elements before tables, some after, some inside cells, etc...
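As a side-by-side illustration, here is a minimal sketch of feeding the same invalid markup to lxml. The exact normalization is parser-dependent and will generally differ from what a browser's DOM inspector shows (notably, lxml does not insert a tbody):

from lxml import etree, html

# Parse the invalid fragment and print lxml's normalized view of it.
broken = '<table><p>bad paragraph</p><tr><td>Unclosed cells and rows</table>'
fragment = html.fromstring(broken)
print(etree.tostring(fragment, pretty_print=True).decode())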

2. XPaths aren't fixed; they are flexible in pointing to elements.

Using this 'fixed' HTML:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

If we try to target the text of the <td> cell, all of the following will give you approximately the right information:

//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()

And the list goes on...
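As a quick check with lxml (a sketch on the 'fixed' HTML above; only relative paths are shown, because lxml parses the fragment into a full html/body document, so absolute paths like /table/... would start from a different root):

from lxml import html

fixed = '''
<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>
'''
tree = html.fromstring(fixed)

# Several different relative XPaths all reach the same cell text.
for xp in ['//td/text()', '//tr/td/text()', '//tbody/tr/td/text()', '//table//td/text()']:
    print(xp, '->', tree.xpath(xp))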

However, in general a browser will give you the most precise (and least flexible) XPath, one that lists every element from the DOM. In this case:

/table[1]/tbody[1]/tr[1]/td[1]/text()

3. Conclusion: Browser-given XPaths are usually unhelpful

This is why the XPaths produced by browser developer tools will frequently fail when used against the raw HTML.

The solution: always refer to the raw HTML and use a flexible, but precise, XPath.

Examine the actual HTML that holds the price:

<table border="0" cellspacing="0" cellpadding="0">
    <tr>
        <td>
            <font class="pricecolor colors_productprice">
                <div class="product_productprice">
                    <b>
                        <font class="text colors_text">Price:</font>
                        <span itemprop="price">$149.95</span>
                    </b>
                </div>
            </font>
            <br/>
            <input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
        </td>
    </tr>
</table>

If you want the price, there is actually only one place to look!

//span[@itemprop="price"]/text()

And this will return:

$149.95
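Putting it together against the raw HTML fetched with requests (a sketch of the approach above; the page markup may of course have changed since this was written):

from lxml import html
import requests

site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
page = requests.get(site_url)
tree = html.fromstring(page.text)

# Target the attribute that uniquely identifies the price, rather than the
# long browser-generated path.
price = tree.xpath('//span[@itemprop="price"]/text()')
print(price)  # ['$149.95'] at the time of writing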
  • Thanks. The reason I wanted to use the developer tools was for more of a foolproof way to get the path, ie a simple copy paste without someone needing to know anything about xpaths to come up with a 'good' xpath. I guess if I want something more foolproof I would need to write my own web plugin. – bltpyro May 28 '14 at 17:39
  • Are there not existing tools which will work on *unmodified* HTML/XML ? – carl crott Aug 29 '16 at 16:15

The XPath is simply wrong

Here is snippet from the page:

<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
      <img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
      <table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
        <tr>
          <td colspan="2" class="vCSS_breadcrumb_td"><b>
&nbsp; 
<a href="http://www.makospearguns.com/">Home</a> > 

You can see that the element with id "v65-product-parent" is of type `table` and has a subelement `tr`.

There can be only one element with such an id (otherwise the markup would be invalid).

The XPath expects a tbody as a child of the given element (the table), and there is no tbody anywhere in the page.

This can be tested by

>>> "tbody" in page.text
False

How did Chrome come up with that XPath?

If you simply download this page by

$ wget http://www.makospearguns.com/product-p/mcffgb.htm

and review its content, you will find that it does not contain a single element named tbody.
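A quick check, assuming wget saved the page as mcffgb.htm, confirms this:

$ grep -ci tbody mcffgb.htm
0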

But if you use Chrome Developer Tools, you find some.

How does it get there?

This often happens when JavaScript comes into play and generates some page content in the browser. But as LegoStormtroopr noted, that is not the case here; this time it is the browser itself that modifies the document to make it valid.

How to get the content of a page that is dynamically modified within the browser?

You have to let some sort of browser do the work. E.g., if you use Selenium, you will get it.

byselenium.py

from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

# Let a real browser load the page, then take the DOM it has built
# (which does contain the tbody elements).
browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source
print "test tbody", "tbody" in html_source

# The browser-generated XPath now matches against the browser's DOM.
tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print text

# Close the browser when done.
browser.quit()

which prints:

$ python byselenium.py
test tbody True
['$149.95']

Conclusions

Selenium is great when it comes to changes made within the browser. However, it is a rather heavy tool, and if you can do it a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution that works on the simply fetched web page.

Jan Vlcinsky
  • I just now went to the page and inspected it. When I right click on the span with the price and select "Copy XPath", this is exactly what it gives me. And when I plug that copied xpath into firepath, it shows me the correct part of the page. If the path is simply wrong than why did that work? – bltpyro May 28 '14 at 01:24
  • -1 because this "It gets generated dynamically by JavaScript after it is loaded into browser" is *wrong*. –  May 28 '14 at 05:57
  • @LegoStormtroopr Thanks Lego. I have learned something more today. – Jan Vlcinsky May 28 '14 at 06:08
  • @JanVlcinsky Check my answer. The page is altered by the browser to massage it into the DOM, *before* any Javascript is called. –  May 28 '14 at 06:09
  • @LegoStormtroopr Corrected my answer (feel free to make final touch, if you like). – Jan Vlcinsky May 28 '14 at 06:15
  • Thanks for explaining the differences. Also, thanks for the hint on using selenium. Since I want to use a generated xpath that doesn't require any user thought, this may be the better way for me to go about it. – bltpyro May 28 '14 at 17:47

I had a similar issue (Chrome inserting tbody elements when you do Copy as XPath). As others answered, you have to look at the actual page source, though the browser-given XPath is a good place to start. I've found that often, removing tbody tags fixes it, and to test this I wrote a small Python utility script to test XPaths:

#!/usr/bin/env python
import sys, requests
from lxml import html

if len(sys.argv) < 3:
    print 'Usage: ' + sys.argv[0] + ' url xpath'
    sys.exit(1)
else:
    url = sys.argv[1]
    xp = sys.argv[2]

page = requests.get(url)
tree = html.fromstring(page.text)
nodes = tree.xpath(xp)

if len(nodes) == 0:
    print 'XPath did not match any nodes'
else:
    # tree.xpath(xp) produces a list, so always just take the first item
    print (nodes[0]).text_content().encode('ascii', 'ignore')

(that's Python 2.7, in case the non-function "print" didn't give it away)
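For example, assuming the script is saved as xpath_test.py (the name is hypothetical), you could test the price element from the question. Note that the XPath should select an element rather than text(), since the script calls text_content() on the match:

$ python xpath_test.py 'http://www.makospearguns.com/product-p/mcffgb.htm' '//span[@itemprop="price"]'
$149.95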

Chirael