Parsing HTML with lxml: how do I specify a 1-3 digit wildcard to make my code less brittle?

Question

I am trying to scrape the "sector" and "industry" fields from Yahoo Finance using lxml.

I've noticed that the href URL is consistently http://biz.yahoo.com/ic/xyz.html, where xyz is a number.

Could you please suggest ways to include a wildcard of 1 to 3 digits? I have tried several methods based on Google and Stack Overflow searches, but nothing has worked.

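import lxml.html

url = 'http://finance.yahoo.com/q?s=AAPL'
root = lxml.html.parse(url).getroot()
# <WILDCARD> marks where a 1-3 digit number should be matched.
# It is a placeholder only: a plain string comparison cannot express it.
for a in root.xpath('//a[@href="http://biz.yahoo.com/ic/<WILDCARD>.html"]'):
    print(a.text)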

score 5 · Accepted Answer · edited Jul 30 '12 at 04:13

Pure XPath 1.0 solution (no extension functions):

//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
  and
    substring(@href, string-length(@href)-4) = '.html'
  and
    string-length
      (substring-before
          (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
           '.')
      ) = 3
  and
    translate(substring-before
               (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
                '.'),
              '0123456789',
              ''
              )
     = ''
   ]

This XPath expression can be "read in English" like this:

Select every a element in the document whose href attribute starts with the string "http://biz.yahoo.com/ic/" and ends with the string ".html", where the substring between that prefix and the next "." has a length of 3 and consists only of digits.
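The expression above accepts exactly three digits, while the question asks for one to three. A minimal sketch of a relaxed variant, run from Python with lxml, might look like the following. The sample markup, the link texts, and the names PREFIX, SAMPLE and EXPR are made up for illustration; only the length test differs from the answer's expression, and the prefix is passed in as an XPath variable to avoid repeating it.

```python
import lxml.html

PREFIX = 'http://biz.yahoo.com/ic/'

# Hypothetical markup standing in for the real page.
SAMPLE = '''
<html><body>
  <a href="http://biz.yahoo.com/ic/7.html">One digit</a>
  <a href="http://biz.yahoo.com/ic/42.html">Two digits</a>
  <a href="http://biz.yahoo.com/ic/123.html">Three digits</a>
  <a href="http://biz.yahoo.com/ic/1234.html">Four digits</a>
  <a href="http://biz.yahoo.com/ic/x23.html">Not numeric</a>
  <a href="http://biz.yahoo.com/ic/.html">Empty</a>
</body></html>
'''

# $prefix is an XPath variable; lxml fills it from the keyword argument below.
# The "= 3" test of the original is replaced by a 1..3 range check.
EXPR = '''//a[starts-with(@href, $prefix)
  and substring(@href, string-length(@href) - 4) = '.html'
  and string-length(substring-before(substring-after(@href, $prefix), '.')) >= 1
  and string-length(substring-before(substring-after(@href, $prefix), '.')) <= 3
  and translate(substring-before(substring-after(@href, $prefix), '.'),
                '0123456789', '') = '']'''

root = lxml.html.fromstring(SAMPLE)
matched = [a.text for a in root.xpath(EXPR, prefix=PREFIX)]
print(matched)  # ['One digit', 'Two digits', 'Three digits']
```

The lower bound matters: without it, an href of "http://biz.yahoo.com/ic/.html" would pass, because translating an empty string also yields an empty string.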

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
      and
        substring(@href, string-length(@href)-4) = '.html'
      and
        string-length
          (substring-before
              (substring-after(@href, 'http://biz.yahoo.com/ic/'),
               '.')
          ) = 3
      and
        translate(substring-before
                   (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                    '.'),
                  '0123456789',
                  ''
                  )
         = ''
       ]
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
  <body>
    <a href="http://biz.yahoo.com/ic/123.html">Link1</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/621.html">Link2</a>
  </body>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<a href="http://biz.yahoo.com/ic/123.html">Link1</a>
<a href="http://biz.yahoo.com/ic/621.html">Link2</a>

As we see, only the correct, wanted a elements have been selected.

Thank you very much for this textbook style answer. Very informative! — snakesNbronies, Apr 28 '12 at 06:14

score 1 · Answer 2 · edited May 23 '17 at 11:55

root.xpath(r'''//a[re:match(@href, "http://biz\.yahoo\.com/ic/[0-9]{1,3}\.html")]''',
           namespaces={'re': 'http://exslt.org/regular-expressions'})

The XPath expression matches all a tags for which the regular expression matches. re:match will return true if the href attribute starts with http://biz.yahoo.com/ic/, continues with 1 to 3 digits ([0-9]{1,3}) and ends with .html.

I used \. because . would match any character, but by putting a backslash in front of it, it's treated like a plain dot.
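One possible refinement, offered as a sketch rather than a correction: the pattern above has no anchors, so whether an href that merely contains such a URL also passes depends on how the regex engine applies it. Adding ^ and $ removes that ambiguity. The example uses re:test, the EXSLT function that returns a boolean, on made-up sample markup; SAMPLE, ANCHORED and the link texts are hypothetical.

```python
import lxml.html

EXSLT_RE = {'re': 'http://exslt.org/regular-expressions'}

# Hypothetical markup standing in for the real page.
SAMPLE = '''
<html><body>
  <a href="http://biz.yahoo.com/ic/5.html">Sector</a>
  <a href="http://biz.yahoo.com/ic/123.html">Industry</a>
  <a href="http://biz.yahoo.com/ic/1234.html">Too long</a>
  <a href="/redirect?to=http://biz.yahoo.com/ic/123.html">Wrapped</a>
  <a href="http://biz.yahoo.com/ic/123.html?x=1">Trailing query</a>
</body></html>
'''

# ^ and $ force the whole attribute value to match the pattern.
ANCHORED = r'''//a[re:test(@href, "^http://biz\.yahoo\.com/ic/[0-9]{1,3}\.html$")]'''

root = lxml.html.fromstring(SAMPLE)
texts = [a.text for a in root.xpath(ANCHORED, namespaces=EXSLT_RE)]
print(texts)  # ['Sector', 'Industry']
```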

r'''...''' is a raw, triple-quoted string: Python leaves each backslash as a literal character instead of treating it as the start of an escape sequence, and because the delimiters are ''', the string can contain both ' and " without escaping.
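A small illustration of what the raw prefix changes, using throwaway values that are not part of the original answer:

```python
# A raw literal keeps backslashes as ordinary characters, so these two
# spellings produce the same string.
raw = r'biz\.yahoo\.com'
escaped = 'biz\\.yahoo\\.com'
same = (raw == escaped)

# '\n' is a single newline character; r'\n' is a backslash followed by 'n'.
lengths = (len('\n'), len(r'\n'))

# Triple quotes allow both quote characters inside the literal.
xpath = r'''//a[@class='x' and @title="y\d"]'''
has_both_quotes = "'" in xpath and '"' in xpath

print(same, lengths, has_both_quotes)  # True (1, 2) True
```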

Credit goes to another answer from Stack Overflow.

Thank you very much as well! Could you also explain this as I am quite new to Python and programming in general - as in, I have only been programming for 2 months. — snakesNbronies, Apr 28 '12 at 06:19
