How to do a Python XPath case-insensitive search using lxml?

Question

I am trying to match either country or Country using the lower-case function in XPath. translate is kind of messy, so I would rather use lower-case. I believe my Python version (2.6.6) has XPath 2.0 support, which matters because lower-case is only available in XPath 2.0.

What I am looking for is how to put lower-case to use in my case. I hope the example is self-explanatory. I am looking for ['USA', 'UK'] as output (both countries in one go, which can happen if lower-case evaluates Country and country to be the same).

HTML: doc.htm

<html>
    <table>
        <tr>
            <td>
                Name of the Country : <span> USA </span>
            </td>
        </tr>
        <tr>
            <td>
                Name of the country : <span> UK </span>
            </td>
        </tr>
    </table>
</html>

Python :

import lxml.html as lh

doc = open('doc.htm', 'r')
out = lh.parse(doc)
doc.close()

print out.xpath('//table/tr/td[text()[contains(. , "Country")]]/span/text()')
# Prints : [' USA ']
print out.xpath('//table/tr/td[text()[contains(. , "country")]]/span/text()')
# Prints : [' UK ']

print out.xpath('//table/tr/td[lower-case(text())[contains(. , "country")]]/span/text()')
# Prints : [<Element td at 0x15db2710>]

Update :

out.xpath('//table/tr/td[text()[contains(translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") , "country")]]/span/text()')

Now the question remains: can I store the translate part in a global variable 'handlecase' and reference that variable whenever I run an XPath query?

Something like this works :

handlecase = """translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")"""

out.xpath('//table/tr/td[text()[contains(%s , "country")]]/span/text()' % (handlecase))

But for the sake of simplicity and readability, I want to run it like this :

out.xpath('//table/tr/td[text()[contains(handlecase , "country")]]/span/text()')

From [the lxml XPath documentation](http://lxml.de/xpathxslt.html): `lxml supports XPath 1.0`; thus, with lxml you are stuck with translate. — Martijn Pieters, Jun 27 '12 at 14:44
In that case, I am not sure why it isn't complaining when I use lower-case. I didn't have much luck with 'translate' either in this example scenario. Thank you! — ThinkCode, Jun 27 '12 at 14:46
[Possible duplicate](http://stackoverflow.com/questions/9804281/selectnodes-with-xpath-ignoring-cases/9805020#9805020) — JWiley, Jun 27 '12 at 16:02
Thanks for the link. This is more of a 'lower-case' discussion than translate. I actually got the translate to work by doing : out.xpath('//table/tr/td[lower-case(text())[contains( translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz") , "country")]]/span/text()') . Mods may close this if lower-case can't be applied in this case. Thank you! — ThinkCode, Jun 27 '12 at 16:17
But lxml DOES complain if you use `lower-case()`: "lxml.etree.XPathEvalError: Unregistered function". The code after "*I actually got the translate to work by doing..."* can't be right. — mzjn, Jun 27 '12 at 17:17
It works! handlecase = 'translate(., "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")' out.xpath('//table/tr/td[text()[contains(%s , "country")]]/span/text()' % (handlecase)) — ThinkCode, Jun 27 '12 at 17:22

score 5 · Answer 1 · answered Jun 27 '12 at 18:23

I believe the easiest way to get what you want is to write an XPath extension function.

By doing this, you could write either a lower-case() function or a case-insensitive search function.

You can find the details here: http://lxml.de/extensions.html
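
As a rough illustration of that idea, here is a minimal sketch of a lower-case() extension function registered through lxml's FunctionNamespace. The helper name lower_case and the inline HTML string are assumptions made for this example (the HTML mirrors the question's doc.htm). It is written for Python 3 and only handles string or text-node arguments, so treat it as a starting point rather than a complete XPath 2.0 lower-case().

```python
import lxml.html as lh
from lxml import etree


def lower_case(context, value):
    """Hypothetical helper: return the argument's string value in lowercase."""
    # lxml passes node-set arguments as Python lists; text nodes arrive as
    # string-like objects. Use the first item, or "" for an empty node-set.
    if isinstance(value, list):
        value = value[0] if value else ""
    return str(value).lower()


# Register the function in the default (unprefixed) function namespace so
# that XPath expressions can call it as lower-case(...).
ns = etree.FunctionNamespace(None)
ns["lower-case"] = lower_case

# Inline copy of the question's document (an assumption for this sketch).
HTML = """<html>
    <table>
        <tr>
            <td>
                Name of the Country : <span> USA </span>
            </td>
        </tr>
        <tr>
            <td>
                Name of the country : <span> UK </span>
            </td>
        </tr>
    </table>
</html>"""

doc = lh.fromstring(HTML)
result = doc.xpath('//td[text()[contains(lower-case(.), "country")]]/span/text()')
print([s.strip() for s in result])
```

Registering in the default namespace affects every XPath evaluation in the process, so a prefixed namespace may be preferable in larger programs.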

stranac

very nice answer, but you can't win without an example – mykhal Jul 25 '12 at 16:34
I wasn't trying to win, just to help. I thought of giving an example, but it just seemed to me that the link has enough examples. – stranac Jul 26 '12 at 02:19

Dimitre Novatchev · Accepted Answer · 2012-06-28T04:18:24.653

Use:

   //td[translate(substring(text()[1], string-length(text()[1]) - 9),
                  'COUNTRY :',
                  'country'
                  )
        =
         'country'
       ]
        /span/text()

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "//td[translate(substring(text()[1], string-length(text()[1]) - 9),
                  'COUNTRY :',
                  'country'
                  )
        =
         'country'
       ]
        /span/text()
       "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<html>
        <table>
            <tr>
                <td>
                    Name of the Country : <span> USA </span>
                </td>
            </tr>
            <tr>
                <td>
                    Name of the country : <span> UK </span>
                </td>
            </tr>
        </table>
</html>

the XPath expression is evaluated and the two selected text nodes are copied to the output:

 USA  UK
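
The same expression can also be checked from Python with lxml, which is what the question uses. This is only a sketch: the inline HTML string and the variable names are assumptions for the example, and the code targets Python 3.

```python
import lxml.html as lh

# Inline copy of the question's document (an assumption for this sketch).
HTML = """<html>
    <table>
        <tr>
            <td>
                Name of the Country : <span> USA </span>
            </td>
        </tr>
        <tr>
            <td>
                Name of the country : <span> UK </span>
            </td>
        </tr>
    </table>
</html>"""

# The XPath 1.0 expression from this answer, split across string literals.
XPATH = (
    "//td[translate(substring(text()[1], string-length(text()[1]) - 9),"
    " 'COUNTRY :', 'country') = 'country']/span/text()"
)

doc = lh.fromstring(HTML)
matches = doc.xpath(XPATH)

# Strip the padding inside each <span> before printing.
print([m.strip() for m in matches])
```

Note that the expression depends on the td text ending in exactly "Country : " or "country : " (10 characters including the trailing space), so different spacing in the markup would need a different offset.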

Explanation:

1. We use a specific variant of the XPath 1.0 expression that implements the XPath 2.0 standard function ends-with($text, $s), which is:

   .....

   $s = substring($text, string-length($text) - string-length($s) + 1)

2. The next step is to use the translate() function to convert the trailing 10-character string to lowercase, removing any spaces and any ":" characters.

3. If the result is the all-lowercase string "country", then we select the child text nodes (only one in this case) of the span child of this td.
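
The general ends-with() emulation described in the explanation can be tried out from Python as follows. This is a minimal sketch: ends_with is a hypothetical helper that only builds the XPath 1.0 test as a string, and the one-element sample document is made up for the example.

```python
from lxml import etree


def ends_with(text_expr, suffix_expr):
    """Hypothetical helper: build an XPath 1.0 test that emulates ends-with()."""
    # Compare the suffix with the tail of the text that has the same length.
    return (
        "{s} = substring({t}, string-length({t}) - string-length({s}) + 1)"
    ).format(t=text_expr, s=suffix_expr)


# Made-up sample element used only to exercise the expression.
root = etree.fromstring("<td>Name of the country</td>")

# The comparison is case-sensitive, which is why the answer adds translate().
matches_lower = root.xpath(ends_with("string(.)", "'country'"))
matches_upper = root.xpath(ends_with("string(.)", "'Country'"))
print(matches_lower, matches_upper)
```

Both arguments are XPath expressions rather than Python values, so a literal suffix needs its own quotes, as in "'country'".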
