6

I have a tree parsed with lxml's etree.HTMLParser, and I'm building XPaths against it to assert that an element exists, along with its attributes and its text. I ran into a problem when the text of the tag contains single quotes (') or double quotes ("), and I've exhausted all my options.

Here's a sample object I created

from io import StringIO
from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse(StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''), parser)

Here is the snippet of code that builds the XPath; I've tried different variations of the variable being read in:

   def getXpath(self):
     xpath += 'starts-with(., \'' + self.text + '\') and '
     xpath += ('count(@*)=' + str(attrsCount) if self.exactMatch else "1=1") + ']'

self.text is basically the expected text of the tag, in this case: Here is my 'test' "string"

This fails when I call the xpath method on the parsed tree:

tree.xpath(self.getXpath())

The reason is that the resulting XPath is /html/body/p[starts-with(., 'Here is my 'test' "string"') and 1=1], where the quotes inside the text end the string literal early.

How can I properly escape the single and double quotes in the self.text variable? I've tried triple quoting, wrapping self.text in repr(), and using re.sub or string.replace to escape ' and " as \' and \".

Bob Evans

3 Answers

1

This solution applies if you are using Python's lxml. It's better to leave the escaping to lxml, which we can do with XPath variables. Suppose we have an XPath like the one below:

//tagname[text()='some_text']

If some_text contains both single and double quotes, the expression fails with an "Invalid predicate" error. Neither escaping nor triple quotes worked for me, because XPath won't accept triple quotes.
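For context, XPath 1.0 string literals have no escape character, so a literal delimited by ' cannot contain ' and one delimited by " cannot contain ". When only a plain expression string can be used, one common workaround is to assemble the value with concat(). The sketch below assumes a hypothetical helper named xpath_literal; it is not part of lxml.

```python
def xpath_literal(text):
    """Return an XPath 1.0 expression that evaluates to `text`.

    Hypothetical helper for illustration; not an lxml function.
    """
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # Both quote kinds are present: split on ' and glue the pieces back
    # together with a double-quoted "'" literal inside concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


text = "Here is my 'test' \"string\""
expr = "/html/body/p[starts-with(., " + xpath_literal(text) + ")]"
print(expr)
```

The resulting expression can be passed to an XPath evaluator as an ordinary string, at the cost of being harder to read than a variable.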

The solution that worked for me is XPath variables.

We convert the XPath as below:

//tagname[text() = $var]

Then execute

find = etree.XPath(xpath)

Then bind the variable to its value when calling it (lxml takes XPath variables as keyword arguments):

elements = find(root, var=text)
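Put together with the sample document from the question, the approach might look like the sketch below. The variable name $text and the names expected and matches are chosen here for illustration only.

```python
from io import StringIO
from lxml import etree

html = '''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''
tree = etree.parse(StringIO(html), etree.HTMLParser())

# The expected text contains both quote kinds; it is never spliced
# into the XPath string itself.
expected = '''Here is my 'test' "string"'''

# $text is an XPath variable; lxml binds it from the keyword argument.
find = etree.XPath("/html/body/p[starts-with(., $text)]")
matches = find(tree, text=expected)

# The same variable binding also works with the xpath() method.
same = tree.xpath("/html/body/p[starts-with(., $text)]", text=expected)

print(len(matches), len(same))
```

Because the value travels as a variable rather than as part of the expression, no quoting or escaping of self.text is needed.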
Patryk Brejdak
Hemanth Sharma
1

There are more options to choose from; the """ and ''' forms especially might be what you want.

s = "a string with a single ' quote"
s = 'a string with a double " quote'
s = """a string with a single ' and a double " quote"""
s = '''another string with those " quotes '.'''
s = r"raw strings let \ stay \ as written"
s = r'''and can be added \ to " any ' of """ those things'''
s = """The three-quote-forms
       may contain
       newlines."""
towi
1

According to Wikipedia and W3Schools, you should not have ' and " in node content, even though only < and & are said to be strictly illegal. They should be replaced by the corresponding "predefined entity references", which are &apos; and &quot;.

By the way, the Python parsers I use will take care of this transparently: when writing, they are replaced; when reading, they are converted.
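The reading half of that can be checked with a short sketch. It uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made here so the example has no dependencies); lxml's etree.fromstring decodes these entities the same way.

```python
import xml.etree.ElementTree as ET

# The source uses the predefined entity references for both quote kinds.
source = "<p>Here is my &apos;test&apos; &quot;string&quot;</p>"
p = ET.fromstring(source)

# After parsing, the text holds the literal quote characters.
text = p.text
print(text)
```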

After a second reading of your question, I tested a few things with ' and " in the Python interpreter, and it will escape everything for you!

>>> 'text {0}'.format('blabla "some" bla')
'text blabla "some" bla'
>>> 'ntsnts {0}'.format("ontsi'tns")
"ntsnts ontsi'tns"
>>> 'ntsnts {0}'.format("ontsi'tn' \"ntsis")
'ntsnts ontsi\'tn\' "ntsis'

So we can see that Python escapes things correctly. Could you then copy-paste the error message you get (if any)?
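One thing worth checking when reading the session above: the backslashes in the last result are part of how the interpreter displays a string (its repr), and they are not stored in the string itself. A small sketch, with the names s and shown chosen for illustration:

```python
s = 'ntsnts {0}'.format("ontsi'tn' \"ntsis")

# repr() is what the interactive prompt echoes; it adds the backslashes.
shown = repr(s)

# The string itself contains only the bare quote characters.
print(s)
print(shown)
```

So a string passed on to another library, such as an XPath expression handed to lxml, arrives with its quotes unescaped.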

Joël
  • I see, the error I'm getting is from lxml: XPathEvalError: Invalid expression, stack trace is File "lxml.etree.pyx", line 2029, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:45934) File "xpath.pxi", line 379, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:114389) File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:113063) File "xpath.pxi", line 228, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:112935) – Bob Evans Oct 18 '11 at 14:15
  • mmh, error is raised by `lxml`, because expression is said to be invalid. Could you please paste the value of `xpath`, when rendered by `print`? – Joël Oct 18 '11 at 14:19
  • escaping the ' and " with their corresponding HTML entities did the trick. I was really tired last night and wasn't thinking that the string was actually HTML being parsed. Thanks for providing this guidance – Bob Evans Oct 18 '11 at 14:22
  • Great, so that's what I thought: `lxml` is not really happy when these characters are used directly in content. You're welcome - please do not forget to accept the answer! – Joël Oct 18 '11 at 14:31
  • The issue is that I had to abandon this because it was causing a lot of headaches. Later on I ran into a problem where hrefs with underscores could not return a valid xpath, but only when writing a unit test; it worked fine in the Python shell itself. I was also dealing with horrible HTML, and I found invalid chars in alt attributes. So with a little trial and error I have things working, but I removed the starts-with part of the xpath and now assert the text of the tag separately – Bob Evans Oct 19 '11 at 01:47