
I have an element in a page that looks like this:

<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>

If you do document.cssselect("#cid-694094:Comment:188384") you will get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment', 12) is unknown

The solution for that is handled in this question (the person was using Java).

However, when I try that in Python as such:

document.cssselect(r"#cid-694094\:Comment\:188384")

I get:

lxml.cssselect.SelectorSyntaxError: Bad symbol 'cid-694094\': 'unicodeescape' codec can't decode byte 0x5c in position 10: \ at end of string at [Token(u'#', 0)] -> None

The reason for that and a proposed solution can be found in this question. If I understand it correctly I should be doing:

document.cssselect(r"#cid-694094\\:Comment\\:188384")

But this still doesn't work. Instead I once again get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment\', 14) is unknown

Can anybody tell me what I'm doing wrong?

Try it yourself using:

import lxml.html
document = lxml.html.fromstring(
    '<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>'
)
document.cssselect(r"#cid-694094\:Comment\:188384")
Bruce van der Kooij

2 Answers


Isn't the colon (:) disallowed in CSS ID and class selectors?

Here is a work-around:

document.xpath('//a[@id="cid-694094:Comment:188384"]')
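
If the ID is not known in advance, the same attribute test can be built from a variable. The sketch below is only an illustration: `xpath_for_id` is a hypothetical helper (not part of lxml), it matches any element rather than only `a`, and it uses the standard library's ElementTree as a stand-in so it runs without lxml installed.

```python
import xml.etree.ElementTree as ET


def xpath_for_id(element_id, prefix="//"):
    """Build an XPath expression matching any element with the given id.

    Hypothetical helper, not part of lxml. XPath 1.0 string literals have
    no escape syntax, so an id containing both quote kinds is rejected.
    """
    if '"' not in element_id:
        literal = '"%s"' % element_id
    elif "'" not in element_id:
        literal = "'%s'" % element_id
    else:
        raise ValueError("id contains both quote characters")
    return "%s*[@id=%s]" % (prefix, literal)


# The colon needs no escaping inside an XPath string literal.
expression = xpath_for_id("cid-694094:Comment:188384")

# Stand-in check with the standard library. ElementTree only accepts
# relative paths on an element, hence the ".//" prefix here.
root = ET.fromstring(
    '<div><a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a></div>'
)
matches = root.findall(xpath_for_id("cid-694094:Comment:188384", prefix=".//"))
```

With lxml, passing `expression` to `document.xpath()` should select the same element, assuming the document is parsed as in the question.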
Ski
  • I'm not sure whether it is allowed. But the question I linked to earlier says you can escape the colon in your selector and it should work. You propose a pretty good workaround, but I assume it would be slower than an actual CSS selector by ID, because it has to check every A element, right? Maybe I can use getElementById... – Bruce van der Kooij Dec 13 '11 at 12:35
  • Ah, with that link, now I'm pretty sure that it is not allowed. The HTML isn't under my control though, I just scrape it, so I'll just have to work around it. – Bruce van der Kooij Dec 13 '11 at 12:42
  • Actually, the CSS selector is converted to XPath with `lxml.cssselect.css_to_xpath()` – Ski Dec 13 '11 at 12:42
  • Very enlightening! It turns out your work-around is almost exactly what cssselect("#id") would do. Thanks. – Bruce van der Kooij Dec 13 '11 at 12:45
  • Here is a link to source code: https://github.com/lxml/lxml/blob/master/src/lxml/cssselect.py :) – Ski Dec 13 '11 at 12:48

An unescaped colon (:) is normally not allowed in ID selectors, and this is indeed the correct way to escape it:

document.cssselect(r"#cid-694094\:Comment\:188384")

However, the selector parser in lxml was really broken until recently. (It did not really implement backslash escapes.) I fixed this in cssselect 0.7, which is now an independent project, extracted from lxml.

http://packages.python.org/cssselect/

The "new" way to use it is a bit more verbose:

import cssselect
document.xpath(cssselect.HTMLTranslator().css_to_xpath(r'#cid-694094\:Comment\:188384'))

lxml 2.4 (not released yet) will use the new cssselect so the simpler syntax will work too.
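
For IDs scraped at run time, the backslash escaping shown above could be automated. This is a minimal sketch under stated assumptions: `escape_css_identifier` is a hypothetical helper (not part of cssselect or lxml), and it is deliberately simplified, so it does not handle identifiers that start with a digit, nor newlines or control characters.

```python
import re


def escape_css_identifier(value):
    """Backslash-escape every character other than ASCII letters, digits,
    underscore and hyphen.

    Hypothetical, simplified helper: leading digits, newlines and control
    characters would need hex escapes, which this sketch does not emit.
    """
    return re.sub(r"([^\w-])", r"\\\1", value, flags=re.ASCII)


element_id = "cid-694094:Comment:188384"

# Prepend "#" to turn the escaped identifier into an ID selector.
selector = "#" + escape_css_identifier(element_id)
print(selector)
```

The resulting string is the same selector as the raw-string literal above, so it can be passed to `css_to_xpath()` in its place.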

Simon Sapin