Using lxml.html's cssselect to select element with colon in ID attribute

Question

I have an element in a page that looks like this:

<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>

If you do document.cssselect("#cid-694094:Comment:188384") you will get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment', 12) is unknown

The solution for that is handled in this question (the person was using Java).

However, when I try that in Python as such:

document.cssselect(r"#cid-694094\:Comment\:188384")

I get:

lxml.cssselect.SelectorSyntaxError: Bad symbol 'cid-694094\': 'unicodeescape' codec can't decode byte 0x5c in position 10: \ at end of string at [Token(u'#', 0)] -> None

The reason for that and a proposed solution can be found in this question. If I understand it correctly I should be doing:

document.cssselect(r"#cid-694094\\:Comment\\:188384")

But this still doesn't work. Instead I once again get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment\', 14) is unknown

Can anybody tell me what I'm doing wrong?

Try it yourself using:

import lxml.html
document = lxml.html.fromstring(
    '<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>'
)
document.cssselect(r"#cid-694094\:Comment\:188384")

That's odd, I swear StackOverflow is collapsing backward slashes in the last exception. — Bruce van der Kooij, Dec 13 '11 at 12:16

score 4 · Accepted Answer · edited May 23 '17 at 12:02

Isn't : disallowed in CSS selectors for an id or class?

Here is a work-around:

document.xpath('//a[@id="cid-694094:Comment:188384"]')
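A slightly fuller sketch of the same work-around, assuming a small made-up wrapper `<div>` around the element from the question. It passes the id as an XPath variable, so the colon is treated as data and needs no escaping, and it also tries `get_element_by_id`, which `lxml.html` elements provide. The names `target_id`, `by_xpath` and `by_id` are only for illustration.

```python
import lxml.html

# Sample markup from the question, wrapped in a <div> for illustration.
document = lxml.html.fromstring(
    '<div><a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a></div>'
)
target_id = "cid-694094:Comment:188384"

# XPath variable: the id is passed as a value, not parsed as selector syntax.
by_xpath = document.xpath("//a[@id=$id]", id=target_id)

# Direct lookup by id on the lxml.html element.
by_id = document.get_element_by_id(target_id)

print(by_xpath[0].get("name"), by_id.get("name"))
```

Both lookups bypass the CSS parser entirely, so they should not be affected by how it handles backslashes.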

answered Dec 13 '11 at 12:28

Ski


I'm not sure whether it is allowed. But the question I linked to earlier says you can escape the colon in your selector and it should work. You propose a pretty good workaround, but I assume it would be slower than an actual CSS selector by ID? Because this will have to check all A elements, right? Maybe I can use getElementById... – Bruce van der Kooij Dec 13 '11 at 12:35
Ah, with that link, now I'm pretty sure that it is not allowed. The HTML isn't under my control though, I just scrape it, so I'll just have to work around it. – Bruce van der Kooij Dec 13 '11 at 12:42
Actually, the CSS selector is converted to XPath with `lxml.cssselect.css_to_xpath()` – Ski Dec 13 '11 at 12:42
Very enlightening! It turns out your work-around is almost exactly what cssselect("#id") would do. Thanks. – Bruce van der Kooij Dec 13 '11 at 12:45
Here is a link to source code: https://github.com/lxml/lxml/blob/master/src/lxml/cssselect.py :) – Ski Dec 13 '11 at 12:48

score 1 · Answer 2 · answered Jun 17 '12 at 08:25

: is normally not allowed in ID selectors, and this is indeed the correct way to escape it:

document.cssselect(r"#cid-694094\:Comment\:188384")
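If the id comes from scraped data rather than being typed by hand, the escaping can be done programmatically. Here is a minimal sketch; `id_selector` is a hypothetical helper, not part of lxml or cssselect. It assumes the id does not start with a digit and contains no whitespace or control characters, since those cases would need hex escapes instead.

```python
import re


def id_selector(raw_id):
    """Hypothetical helper: build a CSS ID selector from a raw id value.

    Puts a backslash in front of every character that is not an ASCII
    letter, digit, hyphen or underscore. Assumes the id does not start
    with a digit and has no whitespace or control characters.
    """
    return "#" + re.sub(r"([^A-Za-z0-9_-])", r"\\\1", raw_id)


print(id_selector("cid-694094:Comment:188384"))
```

The result is an ordinary string, so it can be passed to a selector parser that implements backslash escapes.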

However, the selector parser in lxml was badly broken until recently: it did not really implement backslash escapes. I fixed this in cssselect 0.7, which is now an independent project extracted from lxml.

http://packages.python.org/cssselect/

The "new" way to use it is a bit more verbose:

import cssselect
# Raw string, so Python passes the backslashes through to the CSS parser.
document.xpath(cssselect.HTMLTranslator().css_to_xpath(r'#cid-694094\:Comment\:188384'))

lxml 2.4 (not released yet) will use the new cssselect so the simpler syntax will work too.
