Selenium: How to handle invalid CSS selectors in DOM

Question

I'm scraping a website with Selenium and Python 3. The website only uses ids that are invalid when written directly in a CSS selector, such as:

<input id="egg:bacon:SPAM" type="text"/>
<input id="egg:sausages:SPAM:SPAM" type="text"/>

(invalid parts are egg:bacon:SPAM & egg:sausages:SPAM:SPAM)

I tried to select these tags with:

driver.find_element_by_css_selector('input#egg:bacon:SPAM')

But of course I get selenium.common.exceptions.InvalidSelectorException.

I also tried XPath to get my tags, and it works with:

driver.find_element_by_xpath('//input[@id="egg:bacon:SPAM"]')

But my code is built on a homemade library based on CSS selectors. Adding XPath support would require about 200 extra lines of code (not counting unit tests, documentation, etc.) only to handle this one non-generic case.

Plus, scraping this website is part of a bigger project where only this specific website uses that kind of id. Putting that much effort into one website out of 10 makes me uncomfortable.

I could use something like find_element_by_css_selector('.foo > input:nth-child(2)'), but it's pretty fragile: any small update to the DOM could break the scraper.

Is there a clean way to handle these invalid CSS selectors via Selenium using find_element_by_css_selector, or am I doomed to use XPath for this website?

score 2 · Accepted Answer · answered Feb 21 '20 at 10:41

They are all valid. You need to escape the special characters or put the value in quotes inside an attribute selector:

driver.find_element_by_css_selector('input[id="egg:bacon:SPAM"]')
driver.find_element_by_css_selector(r'input#egg\:bacon\:SPAM')
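
For a library that only speaks CSS, the two forms above can be generated from the raw id instead of being written by hand. The following is a minimal sketch, not a Selenium API: the names `css_escape_ident`, `css_quote`, `id_selector` and `attr_id_selector` are hypothetical, and the escaping is a simplified approximation of what the browser's `CSS.escape()` does.

```python
def css_escape_ident(value: str) -> str:
    """Escape a raw string so it can be used as a CSS identifier (e.g. after '#').

    Simplified sketch: ASCII letters, digits, '-', '_' and non-ASCII
    characters are kept, everything else is backslash-escaped.
    """
    if not value:
        raise ValueError("an empty string cannot be used as a CSS identifier")
    if value == "-":
        return "\\-"  # a lone hyphen is not a valid identifier
    out = []
    for i, ch in enumerate(value):
        code = ord(ch)
        # An identifier may not start with a digit, or with '-' then a digit.
        leading_digit = ch in "0123456789" and (
            i == 0 or (i == 1 and value[0] == "-")
        )
        if code < 0x20 or code == 0x7F or leading_digit:
            out.append(f"\\{code:x} ")  # hex escape, terminated by a space
        elif code < 0x80 and not (ch.isalnum() or ch in "-_"):
            out.append("\\" + ch)  # e.g. ':' becomes '\:'
        else:
            out.append(ch)
    return "".join(out)


def css_quote(value: str) -> str:
    """Wrap a raw string in double quotes for use as an attribute value."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\a ")
    )
    return f'"{escaped}"'


def id_selector(tag: str, element_id: str) -> str:
    """Build 'tag#escaped-id'."""
    return f"{tag}#{css_escape_ident(element_id)}"


def attr_id_selector(tag: str, element_id: str) -> str:
    """Build 'tag[id="raw id"]', which needs no identifier escaping."""
    return f"{tag}[id={css_quote(element_id)}]"


print(id_selector("input", "egg:bacon:SPAM"))
print(attr_id_selector("input", "egg:sausages:SPAM:SPAM"))
```

Either resulting string can be passed to find_element_by_css_selector as in the two calls above. The attribute form is usually the easier one to generate, because only quotes, backslashes and newlines need care.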

answered Feb 21 '20 at 10:41

Sers

undetected Selenium · Answer 2 · 2020-02-21T10:56:12.660

1

To identify an element whose id attribute contains reserved characters, e.g. egg:bacon:SPAM or egg:sausages:SPAM:SPAM, you can use CSS attribute selectors with the following substring-matching operators (often loosely called wildcards):

^= : the attribute value starts with the given string
*= : the attribute value contains the given string
$= : the attribute value ends with the given string

Solution

You can use the following solutions:

To identify the element <input id="egg:bacon:SPAM" type="text"/>:

driver.find_element_by_css_selector("input[id^='egg'][id*='bacon'][id$='SPAM']")

To identify the element <input id="egg:sausages:SPAM:SPAM" type="text"/>:

driver.find_element_by_css_selector("input[id^='egg'][id*='sausages'][id$='SPAM']")
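
Note that these three operators only constrain the start, one substring, and the end of the value, so more than one id can satisfy the same selector. The sketch below models that matching in plain Python, with no browser involved. The function `matches` and the sample id list are hypothetical and only illustrate the semantics.

```python
def matches(element_id: str, prefix: str, contains: str, suffix: str) -> bool:
    """Model input[id^=prefix][id*=contains][id$=suffix] against one id."""
    # In CSS, an empty string used with ^=, *= or $= matches nothing.
    if not (prefix and contains and suffix):
        return False
    return (
        element_id.startswith(prefix)
        and contains in element_id
        and element_id.endswith(suffix)
    )


ids = ["egg:bacon:SPAM", "egg:bacon:SPAM:SPAM", "egg:sausages:SPAM:SPAM"]

# Same constraints as input[id^='egg'][id*='bacon'][id$='SPAM']
loose_hits = [i for i in ids if matches(i, "egg", "bacon", "SPAM")]
print(loose_hits)  # ['egg:bacon:SPAM', 'egg:bacon:SPAM:SPAM']

# A longer suffix, as in [id$='bacon:SPAM'], narrows it to a single id
strict_hits = [i for i in ids if matches(i, "egg", "bacon", "bacon:SPAM")]
print(strict_hits)  # ['egg:bacon:SPAM']
```

When ids on the same page share a prefix and a suffix, a longer suffix or an exact match such as input[id="egg:bacon:SPAM"] avoids the ambiguity.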

edited Feb 21 '20 at 10:56

answered Feb 21 '20 at 10:35

undetected Selenium

Super nice, it works. But I have a few inputs like `egg:bacon:SPAM` & `egg:bacon:SPAM:SPAM` on the same page. As I understand your answer, it uses _a kind of regex expression_ (`^`, `*`, `$`), and I fear the example I gave in this comment would not be supported by this method. Also, do you have a doc or keyword so I can find documentation about this? (_+1 anyway_) – Arount Feb 21 '20 at 10:41

@Arount `^`, `*` and `$` aren't _regex expressions_ as such :) but **wildcards** used with _cssSelectors_. Check out the updated answer and let me know the status. – undetected Selenium Feb 21 '20 at 10:45

Thanks, very nice to know and super helpful. I will still accept Sers' answer because it's less verbose (and a `replace(':', '\\:')` at the right place does the job), but I'm keeping my upvote because it's a very good answer (and yeah, wildcards.. ooops :D) – Arount Feb 21 '20 at 10:47

Just for the record, I just had a situation where I had to use your wildcards, epic. – Arount Feb 21 '20 at 11:42

@Arount This answer is based on best practices which you will have to adopt in the long run. – undetected Selenium Feb 21 '20 at 11:55

I will stop commenting after this because it does not add quality to your answer, but yeah, it's a game changer for me. Very nice life improvement. 10/10, will use every day from now on. – Arount Feb 21 '20 at 11:57
