I'm scraping a website with Selenium / Python 3. The website only uses ids that look invalid as CSS selectors, like:

<input id="egg:bacon:SPAM" type="text"/>
<input id="egg:sausages:SPAM:SPAM" type="text"/>

(invalid parts are egg:bacon:SPAM & egg:sausages:SPAM:SPAM)

I tried to select these tags with:

driver.find_element_by_css_selector('input#egg:bacon:SPAM')

But of course I get selenium.common.exceptions.InvalidSelectorException.


I also tried using XPath to get my tags, and it works with:

driver.find_element_by_xpath('//input[@id="egg:bacon:SPAM"]')

But my code is built on a homemade library based on CSS selectors. Adding XPath support would require ~200 more lines of code (not counting unit tests, documentation, etc.) just to handle this odd, non-generic behavior.

Plus, scraping this website is part of a bigger project where only this specific website uses that kind of id. Putting in that much effort for one website out of 10 makes me uncomfortable.


I could use something like find_element_by_css_selector('.foo > input:nth-child(2)'), but it's pretty tricky and any small update to the DOM could break the scraper.

Is there any clean way to handle invalid CSS selectors in Selenium using find_element_by_css_selector, or am I doomed to use XPath for this website?

Arount

2 Answers

They are all valid. You need to escape the special characters or use quotes:

driver.find_element_by_css_selector('input[id="egg:bacon:SPAM"]')
driver.find_element_by_css_selector(r'input#egg\:bacon\:SPAM')
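
If the selectors are built by a library rather than typed by hand, the escaping can be automated. The following is only a minimal sketch: the helper names (escape_css_identifier, id_selector, id_attribute_selector) are hypothetical, and it assumes ids that do not start with a digit and contain no newlines, since those cases need extra handling in CSS.

```python
import re


def escape_css_identifier(identifier: str) -> str:
    """Backslash-escape every character outside [A-Za-z0-9_-].

    Assumption: the id does not start with a digit and has no newlines.
    """
    return re.sub(r"([^A-Za-z0-9_-])", r"\\\1", identifier)


def id_selector(tag: str, element_id: str) -> str:
    """Build a 'tag#id' selector with the id escaped (hypothetical helper)."""
    return f"{tag}#{escape_css_identifier(element_id)}"


def id_attribute_selector(tag: str, element_id: str) -> str:
    """Build a 'tag[id="..."]' selector (hypothetical helper).

    Inside the quotes only backslashes and double quotes are escaped here.
    """
    quoted = element_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'{tag}[id="{quoted}"]'


print(id_selector("input", "egg:bacon:SPAM"))
# prints: input#egg\:bacon\:SPAM
print(id_attribute_selector("input", "egg:sausages:SPAM:SPAM"))
# prints: input[id="egg:sausages:SPAM:SPAM"]
```

Either resulting string can then be passed to find_element_by_css_selector as usual.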
Sers

To identify an element whose id attribute contains reserved characters, e.g. egg:bacon:SPAM or egg:sausages:SPAM:SPAM, you can use a dynamic CSS selector with the following wildcards:

  • ^ : To indicate an attribute value starts with
  • * : To indicate an attribute value contains
  • $ : To indicate an attribute value ends with

Solution

You can use the following solutions:

  • To identify the element <input id="egg:bacon:SPAM" type="text"/>:

    driver.find_element_by_css_selector("input[id^='egg'][id*='bacon'][id$='SPAM']")
    
  • To identify the element <input id="egg:sausages:SPAM:SPAM" type="text"/>:

    driver.find_element_by_css_selector("input[id^='egg'][id*='sausages'][id$='SPAM']")
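
    Note that these wildcards match substrings, so one selector can match several ids. The sketch below is not Selenium code: it only emulates the starts-with / contains / ends-with checks in plain Python (the function name matches_wildcards and the list of ids are illustrative assumptions) to show which ids a given combination would select.

    ```python
    def matches_wildcards(id_value: str, prefix: str, substring: str, suffix: str) -> bool:
        """Emulate [id^=prefix][id*=substring][id$=suffix] on a single id value."""
        return (
            id_value.startswith(prefix)      # ^= : starts with
            and substring in id_value        # *= : contains
            and id_value.endswith(suffix)    # $= : ends with
        )


    # Hypothetical ids that might appear together on one page.
    page_ids = ["egg:bacon:SPAM", "egg:bacon:SPAM:SPAM", "egg:sausages:SPAM:SPAM"]

    bacon_hits = [i for i in page_ids if matches_wildcards(i, "egg", "bacon", "SPAM")]
    print(bacon_hits)
    # prints: ['egg:bacon:SPAM', 'egg:bacon:SPAM:SPAM']
    ```

    When two such ids coexist, find_element_by_css_selector would return the first match in document order, so an exact match such as input[id="egg:bacon:SPAM"] is the safer choice when a single specific element is wanted.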
    

undetected Selenium
  • Super nice, it works. But I have a few inputs like `egg:bacon:SPAM` & `egg:bacon:SPAM:SPAM` on the same page. As I understand your answer, it uses _a kind of regex expression_ (`^`, `*`, `$`), and I fear the example I gave in this comment would not be supported by this method. Also, do you have a doc or keyword so I can find documentation about this? (_+1 anyway_) – Arount Feb 21 '20 at 10:41
  • @Arount `^`, `*` and `$` aren't _regex expressions_ as such :) but **wildcards** used with _cssSelectors_. Check out the updated answer and let me know the status. – undetected Selenium Feb 21 '20 at 10:45
  • Thanks, very nice to know and super helpful. I will still accept Sers' answer because it's less verbose (and a `replace(':', '\\:')` at the right place does the job), but I'm keeping my upvote because it's a very good answer (and yeah, wildcards.. ooops :D) – Arount Feb 21 '20 at 10:47
  • Just for the record, I just had a situation where I had to use your wildcards, epic. – Arount Feb 21 '20 at 11:42
  • @Arount This answer is based on best practices which you have to adopt in the long run. – undetected Selenium Feb 21 '20 at 11:55
  • I will stop commenting after this because it does not add quality to your answer, but yeah, it's a game changer for me. Very nice life improvement. 10/10 will use every day from now on – Arount Feb 21 '20 at 11:57