Browsers consider an <option> selected by default if it has the selected="selected" attribute. But this also works if the attribute value is omitted.

So this works:

<option selected="selected" value="1">value text</option>

and so does this:

<option selected value="1">value text</option>

My question is how to write a regex pattern that matches both forms above, but never matches something like:

<option value="the devil with **selected** ">value text</option>

EDIT: I forgot to mention that some other forms are still considered valid XHTML, like selected='selected', selected=selected, or even selected=SelEctEd

doc_id
  • I know that regular expressions are not ideal, if useful at all, for parsing XHTML. But in my case there's no way to use other tools like an XML parser – doc_id Dec 22 '15 at 13:09
  • Sorry to say this, but I won't do any thinking before I can see your own attempt at this, even one that does not work. You know exactly what's supposed to go in and what should come out, so I see no reason to do your work. ;) – dryman Dec 22 '15 at 13:15
  • I don't think selected=selected is valid XHTML. – Amarnasan Dec 22 '15 at 13:18
  • Empty attributes are [quite well standardized](http://www.w3.org/TR/html-markup/syntax.html#syntax-attr-empty), and even recommended for those attributes. In X(HT)ML this [is not allowed](http://www.w3.org/TR/2000/REC-xhtml1-20000126/#h-4.5) however. – Niels Keurentjes Dec 22 '15 at 13:23
  • selected is a property; it's just whether it is set or not – Strahinja Djurić Dec 22 '15 at 13:41
  • I know this isn't RegEx - but you are using PHP and it's [so simple using DOMDocument so here is some example code](https://3v4l.org/fllP7). – Dean Taylor Dec 22 '15 at 14:09

2 Answers

With PCRE (which PHP uses) this works:

<option.*?\s(?:selected(?:=\"selected\")?)\s.*?>
# look for <option literally
# followed by anything (non greedy) and a whitespace(!)
# open a non-capturing group and look for selected, optionally followed by ="selected"
# close the group, followed by a whitespace
# followed by anything (non-greedy) and the closing tag

See a regex101 demo here. Also, read the comments; there are good hints (using DOMDocument, etc.) in there.
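
For illustration, one way to keep a regex from firing on the word selected inside a quoted value is to consume each attribute as a whole (name plus optional value) and then compare only the attribute names. The following is a rough sketch in Python rather than PHP; the names OPTION_TAG, ATTRIBUTE and is_selected are made up for this example, and it is still not an HTML parser, so comments, scripts or badly broken quoting can fool it.

```python
import re

# An opening <option ...> tag. Quoted values are consumed whole, so a ">" or
# the word "selected" inside quotes cannot end the tag or fake a match.
OPTION_TAG = re.compile(r'''<option\b((?:"[^"]*"|'[^']*'|[^>"'])*)>''', re.IGNORECASE)

# One attribute: a name, optionally followed by "=" and a quoted or bare value.
ATTRIBUTE = re.compile(r'''([^\s"'<>/=]+)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'<>]+))?''')


def is_selected(markup: str) -> bool:
    """Return True if the first <option> tag in markup has a selected attribute."""
    tag = OPTION_TAG.search(markup)
    if tag is None:
        return False
    # Compare attribute names only; whatever value follows "=" is ignored.
    return any(
        attr.group(1).lower() == "selected"
        for attr in ATTRIBUTE.finditer(tag.group(1))
    )


print(is_selected('<option selected value="1">value text</option>'))
print(is_selected('<option value="the devil with selected ">value text</option>'))
```

Because the check looks at names only, any value after selected= is accepted, which is looser than insisting on the literal value selected.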

Jan
  • Unfortunately that matches `` – doc_id Dec 22 '15 at 15:50
  • True. Can you give an overview of your expected input strings then? And **why** exactly is using a decent DOM parser not an option? – Jan Dec 22 '15 at 17:37
  • I deal with documents that might have broken or incorrectly formatted tags for an automated test process. It's a complicated scenario. But overall I decided to give up on regex and go with DomDocument; I did not know it would handle incorrect XHTML that well. As for your other question, I mentioned in the question that all I want is to detect that attribute in any format a typical browser would. – doc_id Dec 22 '15 at 18:40

After the discussion here, and some other resources like "RegEx match open tags except XHTML self-contained tags", I realized it's impractical to use regular expressions to accurately parse XHTML.
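
As a rough sketch of the parser-based route, here is the same check done with a tolerant HTML parser. It uses Python's standard html.parser module as a stand-in and is not the PHP DOMDocument code mentioned in the comments; the class SelectedOptionFinder and the function selected_option_values are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class SelectedOptionFinder(HTMLParser):
    """Collects the value attribute of every <option> that has a selected attribute."""

    def __init__(self):
        super().__init__()
        self.selected_values = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs, and a bare attribute has the value None.
        if tag == "option" and any(name == "selected" for name, _ in attrs):
            self.selected_values.append(dict(attrs).get("value"))


def selected_option_values(html):
    finder = SelectedOptionFinder()
    finder.feed(html)
    finder.close()
    return finder.selected_values


sample = """<select>
<option value="1">one</option>
<option SELECTED value="2">two</option>
<option value="the devil with selected ">three</option>
<option value='4' selected=SelEctEd>four
</select>"""
print(selected_option_values(sample))
```

The parser decides where each attribute starts and ends, so quoting style, letter case and a missing value need no special handling in the calling code.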

doc_id