Since HTML void elements cannot, by definition, contain other elements, it seems safe to me to process this subset of HTML with regular expressions.

For example, I could insert a slash before the closing angle bracket of each void tag, so that the HTML can be processed with XML tools:

s/<((?:area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\b[^/]*?)>/<\1/>/

Is this assumption correct?

Wolf
  • OK, but now try it against `` – ctwheels Dec 03 '19 at 15:53
  • Great catch :) ... but honestly should I give up this idea or just refine the actual regex? – Wolf Dec 03 '19 at 15:54
  • You should give up on it; there's always going to be a way to break it. Try adding `style` to any tag, you'll see you can break your regex with some CSS or even better add JS (e.g. `onclick`) – ctwheels Dec 03 '19 at 15:56
  • Stop processing HTML with regex. Seriously, this has been discussed so often that it's not even funny anymore. Use an HTML parser and then let it output the DOM as XHTML. Traditionally, tidy has been used for this very task. – Tomalak Dec 03 '19 at 15:59
  • Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ctwheels Dec 03 '19 at 15:59
  • I know [this](https://stackoverflow.com/a/1733489/2932052) but in my very case the HTML is the extracted contents of a help file (CHM) that uses a limited subset. – Wolf Dec 03 '19 at 16:00
  • The question is, why try and roll your own regex when there is a standard tool available that already does exactly what you need, and does it in the proper way? – Tomalak Dec 03 '19 at 16:03
  • @ctwheels definitely not. I'm aware of (most of) the limits of regex. The self-closing tags of HTML seemed to be *not recursive* to me, that's why I ask. – Wolf Dec 03 '19 at 16:03
  • They're potentially recursive through attributes. Yes for something as simple as `` you'll be fine to parse it - BUT the moment you add more attributes, you're likely going to need a different pattern (or a pattern would just simply fail) – ctwheels Dec 03 '19 at 16:05

1 Answer

As Tomalak and ctwheels pointed out in their comments, even this limited HTML subset has some potential for recursion through attribute values, although that is not visible at first glance. That is why regular expressions are, once again, not powerful enough.

Processing HTML tags correctly, even void ones, requires the kind of parsing knowledge that browsers have. So switch to a tool like HTML Tidy.
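
To make the failure mode concrete, here is a minimal sketch in Python. It ports the pattern from the question to the `re` module and compares it with a small re-serializer built on the standard library's `html.parser`. The parser is only a stand-in for a real tool such as HTML Tidy, not a replacement for it, and the names `regex_fix`, `VoidCloser` and `to_xhtml_voids` are made up for this illustration.

```python
import re
from html import escape
from html.parser import HTMLParser

VOID = ("area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr")

# The pattern from the question, ported to Python's re module.
VOID_RE = re.compile(r"<((?:" + "|".join(VOID) + r")\b[^/]*?)>")


def regex_fix(html):
    """Apply the question's substitution: add a slash before '>'."""
    return VOID_RE.sub(r"<\1/>", html)


class VoidCloser(HTMLParser):
    """Re-emit the parsed events, writing void elements as <tag ... />."""

    def __init__(self):
        # Keep character references in text as they were written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs, close):
        parts = [tag]
        for name, value in attrs:
            # XML has no valueless attributes, so repeat the name as the value.
            value = name if value is None else value
            # The parser hands over unescaped values; escape them again.
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + ("/>" if close else ">"))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, close=tag in VOID)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, close=True)

    def handle_endtag(self, tag):
        if tag not in VOID:  # drop stray end tags such as </br>
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def to_xhtml_voids(html):
    parser = VoidCloser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(regex_fix('<img alt="a > b">'))         # <img alt="a /> b">
print(regex_fix('<img src="a/b.png">'))       # <img src="a/b.png">
print(to_xhtml_voids('<img alt="a > b">'))    # <img alt="a &gt; b"/>
print(to_xhtml_voids('<img src="a/b.png">'))  # <img src="a/b.png"/>
```

The regex corrupts the first tag, because it stops at the `>` inside the attribute value, and it silently skips the second, because `[^/]` refuses to cross the slash in the path. The parser-based version handles both, but it does nothing about unclosed non-void elements or HTML-only entities, so for real XHTML output a dedicated tool is still the better choice.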

Wolf