
I am trying to write a regex that matches HTML tags, and I am running into a few minor issues with heading elements and with the title element in the head.

To explain it better:

<h5>Thing</h5> is matched in its entirety, but I only want <h5> and </h5> to be matched. The same happens with <title>Test</title>: I only want the HTML tags matched, but the regex matches the whole thing.

Here is my regex so far:

/(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig

  • Don't parse HTML using RegEx, you will fail. Use something like the [HTML agility pack](https://htmlagilitypack.codeplex.com/) – Liam Feb 23 '16 at 11:54
  • Rule 1: don't use RegEx to parse HTML. Rule 2: if you still want to parse HTML with RegEx, see rule 1. [RegEx can only match regular languages, and HTML is not a regular language](http://stackoverflow.com/a/590789/930393) – freefaller Feb 23 '16 at 11:56
  • I understand these replies, but this is just for a personal project and I would like to try to finish this regex. Since the only problem I am facing is the one explained above, I don't understand why no one can solve it. – Javi Qualms Pdog Feb 23 '16 at 12:23

1 Answer


Your problem is here: <(\w+).+?(?=>)>

This says:

  1. open an angle bracket
  2. consume as many word characters as possible (min 1)
  3. consume as few characters as possible (min 1)
  4. make sure a closing angle bracket follows
  5. consume the closing angle bracket

First of all, step 4 is superfluous; you know you will have a closing bracket next, otherwise step 5 will fail to match.
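A quick way to convince yourself of this is to run the fragment with and without the lookahead and compare the results. This is only a sketch: the sample strings and variable names are made up for illustration.

```javascript
// The fragment from the question, and the same fragment without the lookahead.
const withLookahead = /<(\w+).+?(?=>)>/g;
const withoutLookahead = /<(\w+).+?>/g;

// Illustrative inputs (not from the question, apart from the first one).
const lookaheadSamples = ["<h5>Thing</h5>", '<a href="x">link</a>', "<p"];

// For each input, collect the matches of both patterns side by side.
const lookaheadComparison = lookaheadSamples.map((s) => [
  s.match(withLookahead),
  s.match(withoutLookahead),
]);

for (const [a, b] of lookaheadComparison) {
  // Both columns should print the same thing on every row.
  console.log(JSON.stringify(a), JSON.stringify(b));
}
```

On these inputs the two patterns agree, which is what you would expect: a lookahead for > immediately followed by a literal > checks the same character twice.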

But the bigger problem is step 3. Let's see what happens on <h5>Thing</h5>:

  1. <
  2. h5 (because > is not a word character any more)
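  3. >Thing</h5, because this is the shortest run of at least one character that is followed by a closing angle bracket (remember, matching 0 characters here is not an option, so the > right after h5 is consumed by this step instead of ending the tag)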
  4. Make sure next is >
  5. >
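The walk-through above can be reproduced in a console. The first pattern below is the one from the question, unchanged. The second is a hypothetical variant, shown only for comparison, where the lookahead is dropped and .+? is changed to .*? so that step 3 is allowed to consume nothing.

```javascript
// Pattern from the question, unchanged.
const questionPattern = /(<\/(\w+)>)|(<(\w+)).+?(?=>)>|(<(\w+))>/ig;

// Hypothetical variant for comparison: no lookahead, and `.*?` instead of
// `.+?`, so the part after the tag name may match zero characters.
const zeroAllowedPattern = /(<\/(\w+)>)|(<(\w+)).*?>/ig;

const headingSample = "<h5>Thing</h5>";

const questionMatches = headingSample.match(questionPattern);
const zeroAllowedMatches = headingSample.match(zeroAllowedPattern);

console.log(questionMatches);    // [ '<h5>Thing</h5>' ]
console.log(zeroAllowedMatches); // [ '<h5>', '</h5>' ]
```

With .+? the first > has to be swallowed by step 3, so the match runs on to the > of the closing tag. Once zero characters are allowed, the opening and closing tags come out as separate matches.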

Anyway, in the simple case, what you want can be done by /<\/?.+?>/. This will break if an attribute value includes a greater-than symbol, as in <div title="a>b">. Avoiding this is possible, but it makes the regexp a bit more complex, kind of like this (but I may have forgotten something):

<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>
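To see the difference between the two patterns, here is a small sketch that runs both on the heading from the question and on the tricky attribute case. The variable names and the text/closing tag added around the div are assumptions for illustration; the global flag is added so that match returns every tag.

```javascript
// Simple pattern: anything between < and the next >.
const simpleTag = /<\/?.+?>/g;

// Attribute-aware pattern: quoted attribute values may contain >.
const attributeAwareTag =
  /<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>/g;

const plainSample = "<h5>Thing</h5>";
const trickySample = '<div title="a>b">text</div>';

const simpleOnPlain = plainSample.match(simpleTag);
const awareOnPlain = plainSample.match(attributeAwareTag);
const simpleOnTricky = trickySample.match(simpleTag);
const awareOnTricky = trickySample.match(attributeAwareTag);

console.log(simpleOnPlain);  // [ '<h5>', '</h5>' ]
console.log(awareOnPlain);   // [ '<h5>', '</h5>' ]
console.log(simpleOnTricky); // [ '<div title="a>', '</div>' ]
console.log(awareOnTricky);  // [ '<div title="a>b">', '</div>' ]
```

The simple pattern cuts the opening div tag short at the > inside the attribute value, while the longer pattern reads the quoted value as a unit. Neither handles comments, doctype declarations, or self-closing tags such as <br/>, so treat both as a starting point for a personal project rather than a general HTML tokenizer.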
Amadan