
I have the following regular expression, which I'm using to find <icon use="some-id" class="some-class" />:

(?:<icon )(?=(?:.*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:.*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)

This mostly works, except that if I don't specify a class on the icon, but another element on the same line has one, the regex captures that other element's class, even though the full match is reported as just the icon element.

For example:

<icon use="search" /> <div class="test"></div>

For that line, $1 is search and $2 is test, even though they're not part of the same element, while $& reports <icon use="search" />.

I'm sure I'm missing something obvious about the way regular expressions work.

JacobTheDev
  • My comment would be to use an HTML parser instead of a regex when trying to parse HTML content. – Tim Biegeleisen May 01 '17 at 14:29
  • @TimBiegeleisen I think that's a great idea, but I'm not sure how to go about doing that. This is for part of a gulp task; would you be able to point me in the direction of a tutorial on how to do that with either Gulp or Node? – JacobTheDev May 01 '17 at 15:04
  • http://stackoverflow.com/questions/7372972/how-do-i-parse-a-html-page-with-node-js – Tim Biegeleisen May 01 '17 at 15:06

1 Answer


The .*? just before class= will match anything it has to in order to make the rest of the regex match, including the end of the first tag, the start of the second one, and everything that might lie in between. The only restriction you've placed on it is that it can't cross a line boundary, since . does not match newlines by default. To make this work somewhat more reliably, restrict that part of the regex so that it cannot cross a tag boundary: [^<]*? (zero or more characters that aren't a left angle bracket, matching as few as possible) should do the job. Use * rather than + here, because the attribute may directly follow "<icon " with nothing in between. The same change applies to the .*? before use=.
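To make the difference concrete, here is a minimal sketch in JavaScript (the question mentions a gulp task, so Node is assumed). It runs the question's pattern and a restricted variant against the example line. The variable names are only illustrative, and the variant uses [^<]*? (zero or more) rather than +? so that an attribute sitting directly after "<icon " is still found.

```javascript
// The pattern from the question, unchanged.
const original = /(?:<icon )(?=(?:.*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:.*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)/;

// Same pattern, except that the lazy scan inside each lookahead is [^<]*?
// instead of .*?, so it has to stop at the next "<" and cannot reach
// into a following tag.
const restricted = /(?:<icon )(?=(?:[^<]*?(?:use=(?:"|')(.*?)(?:"|')))?)(?=(?:[^<]*?(?:class=(?:"|')(.*?)(?:"|')))?)(?:.*?)(?: \/)?[^?](?:>)/;

// The example line from the question.
const line = '<icon use="search" /> <div class="test"></div>';

const before = line.match(original);
const after = line.match(restricted);

console.log(before[0], before[1], before[2]); // class is taken from the <div>
console.log(after[0], after[1], after[2]);    // class group is undefined

// An icon that does carry its own class is still captured.
const withClass = '<icon use="search" class="big" /> <div class="test"></div>'
  .match(restricted);
console.log(withClass[1], withClass[2]);
```

This only stops the lookaheads from crossing into the next tag. It does not make the regex a general HTML parser; for example, an attribute value that itself contains "<" would still confuse it.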

jasonharper