Regex or DOM for splitting an HTML string with just one level of element depth

Question

I'm supposed to split an HTML string at every occurrence of an element "tag" whose "type" attribute has the value "findMe". There could be other arbitrary attributes and arbitrary innerHTML.

Valid match: <tag type="findMe" any-other-attr="value">badabing</tag>

An example of the intended outcome:

Input:

some html text <br> with some formatting<tag id="1" type="findMe">sample text</tag> yada <tag id="2" type="dontFidMe">sample text</tag>yada

Output:

[0]: some html text <br> with some formatting
[1]: <tag id="2" type="dontFidMe"> yada yada

I've made some progress by building a regular expression to split the string, but it still has issues. If I have adjacent "tag" elements and only one of them has the type attribute "findMe", the regular expression still matches both of them.

(?=<tag.*?type=(?:"|')findMe(?:"|').*?\/tag>)

I know I shouldn't parse HTML with regular expressions, but since I'm dealing with just one level of element depth and I know beforehand what to expect, I wonder which approach would be most efficient in terms of performance and memory.

Parsing the HTML string into an in-memory DOM element, iterating over all the nodes, and splitting at tag elements whose type attribute has the value "findMe"?

OR

Creating a regular expression to find all tag elements with attribute value "findMe"? (if so, any help to improve the above regular expression is welcome)

Don't use Regex to parse HTML, lest you incur the wrath of [Tony the Pony!](http://stackoverflow.com/a/1732454/519413) — Rory McCrossan, Dec 16 '16 at 09:45

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

One simple solution, I believe, would be to change the . inside the tag to [^>], so that it matches anything but the closing >.

(?=<tag[^>]*?type=["']findMe["'])

See it here at regex101. (The substitution with "\n[break]\n" is only to illustrate the split)
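
To make the difference concrete, here is a minimal sketch in Python. The question does not name a language, so Python and its re.split are an assumption (splitting on a zero-width lookahead needs Python 3.7 or later). The sample string and the variable names are made up for illustration.

```python
import re

# Pattern from the question: "." may run past the end of one tag into the next.
ORIGINAL = re.compile(r"""(?=<tag.*?type=(?:"|')findMe(?:"|').*?/tag>)""")
# Pattern from this answer: [^>] keeps the attribute search inside one opening tag.
FIXED = re.compile(r"""(?=<tag[^>]*?type=["']findMe["'])""")

# Hypothetical input: two adjacent elements, only the second one should match.
html = ('intro<tag id="1" type="dontFindMe">a</tag>'
        '<tag id="2" type="findMe">b</tag>tail')

original_parts = ORIGINAL.split(html)
fixed_parts = FIXED.split(html)

print(original_parts)
# ['intro', '<tag id="1" type="dontFindMe">a</tag>', '<tag id="2" type="findMe">b</tag>tail']
print(fixed_parts)
# ['intro<tag id="1" type="dontFindMe">a</tag>', '<tag id="2" type="findMe">b</tag>tail']
```

The lookahead only marks where to cut, so the matching element itself stays at the start of the following piece.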

Note that an attribute value containing >, like <tag someattr="123>456" type="findMe">, would break it though. (That's one reason why regex normally isn't suitable for parsing HTML ;)
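
A small sketch of that failure case, again in Python as an assumption. For contrast it also cuts the same string with the standard library's html.parser, which is only a tokenizer standing in for the "in-memory DOM" option from the question, not a real DOM. FindMeOffsets and split_before_findme are hypothetical names, and the offset handling assumes a single-line input string.

```python
import re
from html.parser import HTMLParser

FIXED = re.compile(r"""(?=<tag[^>]*?type=["']findMe["'])""")

tricky = 'before<tag someattr="123>456" type="findMe">x</tag>after'

# [^>] stops at the ">" inside the attribute value, so no split point is found.
regex_parts = FIXED.split(tricky)


class FindMeOffsets(HTMLParser):
    """Records the string offset of every <tag type="findMe"> start tag."""

    def __init__(self):
        super().__init__()
        self.offsets = []

    def handle_starttag(self, tag, attrs):
        if tag == "tag" and dict(attrs).get("type") == "findMe":
            # getpos() returns (line, column); for single-line input the
            # column is the index of the tag's "<" in the string.
            self.offsets.append(self.getpos()[1])


def split_before_findme(html):
    """Cuts html just before each matching element (single-line input only)."""
    parser = FindMeOffsets()
    parser.feed(html)
    parser.close()
    cuts = [0] + parser.offsets + [len(html)]
    return [html[a:b] for a, b in zip(cuts, cuts[1:]) if a != b]


print(regex_parts)                  # the whole string, unsplit
print(split_before_findme(tricky))  # cut before the <tag ...> element
```

Whether the extra parsing work is worth it depends on how much control you have over the attribute values in the input.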

But I guess the regex approach would be beneficial in your (simple) case.

Edit: Altered the regex101 example to better illustrate the solution.
