Regex - matching html element with child elements on multiple lines

Question

I have a simple piece of HTML code.

<tr>
OtherElement
</tr>
<tr>
HelloWorld
</tr>

I need to match the <tr></tr> element that contains HelloWorld. I am using this regular expression, but it matches the first element as well.

<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>

I am using Node.js, so I cannot use lookbehind.

Do you really need (_melius abundare quam deficere_) to parse **broken HTML** with regexes? Oh, and where are the child elements on multiple lines?! – Adriano Repetti, Jan 08 '16 at 16:56

Thriggle · Answer 1 · 2016-01-08T17:33:08.223


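There's an error in your regular expression. The pattern [\s\S]*? is too permissive: it matches any character, including line breaks and other tags, so the match can start at the first <tr and run on until it reaches HelloWorld.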

Try the following:

<tr>\s*HelloWorld\s*<\/tr>

\s* means 0 or more whitespace characters and nothing else.
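
To make the difference concrete, here is a small sketch that runs both patterns against the sample from the question in Node.js. The variable names are illustrative only, and it assumes the input looks exactly like the sample (a bare <tr> with only whitespace around the text).

```javascript
// Sample input from the question.
const html = '<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>';

// Original pattern: the lazy [\s\S]*? still starts at the first <tr
// and expands across the first </tr> until it finds HelloWorld.
const permissive = /<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>/;

// Stricter pattern: only whitespace is allowed around the text.
const strict = /<tr>\s*HelloWorld\s*<\/tr>/;

const permissiveMatch = html.match(permissive)[0];
const strictMatch = html.match(strict)[0];

console.log(JSON.stringify(permissiveMatch)); // spans both rows
console.log(JSON.stringify(strictMatch));     // "<tr>\nHelloWorld\n</tr>"
```

Note that the strict pattern stops matching as soon as the row has attributes or any other content besides HelloWorld.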

And you may want to examine why you're using RegEx to parse HTML. This can be a useful approach for working with string snippets of known HTML, such as from a database, but in JavaScript you're probably better off using an XML parser or the DOM query selector methods.

edited Jan 08 '16 at 17:33

answered Jan 08 '16 at 17:02

Thriggle


How is `[\s]` different from `\s`? – Jan 08 '16 at 17:24

@torazaburo it's not... That's what I get for modifying somebody else's RegEx instead of starting from scratch! Thanks for the correction, I've edited my answer. – Thriggle Jan 08 '16 at 17:32

score 1 · Answer 2 · answered Jan 08 '16 at 17:22


Don't parse HTML with regexps. Instead, use DOM routines and properties:

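function find_hello_world() {
  var trs = document.querySelectorAll('tr');

  for (var i = 0; i < trs.length; i++) {
    // trim() ignores the line breaks around the text inside each row
    if (trs[i].textContent.trim() === "HelloWorld") return trs[i];
  }
  return null; // no matching row found
}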


I cannot use the DOM, since I am in a Node.js environment rather than in the browser. – JeFf Jan 09 '16 at 08:36

score 1 · Accepted Answer · answered Jan 09 '16 at 00:34

I assume you receive the HTML fragment as a string. So, you need to parse it with a DOM parser (after replacing all tr tags with a custom tag name, since otherwise parsing will fail) and keep only those tr elements that contain (rather than equal) the string HelloWorld.

var $txt = "<tr>\nOtherElement\n</tr>\n<tr>Initial text\nHelloWorld\nSome other text</tr>";
var $el = document.createElement('body');
// Rename TR tags to a custom "tablerows" tag so the parser keeps them outside a table
$el.innerHTML = $txt.replace(/<(\/?)tr\b([^<]*)>/g, "<$1tablerows$2>");
var $arr = [];
[].forEach.call($el.getElementsByTagName("tablerows"), function(v) {
    // Keep rows whose text contains (not equals) HelloWorld
    if (v.innerText.indexOf("HelloWorld") > -1) {
        $arr.push(v.innerText);
    }
});
document.write(JSON.stringify($arr, null, 4));

A regex solution is nasty and fragile, but possible:

<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>


The regex uses the unroll-the-loop technique to match the closest subpatterns.

<tr\b[^<]*> - matches an opening TR tag
[^<]*(?:<(?!tr\b)[^<]*)* - matches anything except <tr, up to HelloWorld
HelloWorld - the literal sequence
[^<]*(?:<(?!\/tr>)[^<]*)* - matches anything except the closing </tr>
<\/tr> - matches the closing TR tag
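
Since the question is about Node.js, here is a minimal sketch of how the pattern above could be applied to a string there. The helper name matchHelloWorldRows is hypothetical (it is not part of the answer), and the g flag is added on the assumption that all matching rows are wanted.

```javascript
// Hypothetical helper; the pattern itself is the one explained above.
function matchHelloWorldRows(html) {
  const pattern = /<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>/g;
  // match() with the g flag returns every full match, or null if there is none.
  return html.match(pattern) || [];
}

const sample = '<tr>\nOtherElement\n</tr>\n<tr>Initial text\nHelloWorld\nSome other text</tr>';
const rows = matchHelloWorldRows(sample);

console.log(rows.length);             // 1
console.log(JSON.stringify(rows[0])); // "<tr>Initial text\nHelloWorld\nSome other text</tr>"
```

The first row is skipped because the (?!tr\b) lookahead stops the first part of the pattern from crossing into the next <tr before HelloWorld is found.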

I am using Node.js, so I cannot use a DOM parser, but your regex solution works like a charm. Thanks – JeFf, Jan 09 '16 at 10:08
Not sure if [that answer](http://stackoverflow.com/a/7373003/3832970) is still relevant, but it says you can use the [npm](http://npmjs.org/) modules [jsdom](https://www.npmjs.org/package/jsdom) and [htmlparser](https://www.npmjs.org/package/htmlparser) to create and parse a DOM in Node.JS. — Wiktor Stribiżew, Jan 09 '16 at 17:23
