I have a simple piece of HTML code.

<tr>
OtherElement
</tr>
<tr>
HelloWorld
</tr>

I need to match the <tr></tr> element containing HelloWorld. I am using this regular expression, but the match starts at the first <tr>, so it includes the first element as well.

<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>

I am using Node.js, so I cannot use lookbehind.

Mr Lister
JeFf

3 Answers

1

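There's an error in your regular expression. The character set in [\s\S]*? is too permissive: [\s\S] matches any character, including < and line breaks, so a match that begins at the first <tr runs on through the first </tr> until it reaches HelloWorld.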

Try the following:

<tr>\s*HelloWorld\s*<\/tr>

\s* matches zero or more whitespace characters and nothing else, so a row is matched only when HelloWorld is its entire content apart from whitespace.
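To see the difference, here is a minimal sketch that runs both patterns in Node.js against the sample from the question. The variable names (`html`, `permissive`, `strict`) are only illustrative and do not come from the original code.

```javascript
// Sample input from the question; `html` is an illustrative name.
const html = "<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>";

// Original pattern: the match starts at the first "<tr", and the lazy
// [\s\S]*? then extends across "</tr>" and "<tr>" until it finds HelloWorld.
const permissive = /<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>/;

// Stricter pattern: only whitespace may sit between the tags and the text.
const strict = /<tr>\s*HelloWorld\s*<\/tr>/;

const permissiveMatch = html.match(permissive)[0];
const strictMatch = html.match(strict)[0];

// JSON.stringify makes the line breaks visible as \n in the output.
console.log(JSON.stringify(permissiveMatch)); // both rows
console.log(JSON.stringify(strictMatch));     // only the HelloWorld row
```

The stricter pattern assumes a bare <tr> tag with no attributes and no nested elements such as <td>, so treat it as a fit for this exact snippet rather than for HTML in general.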

You may also want to reconsider using regex to parse HTML. It can be a useful approach for working with string snippets of known HTML, such as from a database, but in JavaScript you're probably better off using an XML parser or the DOM query selector methods.

Thriggle
  • How is `[\s]` different from `\s`? – torazaburo Jan 08 '16 at 17:24
  • @torazaburo it's not... That's what I get for modifying somebody else's RegEx instead of starting from scratch! Thanks for the correction, I've edited my answer. – Thriggle Jan 08 '16 at 17:32
1

Don't parse HTML with regexps. Instead, use DOM routines and properties:

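function find_hello_world() {
  var trs = document.querySelectorAll('tr');

  for (var i = 0; i < trs.length; i++) {
    // trim() drops the line breaks that surround the text inside each row
    if (trs[i].textContent.trim() === "HelloWorld") return trs[i];
  }

  return null; // no row has exactly this text
}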
1

I assume you receive the HTML fragment as a string. In that case, parse it with a DOM parser and keep only those tr elements that contain (rather than equal) the string HelloWorld. The tr tags have to be renamed to a custom tag name first, because the HTML parser discards tr tags that appear outside a table.

var $txt = "<tr>\nOtherElement\n</tr>\n<tr>Initial text\nHelloWorld\nSome other text</tr>";
var $el = document.createElement('body');
// Rename tr tags to a custom tablerows tag so the parser keeps them
$el.innerHTML = $txt.replace(/<(\/?)tr\b([^<]*)>/g, "<$1tablerows$2>");
var $arr = [];
[].forEach.call($el.getElementsByTagName("tablerows"), function (v) {
    // Keep only the rows whose text contains HelloWorld
    if (v.textContent.indexOf("HelloWorld") > -1) {
        $arr.push(v.textContent);
    }
});
document.write(JSON.stringify($arr, null, 4));

A regex solution is nasty and fragile, but possible:

<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>


The regex uses the unroll-the-loop technique so that the match starts at the <tr closest to HelloWorld and ends at the first </tr> after it.

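  • <tr\b[^<]*> - matches an opening TR tag, including any attributes
  • [^<]*(?:<(?!tr\b)[^<]*)* - matches any text that does not start another <tr tag, up to the literal below
  • HelloWorld - the literal character sequence
  • [^<]*(?:<(?!\/tr>)[^<]*)* - matches everything up to, but not including, the closing </tr>
  • <\/tr> - matches the closing TR tag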
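As a rough check of the breakdown above, the pattern can be run in Node.js without any DOM. This is only a sketch: the names `rowWithHello`, `html`, and `nested` are made up for illustration, and the second input (rows with attributes and <td> cells) is an assumed variation, not part of the question.

```javascript
// The pattern from this answer, with the g flag so match() returns every hit.
const rowWithHello =
  /<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>/g;

// Sample input from the question.
const html = "<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>";

// Assumed variation: attributes on the row and nested <td> cells.
const nested =
  '<tr><td>OtherElement</td></tr><tr class="x"><td>HelloWorld</td></tr>';

// match() returns null when nothing matches, so fall back to an empty array.
const plainRows = html.match(rowWithHello) || [];
const nestedRows = nested.match(rowWithHello) || [];

console.log(plainRows);
console.log(nestedRows);
```

Because the first loop refuses to step over another <tr, a match cannot begin at an earlier row, which is what the lazy [\s\S]*? in the question allowed.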
Wiktor Stribiżew
  • I am using node.js so I can not use DOM parser, but your regex solution works like a charm. Thanks – JeFf Jan 09 '16 at 10:08
  • Not sure if [that answer](http://stackoverflow.com/a/7373003/3832970) is still relevant, but it says you can use the [npm](http://npmjs.org/) modules [jsdom](https://www.npmjs.org/package/jsdom) and [htmlparser](https://www.npmjs.org/package/htmlparser) to create and parse a DOM in Node.JS. – Wiktor Stribiżew Jan 09 '16 at 17:23