js Regex not working as expected. Newline not getting detected

Question

I have a string as follows:

<abc name = "foo">
  <child>bar</child>
</abc>
<xyz>1</xyz>

<abc name = "foo2">
  <child>bar2</child>
</abc>
<xyz>5</xyz>

I have created a regex as follows:

var array1 = [];
var resApi;
var regexapi = /<abc\s*name\s*=\s*"(.*?)"[\s\S]*?<\/abc>\n*<xyz>/gim;
while ((resApi = regexapi.exec(data)) !== null) {
    array1.push(resApi[0]);
}
console.log(array1[0]);

Now, if the tag <xyz>1</xyz> is absent, printing array1[0] should show undefined, but instead it prints the following:

    <abc name = "foo">
  <child>bar</child>
</abc>

<abc name = "foo2">
  <child>bar2</child>
</abc>
<xyz>

I think there is some problem with \n*, since I'm using the multiline flag, though I'm not sure about this. Note that this output is without the <xyz>1</xyz> tag. I want it to print undefined. Thanks.

What are you actually trying to do here? Also, regex isn't necessarily the best tool for parsing HTML. Actually, JavaScript is an HTML parser, so you might do better using it for this question. — Tim Biegeleisen, Apr 27 '18 at 02:28
I'm taking an XML file as input and I want to store the value in `<xyz>`, which may or may not be present after the `<abc>` tag. If it is not present, I want to store the value as undefined — Rogmier, Apr 27 '18 at 02:32
As @TimBiegeleisen said, using an XML parser such as https://github.com/Leonidas-from-XIV/node-xml2js would be easier than regex. — Sanketh Katta, Apr 27 '18 at 02:34
You can also use Cheerio (https://github.com/cheeriojs/cheerio) and query your data in a jQuery-like way. — Diego ZoracKy, Apr 27 '18 at 02:46
**Don't parse XML with regex; use a real XML parser.** See duplicate link (and many other posts here and across the web) for explanations. — kjhughes, Apr 27 '18 at 12:09

Tim Biegeleisen · Answer 1 · 2018-04-27T03:21:56.493

0

You would be better off using an XML parser here. If you insist on using regex, here is one option:

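// Sample input: the first <xyz> is never closed; the second one holds the value 35.
var input = "<abc name = \"foo\">\n\t<child>bar</child>\n</abc>\n<xyz>\n\n<abc name = \"foo2\">\t\n<child>bar2</child>\n</abc>\n<xyz>35</xyz>";
// Each (?:(?!X)[\s\S])* consumes one character at a time and stops before X,
// so the match cannot run past </abc> or into a second <xyz>.
var regex = /<abc[^>]*>(?:(?!<\/abc>)[\s\S])*<\/abc>\s*<xyz>((?:(?!<xyz>)[\s\S])*?)<\/xyz>/g;
var match = regex.exec(input);
console.log(match[1]); // 35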

This matches an <abc> block followed by optional whitespace and then immediately by an <xyz> tag. If that tag is empty, nothing is captured in the first capture group, so match[1] is an empty string.
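For the case the question asks about (no <xyz> after an <abc> block should yield undefined), one possible variation is to make the <xyz> part optional. In the question's pattern, the lazy [\s\S]*? is still free to expand past the first </abc> when no <xyz> follows it, which is why two blocks end up in one match. The sketch below is only an illustration under the sample input from the question: the helper name collectXyzValues and the result shape are hypothetical, and it remains a brittle substitute for a real XML parser.

```javascript
// Hypothetical helper (not from the original answer): returns one entry per
// <abc> block, with the value of the <xyz> tag that directly follows it,
// or undefined when no <xyz> tag follows.
function collectXyzValues(data) {
  // (?:(?!<\/abc>)[\s\S])* stops before the first </abc>, so one match never
  // spans two blocks. The trailing (?: ... )? makes the <xyz> part optional.
  var pattern = /<abc\s+name\s*=\s*"([^"]*)"[^>]*>(?:(?!<\/abc>)[\s\S])*<\/abc>(?:\s*<xyz>((?:(?!<\/?xyz>)[\s\S])*)<\/xyz>)?/g;
  var results = [];
  var m;
  while ((m = pattern.exec(data)) !== null) {
    // m[2] is undefined when the optional <xyz> group did not participate.
    results.push({ name: m[1], xyz: m[2] });
  }
  return results;
}

// Sample from the question, with <xyz>1</xyz> removed after the first block.
var sample = '<abc name = "foo">\n  <child>bar</child>\n</abc>\n\n' +
  '<abc name = "foo2">\n  <child>bar2</child>\n</abc>\n<xyz>5</xyz>';

console.log(collectXyzValues(sample));
```

Here the first entry has xyz undefined and the second has xyz "5", because a JavaScript capture group that takes no part in a match is reported as undefined.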

edited Apr 27 '18 at 03:21

answered Apr 27 '18 at 02:42

Tried this. But then if the tag is empty it is capturing the value in the next `<xyz>` tag – Rogmier Apr 27 '18 at 03:09
@starkVT Check my updated answer. To get it to work, I needed to add another negative lookahead to make sure it doesn't match across `` tags from different HTML blocks. Hopefully you can see why regex is starting to not look so attractive right now. – Tim Biegeleisen Apr 27 '18 at 03:22
You were very generous with your time to try, but the best answer is really that regex is intrinsically the wrong tool for the job rather than reinforce OP's (and future readers') misconception by providing a partial, brittle solution. – kjhughes Apr 27 '18 at 12:14
@kjhughes So am I deleting this? I could counter your comment by saying that sometimes someone may not have access to an XML parser. – Tim Biegeleisen Apr 27 '18 at 12:27
I've added node.js and browser XML parsing solutions to the duplicate link list. You've taken two steps into the quagmire of XML parsing via regex. It's your call, but if it were me, I'd stop here. Sometimes it's better to say "stay out of the swamp" than to try to address what to do when an endless progression of monsters appear. – kjhughes Apr 27 '18 at 12:43

score 0 · Answer 2 · answered Apr 27 '18 at 03:10

Regex:

<\/abc>\n(?:<xyz>(.*)(?=<\/xyz))*

Matches a </abc> followed by an <xyz> tag and its value. If the <xyz> tag is missing, array[0] will be an empty string (not undefined).
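How the missing case surfaces depends on how the results are collected. As a minimal sketch, assuming the values are read from capture group 1 of exec (the sample input and the variable names are illustrative, not taken from the answer):

```javascript
// Pattern copied from the answer above, with the g flag added for looping.
var tagPattern = /<\/abc>\n(?:<xyz>(.*)(?=<\/xyz))*/g;

// Sample from the question, with <xyz>1</xyz> removed after the first block.
var text = '<abc name = "foo">\n  <child>bar</child>\n</abc>\n\n' +
  '<abc name = "foo2">\n  <child>bar2</child>\n</abc>\n<xyz>5</xyz>';

var values = [];
var hit;
while ((hit = tagPattern.exec(text)) !== null) {
  // hit[1] is undefined when the optional <xyz> part matched zero times.
  values.push(hit[1]);
}
console.log(values);
```

Read this way, the first block contributes undefined and the second contributes "5". Code that coerces the group (for example with `hit[1] || ""`) would show an empty string instead.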

Matt.G

Like all attempts to process XML using regular expressions, it is of course wrong. For example, it doesn't allow for whitespace to appear in places where XML allows whitespace. – Michael Kay Apr 27 '18 at 07:40
