How to write a regex to match XML nodes with specific text in a large text stream

Question

I want to use a regex in JavaScript to match all XML nodes in a text file that also contains other text.

I tried <NotificationMessage>(.|\n)+[STATUS_CHANGE]*<\/NotificationMessage> to match the NotificationMessage nodes, but the match is not limited to a single element; it captures additional text as well. I also tried /<NotificationMessage>(.|\r\n)+?<\/NotificationMessage>/g, but this ignores the 'Name' node of the notification shown in the text below.

By this I mean that I want to selectively pick some XML nodes out of a large text file that consists mostly of log padding. This is not about XML parsing, as some people have suggested.

example text:

.. bla b;la bla some text of large log.......<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>
        <Header>
            <Name>STATUS_CHANGE</Name>
            <Description/>
            <SomeOher/>
        </Header>
        <Body>
            <Values>
                <Key="Good" timeStamp="2017-11-01T17:47:11.7107581Z" type="xsd:string"><![CDATA[12343656]]></Key>
            </Values>
        </Body>
        <Faults/>
    </NotificationMessage>
#SOME other text continued..
.. bla b;la bla some text.......

     <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>
        <Header>
            <Name>SOME_OTHER NOTIFICATION</Name>
            <Description/>
            <SomeOher/>
        </Header>
        <Body>
            <Values>
                <Key="Good" timeStamp="2017-11-01T17:47:11.7107581Z" type="xsd:string"><![CDATA[12343656]]></Key>
            </Values>
        </Body>
        <Faults/>
    </NotificationMessage>

#SOME other text with $pec1Al ch@r@cters continued..

Edit 1

I have already tried an alternative solution :

var log = `Long stream of text containing above text with XML`
var regexp = /<NotificationMessage>(.|\r\n)+?<\/NotificationMessage>/g;
var matches_array = log.match(regexp);
for (let i = 0; i < matches_array.length; i++) {
  // keep only the blocks that mention STATUS_CHANGE
  if (matches_array[i].indexOf("STATUS_CHANGE") > 0) {
    console.log(matches_array[i]);
  }
}

But I want to do all this in one regular expression to improve performance. Please also say whether that would actually make a performance difference.

Edit 2

Also, my use case does not involve parsing the extracted XML; I only have to dump it, so I want to avoid using XML parsers.

Please read carefully , i want to selectively pick some XML nodes in a large text files which is containing huge padding data of logs and this is nowhere related to XML Parsing as mentioned by some folks — Samdeesh, Dec 28 '17 at 06:44
Make a note that this is not having any self contained tags, the xml blocks are having a large text(truncated) in between. — Samdeesh, Dec 28 '17 at 07:05
What about taking the problem the other way around? It seems easier to clean your string via regex and then process the resulting output with a xml parser, as described in my answer — Allan, Dec 28 '17 at 08:03

Tomalak · Answer 1 · 2017-12-28T08:35:34.403

You can use this to pick the XML parts specific to your case from the string.

<\?xml[\s\S]*?<\/NotificationMessage>

After that, use a DOM parser and DOM methods (or XPath) to select the correct node and read its value. The following is cited from "Parsing and serializing XML" on the MDN.

var sMyString = '<a id="a"><b id="b">hey!</b></a>';

var oParser = new DOMParser();
var oDOM = oParser.parseFromString(sMyString, "text/xml");
// print the name of the root element or error message
dump(oDOM.documentElement.nodeName == "parsererror" ? "error while parsing" : oDOM.documentElement.nodeName);

I expect that one or two simple calls to getElementsByTagName() would already be sufficient for your situation.

Note 1: If your string contains XML sections other than <NotificationMessage>, a more specific regex must be used to find them:

<\?xml.*?\?><NotificationMessage\s?[\s\S]*?<\/NotificationMessage>

Note 2: If the <NotificationMessage> element can occur nested, this approach will fail.
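
As a rough sketch of how the pattern from Note 1 might be applied in JavaScript: the g flag is added so that every block is returned, and the blocks are then filtered on the Name element with a plain string check. The helper name extractByName and the shortened sampleLog are made up for illustration. The second pattern, statusChangePattern, is not part of this answer; it is a swapped-in single-regex variant that uses a negative look-ahead so a match cannot run past the end of one block. Like the patterns above, it assumes <NotificationMessage> is never nested.

```javascript
// Pattern from Note 1, with the g flag added so that every block is returned.
const blockPattern = /<\?xml.*?\?><NotificationMessage\s?[\s\S]*?<\/NotificationMessage>/g;

// Hypothetical helper (not part of the answer): returns the extracted blocks
// whose <Name> element holds exactly the wanted text.
function extractByName(log, wantedName) {
  const blocks = log.match(blockPattern) || []; // match() returns null when nothing matches
  const marker = "<Name>" + wantedName + "</Name>";
  return blocks.filter((block) => block.includes(marker));
}

// Swapped-in alternative: a single pattern. The negative look-ahead stops the
// first part from crossing a closing tag, so the Name must be inside the block.
const statusChangePattern =
  /<NotificationMessage>(?:(?!<\/NotificationMessage>)[\s\S])*?<Name>STATUS_CHANGE<\/Name>[\s\S]*?<\/NotificationMessage>/g;

// Shortened stand-in for the log in the question (an assumption, not real data).
const sampleLog = [
  'log text <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  "  <Header><Name>STATUS_CHANGE</Name></Header>",
  "</NotificationMessage>",
  "#SOME other text",
  '<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  "  <Header><Name>SOME_OTHER NOTIFICATION</Name></Header>",
  "</NotificationMessage>",
  "trailing text",
].join("\n");

const viaFilter = extractByName(sampleLog, "STATUS_CHANGE");
const viaSingleRegex = sampleLog.match(statusChangePattern) || [];
console.log(viaFilter.length, viaSingleRegex.length);
```

Whether the single pattern is any faster than matching and then filtering depends on the engine and on the input, so it is worth timing both on a real log before choosing one.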

haha! not really the case .. the solution is not what was desired.. /(.|\r\n)+?<\/NotificationMessage>/g gets exacly the notification nodes but the other filters are skipped.. my concern was to optimize the solution and I was unable to tag you earlier @Tomalak — Samdeesh, Dec 28 '17 at 11:51
No. It's complete, i.e. has all the code samples, your own attempt and enough explanation to make sense of it. There's no reason to vote it down. — Tomalak, Dec 29 '17 at 10:10

score 0 · Answer 2 · answered Dec 28 '17 at 07:59

What you can do is process it the other way around:

1) Apply the following regex:

(?<=<\/NotificationMessage>)[^<]*<\?xml version="1\.0" encoding="UTF-8"\?>

to clean everything that is not XML in your string

#SOME other text continued..
.. bla b;la bla some text.......

     <?xml version="1.0" encoding="UTF-8"?>

and replace it with a newline.

2) add a starting tag and an ending tag, <NotificationMessages> and </NotificationMessages>, at the beginning and end of your string.

<NotificationMessages>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
                      ...
 <NotificationMessage>...</NotificationMessage>
</NotificationMessages>

3) use your favorite XML parser to parse the XML tree and extract individually all NotificationMessage XML nodes.

and there you go! ;-)
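
Steps 1 and 2 could be sketched in JavaScript roughly as follows, assuming an engine that supports look-behind (ES2018 or later). The function name wrapNotifications and the shortened sampleLog are made up for illustration. One addition beyond the steps above: the regex only removes text between two blocks, so this sketch also cuts off whatever precedes the first block and follows the last one before adding the wrapper element.

```javascript
// Hypothetical helper (names are illustrative, not from the answer).
function wrapNotifications(log) {
  // Step 1: filler text plus XML declaration that follows a closing tag.
  const between =
    /(?<=<\/NotificationMessage>)[^<]*<\?xml version="1\.0" encoding="UTF-8"\?>/g;
  let cleaned = log.replace(between, "\n");

  // Assumption beyond the original steps: drop the text before the first
  // block and after the last block, which the regex above does not touch.
  const openTag = "<NotificationMessage>";
  const closeTag = "</NotificationMessage>";
  const first = cleaned.indexOf(openTag);
  const last = cleaned.lastIndexOf(closeTag);
  if (first === -1 || last < first) {
    return "<NotificationMessages></NotificationMessages>";
  }
  cleaned = cleaned.slice(first, last + closeTag.length);

  // Step 2: wrap all blocks in a single root element.
  return "<NotificationMessages>\n" + cleaned + "\n</NotificationMessages>";
}

// Shortened stand-in for the log in the question (an assumption, not real data).
const sampleLog = [
  'log text <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  "  <Header><Name>STATUS_CHANGE</Name></Header>",
  "</NotificationMessage>",
  "#SOME other text",
  '<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  "  <Header><Name>SOME_OTHER NOTIFICATION</Name></Header>",
  "</NotificationMessage>",
  "trailing text",
].join("\n");

const wrapped = wrapNotifications(sampleLog);
console.log(wrapped);
```

The resulting string has one root element, so it can be handed to an XML parser for step 3.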

answered Dec 28 '17 at 07:59

Allan


Downvoted for recommending regex to parse XML... *sigh* this is actually a grey area. Downvote undone. – Tomalak Dec 28 '17 at 08:10
I am not using regex to parse the XML at all. I am using it to clean this file in order to parse it with some XML parser. – Allan Dec 28 '17 at 08:17
JS regex does not have look-behind, though. – Tomalak Dec 28 '17 at 08:19
@Allan, Have a look at the edit, this is exactly what I am doing which achieves the solution but I want an optimal way to do it – Samdeesh Dec 28 '17 at 08:20
@Tomalak Thanks for the comment! Nice answer *upvote* :-) – Allan Dec 28 '17 at 08:27
