I want to use a regex in JavaScript to match all XML nodes in a text file that also contains other text.

I tried <NotificationMessage>(.|\n)+[STATUS_CHANGE]*<\/NotificationMessage> to match the NotificationMessage nodes, but the match is not limited to the element; it captures additional text as well. I also tried /<NotificationMessage>(.|\r\n)+?<\/NotificationMessage>/g, but this ignores the 'Name' node of the notification shown in the text below.

To be clear, I want to selectively pick some XML nodes out of a large text file that consists mostly of log padding. This is not about XML parsing, as some people have suggested.

example text:

.. bla b;la bla some text of large log.......<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>
        <Header>
            <Name>STATUS_CHANGE</Name>
            <Description/>
            <SomeOher/>
        </Header>
        <Body>
            <Values>
                <Key="Good" timeStamp="2017-11-01T17:47:11.7107581Z" type="xsd:string"><![CDATA[12343656]]></Key>
            </Values>
        </Body>
        <Faults/>
    </NotificationMessage>
#SOME other text continued..
.. bla b;la bla some text.......

     <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>
        <Header>
            <Name>SOME_OTHER NOTIFICATION</Name>
            <Description/>
            <SomeOher/>
        </Header>
        <Body>
            <Values>
                <Key="Good" timeStamp="2017-11-01T17:47:11.7107581Z" type="xsd:string"><![CDATA[12343656]]></Key>
            </Values>
        </Body>
        <Faults/>
    </NotificationMessage>

#SOME other text with $pec1Al ch@r@cters continued..

Edit 1

I have already tried an alternative solution:

var log = `Long stream of text containing above text with XML`;
var regexp = /<NotificationMessage>(.|\r\n)+?<\/NotificationMessage>/g;
// match() returns null when nothing matches, so fall back to an empty array
var matches_array = log.match(regexp) || [];
for (let i = 0; i < matches_array.length; i++) {
  // indexOf() returns -1 when the text is absent
  if (matches_array[i].indexOf("STATUS_CHANGE") !== -1) {
    console.log(matches_array[i]);
  }
}

But I want to do all of this in a single regular expression to improve performance. Please also say whether that would really improve performance or not.

Edit 2

Also, my use case does not involve parsing the extracted XML; I only have to dump it, so I want to avoid XML parsers.

Samdeesh
  • Don't use regex to parse XML. – mattdevio Dec 28 '17 at 06:42
  • Please read carefully: I want to selectively pick some XML nodes out of a large text file that consists mostly of log padding. This is not about XML parsing, as some people have suggested. – Samdeesh Dec 28 '17 at 06:44
  • Note that there are no self-contained tags here; the XML blocks have a large amount of (truncated) text in between. – Samdeesh Dec 28 '17 at 07:05
  • What about taking the problem the other way around? It seems easier to clean your string via regex and then process the resulting output with an XML parser, as described in my answer. – Allan Dec 28 '17 at 08:03
  • Please clarify the reason for downvoting the question – Samdeesh Dec 29 '17 at 10:35

2 Answers

You can use this to pick the XML parts specific to your case from the string.

<\?xml[\s\S]*?<\/NotificationMessage>
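
As a rough illustration of how this pattern might be applied in JavaScript, here is a minimal sketch. The sample log is a shortened stand-in for the question's text, and the names log, xmlBlockPattern and xmlBlocks are made up for this example.

```javascript
// Shortened sample input modeled on the question's log (an assumption).
const log = [
  '.. some log text...<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>STATUS_CHANGE</Name></Header>',
  '</NotificationMessage>',
  '#SOME other text continued..',
  '   <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>SOME_OTHER NOTIFICATION</Name></Header>',
  '</NotificationMessage>',
  '#trailing text',
].join('\n');

// The lazy [\s\S]*? stops at the first closing tag after each XML declaration.
const xmlBlockPattern = /<\?xml[\s\S]*?<\/NotificationMessage>/g;

// match() returns null when nothing matches, so fall back to an empty array.
const xmlBlocks = log.match(xmlBlockPattern) || [];

// Each entry is one XML block, starting at its declaration.
xmlBlocks.forEach((block) => console.log(block));
```

The g flag is what makes match() return every block rather than only the first one.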

After that, use a DOM parser and DOM methods (or XPath) to select the correct node and read its value. The following is cited from "Parsing and serializing XML" on MDN.

var sMyString = '<a id="a"><b id="b">hey!</b></a>';

var oParser = new DOMParser();
var oDOM = oParser.parseFromString(sMyString, "text/xml");
// print the name of the root element or an error message
// (console.log is used here because dump() is not available in most environments)
console.log(oDOM.documentElement.nodeName == "parsererror" ? "error while parsing" : oDOM.documentElement.nodeName);

I expect that one or two simple calls to getElementsByTagName() would already be sufficient for your situation.


Note 1: If your string contains XML sections other than <NotificationMessage>, a more specific regex must be used to find them:

<\?xml.*?\?><NotificationMessage\s?[\s\S]*?<\/NotificationMessage>

Note 2: If the <NotificationMessage> element can occur nested, this approach will fail.
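
The question also asks whether the filter on the Name node can be folded into the regex itself. The following is only a sketch of one way that might look, not something proposed in the answer above: it uses a negative lookahead so the match cannot run past a closing tag before it finds the wanted Name. The sample log and the names statusChangePattern and statusChanges are assumptions for illustration, and like the patterns above it assumes the element is not nested.

```javascript
// Shortened sample input modeled on the question's log (an assumption).
const log = [
  '.. some log text...<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>STATUS_CHANGE</Name></Header>',
  '</NotificationMessage>',
  '#SOME other text continued..',
  '   <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>SOME_OTHER NOTIFICATION</Name></Header>',
  '</NotificationMessage>',
  '#trailing text',
].join('\n');

// (?:(?!<\/NotificationMessage>)[\s\S])*? consumes one character at a time,
// but refuses to step over a closing tag, so the <Name> that follows must
// belong to the same NotificationMessage element.
const statusChangePattern =
  /<NotificationMessage>(?:(?!<\/NotificationMessage>)[\s\S])*?<Name>STATUS_CHANGE<\/Name>[\s\S]*?<\/NotificationMessage>/g;

// match() returns null when nothing matches, so fall back to an empty array.
const statusChanges = log.match(statusChangePattern) || [];

statusChanges.forEach((block) => console.log(block));
```

Whether this is actually faster than matching every block and filtering afterwards has not been measured here; the lookahead runs at every character, so it would be worth timing both versions on a real log before choosing.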

Tomalak
  • Haha! Not really the case; the solution is not what was desired. /(.|\r\n)+?<\/NotificationMessage>/g gets exactly the notification nodes, but the other filters are skipped. My concern was to optimize the solution, and I was unable to tag you earlier @Tomalak – Samdeesh Dec 28 '17 at 11:51
  • No. It's complete, i.e. has all the code samples, your own attempt and enough explanation to make sense of it. There's no reason to vote it down. – Tomalak Dec 29 '17 at 10:10

You can process it the other way around:

1) Apply the following regex:

(?<=<\/NotificationMessage>)[^<]*<\?xml version="1\.0" encoding="UTF-8"\?>

to clean everything that is not XML in your string

#SOME other text continued..
.. bla b;la bla some text.......

     <?xml version="1.0" encoding="UTF-8"?>

and replace it with a newline.

2) Add a starting tag and an ending tag, <NotificationMessages> and </NotificationMessages>, at the beginning and end of your string.

<NotificationMessages>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
 <NotificationMessage>...</NotificationMessage>
                      ...
 <NotificationMessage>...</NotificationMessage>
</NotificationMessages>

3) Use your favorite XML parser to parse the XML tree and extract each NotificationMessage node individually.

and there you go! ;-)
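
Steps 1 and 2 above might be sketched in JavaScript roughly as follows. The sample log is a shortened stand-in for the question's text, and the variable names are made up for this example. Trimming the text before the first block and after the last one is an extra assumption: the regex from step 1 only covers the filler between two blocks. The lookbehind (?<=...) also needs a JavaScript engine that supports it.

```javascript
// Shortened sample input modeled on the question's log (an assumption).
const log = [
  'log text...<?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>STATUS_CHANGE</Name></Header>',
  '</NotificationMessage>',
  '#SOME other text continued..',
  '   <?xml version="1.0" encoding="UTF-8"?><NotificationMessage>',
  '  <Header><Name>SOME_OTHER NOTIFICATION</Name></Header>',
  '</NotificationMessage>',
  '#trailing text',
].join('\n');

// Step 1: replace the filler between a closing tag and the next XML
// declaration (declaration included) with a newline.
const betweenBlocks =
  /(?<=<\/NotificationMessage>)[^<]*<\?xml version="1\.0" encoding="UTF-8"\?>/g;
let cleaned = log.replace(betweenBlocks, '\n');

// Extra assumption: also drop whatever precedes the first block and follows
// the last one, since the regex above does not touch those parts.
const openTag = '<NotificationMessage>';
const closeTag = '</NotificationMessage>';
const first = cleaned.indexOf(openTag);
const last = cleaned.lastIndexOf(closeTag);
cleaned = first === -1 || last === -1
  ? ''
  : cleaned.slice(first, last + closeTag.length);

// Step 2: wrap all blocks in a single root element.
const wrapped = `<NotificationMessages>\n${cleaned}\n</NotificationMessages>`;

console.log(wrapped);
```

The wrapped string has one root element, so it can then be handed to an XML parser as described in step 3.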

Allan