Recursively reading an xml document and using regex to get contents

Question

I have an xml document like the following:

<menuitem navigateurl="/PressCentre/" text="&#1087;&#1088;&#1077;&#1089; &#1094;&#1077;&#1085;&#1090;&#1098;&#1088;">
    <menuitem navigateurl="/PressCentre/RegisterForPressAlerts/" text="&#1088;&#1077;&#1075;&#1080;&#1089;&#1090;&#1098;&#1088; &#1079;&#1072; &#1087;&#1088;&#1077;&#1089; &#1089;&#1098;&#1086;&#1073;&#1097;&#1077;&#1085;&#1080;&#1103;" />
    <menuitem navigateurl="/PressCentre/PressReleases/" text="&#1087;&#1088;&#1077;&#1089; &#1089;&#1098;&#1086;&#1073;&#1097;&#1077;&#1085;&#1080;&#1103;">
        <menuitem navigateurl="/PressCentre/PressReleases/PressReleasesArchive/" text="&#1072;&#1088;&#1093;&#1080;&#1074; &#1087;&#1088;&#1077;&#1089; &#1089;&#1098;&#1086;&#1073;&#1097;&#1077;&#1085;&#1080;&#1103;" />
    </menuitem>
    <menuitem navigateurl="/PressCentre/PressKit/" text="&#1087;&#1088;&#1077;&#1089; &#1082;&#1086;&#1084;&#1087;&#1083;&#1077;&#1082;&#1090;">
        <menuitem navigateurl="/PressCentre/PressKit/FactSheets/" text="&#1089;&#1087;&#1080;&#1089;&#1098;&#1082; &#1092;&#1072;&#1082;&#1090;&#1080;" />
        <menuitem navigateurl="/PressCentre/PressKit/ExpertComments/" text="&#1082;&#1086;&#1084;&#1077;&#1085;&#1090;&#1072;&#1088;&#1080; &#1085;&#1072; &#1077;&#1082;&#1089;&#1087;&#1077;&#1088;&#1090;&#1080;" />
        <menuitem navigateurl="/PressCentre/PressKit/Testimonials/" text="&#1087;&#1088;&#1077;&#1087;&#1086;&#1088;&#1098;&#1082;&#1080;" />
        <menuitem navigateurl="/PressCentre/PressKit/MediaFiles/" text="&#1084;&#1077;&#1076;&#1080;&#1103; &#1092;&#1072;&#1081;&#1083;&#1086;&#1074;&#1077;" />
        <menuitem navigateurl="/PressCentre/PressKit/Photography/" text="&#1089;&#1085;&#1080;&#1084;&#1082;&#1080;" />
    </menuitem>
    <menuitem navigateurl="/PressCentre/PressContacts/" text="&#1087;&#1088;&#1077;&#1089; &#1082;&#1086;&#1085;&#1090;&#1072;&#1082;&#1090;&#1080;" />
</menuitem>

I need to get the value of each navigateurl attribute (e.g. "/PressCentre/"). Is there a well-known regex to do this?

Thanks

Could you please explain a bit more? What do you mean by "get the variable between navigate url"? What is the expected output? — Andrew Hare, May 05 '09 at 15:43
What happened to the C# tag? Is this a C# question or an XSLT one? — annakata, May 05 '09 at 15:49
and fwiw, regex cannot parse non-regular grammar like a nested xml structure — annakata, May 05 '09 at 15:50

annakata · Accepted Answer · 2009-05-05T16:03:41.883

6

A basic recursion (not tested but I think it's ok):

// requires: using System.Xml.XPath;
private void Caller(String filepath)
{
    XPathDocument oDoc = new XPathDocument(filepath);
    ReadNodes(oDoc.CreateNavigator());
}

private void ReadNodes(XPathNavigator nav)
{
    XPathNodeIterator nodes = nav.Select("menuitem");
    while (nodes.MoveNext())
    {
        //A - read the attribute
        string url = nodes.Current.GetAttribute("navigateurl", string.Empty);

        //B - do something with the data

        //C - recurse
        ReadNodes(nodes.Current);
    }
}

...works because an XPathNodeIterator's Current property is also an XPathNavigator. Obviously you'd need to extend this to push data to a dictionary or keep track of depth or whatever.
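To make the "keep track of depth" extension concrete, here is a minimal sketch of the same recursion that records each item's depth. It is written in Java with the JDK's built-in DOM parser rather than the C# XPathNavigator used above, so that it runs standalone. The class name MenuWalker, the collect and walk methods, and the "depth:url" output format are assumptions made for illustration, not part of the answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks nested <menuitem> elements recursively.
class MenuWalker {

    // Returns one "depth:navigateurl" entry per menuitem, in document order.
    static List<String> collect(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            walk(root, 0, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void walk(Element item, int depth, List<String> out) {
        // Ignore anything that is not a menuitem element.
        if (!"menuitem".equals(item.getTagName())) {
            return;
        }
        // A - read the attribute, B - store it together with the depth
        out.add(depth + ":" + item.getAttribute("navigateurl"));
        // C - recurse into child elements (text nodes are skipped)
        for (Node child = item.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<menuitem navigateurl=\"/PressCentre/\">"
                + "<menuitem navigateurl=\"/PressCentre/PressKit/\">"
                + "<menuitem navigateurl=\"/PressCentre/PressKit/FactSheets/\"/>"
                + "</menuitem>"
                + "<menuitem navigateurl=\"/PressCentre/PressContacts/\"/>"
                + "</menuitem>";
        for (String entry : collect(xml)) {
            System.out.println(entry);
        }
    }
}
```

The depth is passed down as a parameter instead of being stored in a field, so every call knows its own nesting level without shared state.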

edited May 05 '09 at 16:03

answered May 05 '09 at 15:48

annakata


Doh! You beat me to it, and with an example, too. – ZombieSheep May 05 '09 at 15:49
heh - but now I'm utterly baffled as to what the question *is* anymore :) – annakata May 05 '09 at 15:51
Thanks! I'll give that a try. There's some new classes for me to learn :) – GurdeepS May 05 '09 at 15:55
Also of note is the fact that you can create XPath expressions that handle the recursion, so you don't need to make the function recurse. For a quick example: http://msdn.microsoft.com/en-us/library/ms256086.aspx – el2iot2 May 05 '09 at 16:07
i.e. nav.Select("//menuitem") should get all menu items recursively – el2iot2 May 05 '09 at 16:11
Yeah, I really should have mentioned that - really good call, and likely an even simpler, better answer in practice - but the reason I've provided this as is is to demonstrate recursion. Also worth noting that "//menuitem" won't give you the option of additional logic i.e. around depth checks. – annakata May 05 '09 at 16:34
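If depth is the only extra logic needed, one possible middle ground is to select every item with "//menuitem" and then ask XPath for each node's depth with count(ancestor::menuitem). The sketch below shows this in Java with the JDK's javax.xml.xpath API (not the C# classes discussed above); the class name FlatMenuDepth and the "depth:url" output format are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: no explicit recursion, depth comes from XPath itself.
class FlatMenuDepth {

    static List<String> collect(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every menuitem at any nesting level.
            NodeList items = (NodeList) xpath.evaluate(
                    "//menuitem", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // The number of menuitem ancestors is the item's depth.
                Double depth = (Double) xpath.evaluate(
                        "count(ancestor::menuitem)", item, XPathConstants.NUMBER);
                out.add(depth.intValue() + ":" + item.getAttribute("navigateurl"));
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }
}
```

This evaluates a second expression per node, so the explicit recursion is likely the better fit when the per-node logic grows beyond a depth check.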

score 1 · Answer 2 · answered May 05 '09 at 15:49

Why use Regex for this when XPath is (to me, at least) the natural choice? That's basically what XSLT should implement...

answered May 05 '09 at 15:49

ZombieSheep


Simply because regex I don't have much experience of. – GurdeepS May 05 '09 at 16:24

score 0 · Answer 3 · answered Oct 26 '11 at 15:35

My post addresses a specific need related to the OP's inquiry, but not specifically what the OP asked. I love both Regex and recursion when I need them, but in this case I think the goal of the OP's inquiry was to learn a way to generate properly formatted XML output. What I've provided below does exactly that with no heavy custom development (why reinvent the wheel?) and is supported as far back as the .NET 2.0 Framework.

In my work, I often end up supporting modern government systems. Those systems often still only support up through 2.0 on deployment systems -- primarily for reasons of security. The 2.0 Framework lacks some of the graceful output of more recent .NET editions, particularly where XML objects are concerned. The fully validated method-set below has been valuable and time-saving to me and I offer it for unseen developer comrades who also service government interests.

You can also use the LinqBridge library for limited LINQ support. (.NET up through 3.5 with its service pack still runs on the 2.0 runtime internally, so LinqBridge was built to bridge that specific gap: limited LINQ query support when targeting a 2.0 build from Visual Studio 2008.) However, note that LinqBridge is currently not supported beyond Visual Studio 2008.

In order to minimize package publish-sizes and also stay compatible with the organizational requirements where I provide my services I avoid using associative non-XML libraries (such as Regex) for parsing XML and stick to standard XML objects. Specifically the older Xml*-prefix objects vs the more modern (and much more flexible) X*-prefix objects...

Below I provide several safe, simple, efficient methods that generate formatted XML from an assortment of standard 2.0 Xml* objects. Also note that the workhorse for these functions is really the XPathNavigator class, not its cousins.

Here is a C# code fragment that calls the sample methods:

doc = new XmlDocument();
doc.Load(Input_FilePath);
sb = StringBuilderFromXmlDocument(doc);
Out(sb);
sb = StringBuilderFromXPathDocument(new XPathDocument(Input_FilePath));
Out(sb);
sb = StringBuilderFromXPathNavigator(new XPathDocument(Input_FilePath).CreateNavigator());
Out(sb);
ss = StringFromXmlDocument(doc);
Out(ss);
ss = StringFromXPathDocument(new XPathDocument(Input_FilePath));
Out(ss);
ss = StringFromXPathNavigator(new XPathDocument(Input_FilePath).CreateNavigator());
Out(ss);

And here are the sample methods, one of which will likely suffice for your XML formatting needs:

public static StringBuilder StringBuilderFromXmlDocument(XmlDocument _xd)
{
    XPathNavigator _xpn;
    try
    {
        _xpn = _xd.CreateNavigator();
    }
    catch
    {
        _xd.LoadXml(DEFAULT_ERROR_TEXT);
        _xpn = _xd.CreateNavigator();
    }
    return StringBuilderFromXPathNavigator(_xpn);
}

private static StringBuilder StringBuilderFromXPathDocument(XPathDocument _xpd)
{
    StringBuilder returnVal = new StringBuilder();
    XPathNavigator _xpn;
    try
    {
        _xpn = _xpd.CreateNavigator();
        returnVal.AppendLine(_xpn.OuterXml.Trim());
    }
    catch
    {
        returnVal = new StringBuilder()
            .Append(DEFAULT_ERROR_TEXT);
    }
    return returnVal;
}

private static StringBuilder StringBuilderFromXPathNavigator(XPathNavigator _xpn)
{
    StringBuilder returnVal = new StringBuilder();
    try
    {
        returnVal.AppendLine(_xpn.OuterXml.Trim());
    }
    catch
    {
        returnVal = new StringBuilder()
            .Append(DEFAULT_ERROR_TEXT);
    }
    return returnVal;
}

public static string StringFromXmlDocument(XmlDocument _xd)
{
    XPathNavigator _xpn;
    try
    {
        _xpn = _xd.CreateNavigator();
    }
    catch
    {
        _xd.LoadXml(DEFAULT_ERROR_TEXT);
        _xpn = _xd.CreateNavigator();
    }
    return StringFromXPathNavigator(_xpn);
}

private static string StringFromXPathNavigator(XPathNavigator _xpn)
{
    string returnVal;
    try
    {
        returnVal = _xpn.OuterXml.Trim();
    }
    catch
    {
        returnVal = DEFAULT_ERROR_TEXT;
    }
    return returnVal;
}

private static string StringFromXPathDocument(XPathDocument _xpd)
{
    string returnVal;
    XPathNavigator _xpn;
    try
    {
        _xpn = _xpd.CreateNavigator();
        returnVal = _xpn.OuterXml.Trim();
    }
    catch
    {
        returnVal = DEFAULT_ERROR_TEXT;
    }
    return returnVal;
}

enjoy. ^^

Note that in later Framework editions and using newer XElement objects you can foreach(){} the XElement's nodes and .ToString() each result for automated proper formatting. Like I said above, much more graceful :).

what was added above about XPath is correct, you'll likely be best-served using that class-grouping to sift XML-structures -- and here's some text to aid 'tag matching': 2, parse, render, search — Hardryv, Oct 26 '11 at 18:12

score 0 · Answer 4 · answered May 05 '09 at 15:51

Any particular reason you're using a regex? Have you tried using XPath for this? Here are some examples of how to use XPath. http://www.w3schools.com/XPath/xpath_examples.asp

answered May 05 '09 at 15:51

Rad


score 0 · Answer 5 · answered May 06 '09 at 12:50

Use XPath: //menuitem[@navigateurl]/@navigateurl

This XPath will grab all the menu items which have a navigateurl attribute and return a node-list (XPath 1.0) or sequence (XPath 2.0) of navigateurl values. By having the navigateurl attribute predicate, that ensures that only the leaf menu items are fetched.

answered May 06 '09 at 12:50

JavaRocky


Your xpath is wrong for leaves actually - "//menuitem[not(menuitem)]" would capture all the leaves only (not that this is what the OP requests). I still like this as a solution generally, but the OP's position on recursion has not been clarified. – annakata May 06 '09 at 13:21
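The difference between the two expressions can be checked with a short sketch. It uses Java's built-in javax.xml.xpath API, which implements XPath 1.0; the class name NavigateUrls and the select method are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath that selects attribute nodes
// and returns their values.
class NavigateUrls {

    static List<String> select(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<menuitem navigateurl=\"/PressCentre/\">"
                + "<menuitem navigateurl=\"/PressCentre/PressKit/\">"
                + "<menuitem navigateurl=\"/PressCentre/PressKit/FactSheets/\"/>"
                + "</menuitem>"
                + "<menuitem navigateurl=\"/PressCentre/PressContacts/\"/>"
                + "</menuitem>";
        // Every item that has the attribute, parents included.
        System.out.println(select(xml, "//menuitem[@navigateurl]/@navigateurl"));
        // Only items without menuitem children, i.e. the leaves.
        System.out.println(select(xml, "//menuitem[not(menuitem)]/@navigateurl"));
    }
}
```

On this input the first expression returns all four URLs, while the second returns only /PressCentre/PressKit/FactSheets/ and /PressCentre/PressContacts/.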

score 0 · Answer 6 · edited Oct 27 '12 at 19:22

How to Recursively read an XML document using regex in Java

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only exists so the snippet compiles.
public class RegexXmlDemo {

    // Matches <tag>text</tag> where the text contains no child elements.
    private static final Pattern PATTERN_1 = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    public static void main(String[] args) {
        String data = "<CheckExistingDSLService>" +
                "<DSLPN>4137361787</DSLPN>" +
                "<DSLPN>8566944014</DSLPN>" +
                "<ClientRequestId>CRID</ClientRequestId>" +
                "<DSLPN>8566944024</DSLPN>" +
                "<ClientSystemId>SSPORD</ClientSystemId>" +
                "<Authentication>" +
                "<Id>SSPORD</Id>" +
                "</Authentication>" +
                "<Comment>Service to check CheckExistingDSL</Comment>" +
                "</CheckExistingDSLService>";
        System.out.print("The data is " + listDataElements(data));
    }

    // Collects the numeric content of every <DSLPN> element.
    private static List<String> listDataElements(CharSequence cs) {
        List<String> list = new ArrayList<String>();
        Matcher matcher = PATTERN_1.matcher(cs);
        while (matcher.find()) {
            if (matcher.group(1).equalsIgnoreCase("DSLPN")) {
                try {
                    Long number = Long.parseLong(matcher.group(2));
                    list.add(number.toString());
                } catch (NumberFormatException e) {
                    System.out.println("Do nothing, this is not a number");
                }
            }
        }
        return list;
    }
}

The output you will get: The data is [4137361787, 8566944014, 8566944024]
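One caveat worth checking before reusing the pattern from this answer: it only matches elements whose content is plain text, so enclosing elements and attribute-only elements such as the question's menuitem tags are never matched. The sketch below demonstrates this; the class name LeafElementRegex and the "name=text" output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists everything the answer's pattern can match.
class LeafElementRegex {

    // Same pattern as in the answer: <tag>text</tag> with no nested tags.
    private static final Pattern LEAF = Pattern.compile("<([^<>]+)>([^<>]+)</\\1>");

    static List<String> matches(CharSequence xml) {
        List<String> found = new ArrayList<>();
        Matcher m = LEAF.matcher(xml);
        while (m.find()) {
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Only the innermost text-bearing elements are found.
        System.out.println(matches("<a><b>1</b><c><d>2</d></c></a>"));
        // Nested, attribute-only elements produce no matches at all.
        System.out.println(matches(
                "<menuitem navigateurl=\"/a/\"><menuitem navigateurl=\"/a/b/\" /></menuitem>"));
    }
}
```

The first call prints [b=1, d=2] and the second prints [], which is why an XML parser or XPath is the safer choice for the nested menu structure in the question.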
