
I have an XML request, and the objective is to extract only the XML namespace declarations.

    <s:student xmlns:s="http://www.way2tutorial.com/some_url1"
               xmlns:res="http://www.way2tutorial.com/some_url2">
      <r:result>
        <r:name>Opal Kole</r:name>
        <r:sgpa>8.1</r:sgpa>  
        <r:cgpa>8.4</r:cgpa>    
      </r:result>
      <res:cv>
        <res:name>Opal Kole</res:name>  
        <res:cgpa>8.4</res:cgpa>    
      </res:cv>
    </s:student>

I would rather not parse the XML, as XML parsing can be costly. Is there any way to get just the declared XML namespaces?

Expected Output:

xmlns:s="http://www.way2tutorial.com/some_url1"
xmlns:res="http://www.way2tutorial.com/some_url2"

I have even tried using a regular expression, but the expression was incorrect.

Java Code using regular expression:

    String txt = "<s:student xmlns:s=\"http://www.way2tutorial.com/some_url1\" xmlns:res=\"http://www.way2tutorial.com/some_url2\">";


    String regularExpression = "xmlns:(.*?)=(\".*?\")";

    Pattern p = Pattern.compile(regularExpression);
    Matcher m = p.matcher(txt);
    if (m.find()) {
        String word1 = m.group(1);
        System.out.print("(" + word1.toString() + ")" + "\n");

    }
User27854
  • Downvotes are fine, but please take a moment to mention the reason. – User27854 Aug 29 '19 at 12:25
  • No idea why you were downvoted, but you can't use regex, or any other trivial parsing system, to correctly handle XML. Since a simple approach cannot work, you'd better use a standard parser, unless you can guarantee that you're dealing with a sublanguage of XML that uses only simple constructs and lacks most XML features. Also, no, things like SAX or StAX aren't costly, unless you're limited to a couple hundred instructions, in which case don't use Java. – kumesana Aug 29 '19 at 12:29
  • In this project we are using a DOM parser, and the XML can run to thousands of lines. So, instead of parsing the entire XML document, I would like to extract only the XML namespaces for my processing. It's a legacy application, hence the limitations. – User27854 Aug 29 '19 at 12:35
  • Well, use another parser. – kumesana Aug 29 '19 at 12:37
  • Your regex works at https://regex101.com/r/VwlSDN/2 so can you elaborate as to where your Java code is failing? – MonkeyZeus Aug 29 '19 at 12:44
  • My output is (s) (res); I am missing the rest of each declaration. My regular expression was not fully correct. – User27854 Aug 29 '19 at 12:51
  • I see. That's because `m.group(1)` only accounts for the first capturing group which is `(.*?)`. You can switch to `m.group(0)` to get everything matched. – MonkeyZeus Aug 29 '19 at 13:02
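
To illustrate kumesana's point that StAX need not be costly: a pull parser only advances as far as it is asked to, so it should be possible to read the namespace declarations on the root element and stop there. The following is a minimal sketch under that assumption, not code from the thread. The class name `RootNamespaceReader` and the method `rootNamespaces` are hypothetical, and the input is assumed to be well formed at least up to the end of the root start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: collects the namespace declarations found on the root element only.
class RootNamespaceReader {

    static List<String> rootNamespaces(String xml) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Skip the prolog, comments, etc. until the first start tag.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    for (int i = 0; i < reader.getNamespaceCount(); i++) {
                        String prefix = reader.getNamespacePrefix(i);
                        // A missing prefix means a default namespace declaration (xmlns="...").
                        String name = (prefix == null || prefix.isEmpty())
                                ? "xmlns" : "xmlns:" + prefix;
                        result.add(name + "=\"" + reader.getNamespaceURI(i) + "\"");
                    }
                    break; // Stop here: nothing after the root start tag is requested.
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the root element", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<s:student xmlns:s=\"http://www.way2tutorial.com/some_url1\" "
                + "xmlns:res=\"http://www.way2tutorial.com/some_url2\">"
                + "<res:cv><res:name>Opal Kole</res:name></res:cv></s:student>";
        for (String declaration : rootNamespaces(xml)) {
            System.out.println(declaration);
        }
    }
}
```

Unlike a regex, this only reports declarations on the root element, and it rebuilds each declaration from the parsed prefix and URI rather than copying the original text, so quoting and whitespace may differ from the source.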

1 Answer


If you literally just want:

xmlns:s="http://www.way2tutorial.com/some_url1"
xmlns:res="http://www.way2tutorial.com/some_url2"

Then you can use:

xmlns:[^=]+="[^"]+"

https://regex101.com/r/VwlSDN/4
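
As a rough sketch of how this pattern might be used from Java (the class name `NamespaceRegexDemo` and the method `findDeclarations` are hypothetical, not part of the original answer): loop with `while (m.find())` so every declaration is visited, and read `m.group()` (equivalent to `m.group(0)`) to get the whole match rather than a single capturing group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: applies the pattern above to a string of XML text.
class NamespaceRegexDemo {

    // Prefix: everything up to '='. Value: everything up to the closing double quote.
    private static final Pattern XMLNS = Pattern.compile("xmlns:[^=]+=\"[^\"]+\"");

    static List<String> findDeclarations(String xml) {
        List<String> found = new ArrayList<>();
        Matcher m = XMLNS.matcher(xml);
        while (m.find()) {          // 'while', not 'if', so every match is visited
            found.add(m.group());   // whole match, same as m.group(0)
        }
        return found;
    }

    public static void main(String[] args) {
        String txt = "<s:student xmlns:s=\"http://www.way2tutorial.com/some_url1\" "
                + "xmlns:res=\"http://www.way2tutorial.com/some_url2\">";
        for (String declaration : findDeclarations(txt)) {
            System.out.println(declaration);
        }
    }
}
```

This is a plain text match, not XML parsing. It assumes double-quoted values and prefixed declarations, so it would miss a default `xmlns="..."` declaration or a single-quoted value, and it would also match the same text inside comments or CDATA sections.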

MonkeyZeus
  • yes, perfect this will do the job for me – User27854 Aug 29 '19 at 12:45
  • `[a-z]` might miss some prefixes. I believe the prefixes should respect the "Name" grammar defined [here](https://www.w3.org/TR/REC-xml/#NT-Name) – Aaron Aug 29 '19 at 12:48
  • @Aaron Although you are right, I am not patient enough to read the spec so would you settle for `.+?`? I have no intention of building a validator, just a matcher for this use-case. I am sure that `=".+?"` could also be improved but let's just assume that there's nothing wonky in there :-) – MonkeyZeus Aug 29 '19 at 12:59
  • `[^=]+` is probably a bit more performant. I agree there's no reason to validate the XML, only to make sure you don't miss valid prefix names – Aaron Aug 29 '19 at 13:02
  • @Aaron It is more performant but only by a minuscule amount. However, `[^"]+` seems to have saved a lot of steps according to regex101 – MonkeyZeus Aug 29 '19 at 13:09
  • @Aaron One more thing is that `[^=]+` will match new lines but `.+?` does not unless you use the `/s` modifier. I was trying to edit my regex at https://stackoverflow.com/a/57698506/2191572 to get rid of the lazy quantifiers but it broke things. – MonkeyZeus Aug 29 '19 at 13:18