I have a regex-based XML validator that I would like to use to recognize XML strings. Say I have the following XML string:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<molecules>
  <molecule id="1">
    <atoms>
      <atom id="1" symbol="C"/>
      <atom id="2" symbol="C"/>
      <atom id="3" symbol="N"/>
    </atoms>
    <bonds>
      <bond atomAId="1" atomBId="2" id="1" order="SINGLE"/>
      <bond atomAId="2" atomBId="3" id="2" order="DOUBLE"/>
    </bonds>
  </molecule>
</molecules>

I use the following validator for the XML:

public static boolean isValidXML(String inXMLStr) {

    boolean retBool = false;
    Pattern pattern;
    Matcher matcher;

    // REGULAR EXPRESSION TO SEE IF IT AT LEAST STARTS AND ENDS
    // WITH THE SAME ELEMENT
    final String XML_PATTERN_STR = "<(\\S+?)(.*?)>(.*?)</\\1>";

    // IF WE HAVE A STRING
    if (inXMLStr != null && inXMLStr.trim().length() > 0) {

        // IF WE EVEN RESEMBLE XML
        if (inXMLStr.trim().startsWith("<")) {

            pattern = Pattern.compile(XML_PATTERN_STR,
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

            // RETURN TRUE IF IT HAS PASSED BOTH TESTS
            matcher = pattern.matcher(inXMLStr);
            retBool = matcher.matches();
        }
        // ELSE WE ARE FALSE
    }

    return retBool;
}

However, the method returns false even for valid XML. How do I correct the isValidXML method?

Arefe
  • You have provided a sample XML, but XML allows so many variations that it is difficult to get what you want with regular expressions. Will all your input XML consist of tags with no content in them? The hardest part is deciding what to do when tags are nested inside another tag, because your regular expression shows that you want to validate tag endings. More advanced XML validation can be done with a SAX parser and XSDs. – ProgrammersBlock Jan 04 '17 at 12:46
  • All of my XML will be in a similar format. How can I change the code now? – Arefe Jan 04 '17 at 13:01
  • Why do you want to reinvent the wheel? Use an XML parser. It will throw an error if the XML is not well-formed. If this is for learning purposes, you should learn that it is a bad idea to parse XML with regular expressions. See [Can you provide some examples of why it is hard to parse XML and HTML with a regex](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). – vanje Jan 04 '17 at 21:44
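
To make the parser suggestion from the comments concrete, here is a minimal sketch of a well-formedness check using the JDK's built-in JAXP parser. The class and method names (`XmlCheck`, `isWellFormed`) are hypothetical and chosen only for illustration. The sketch checks well-formedness only, not validity against an XSD.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the names are illustrative, not from the question.
final class XmlCheck {

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        if (xml == null || xml.trim().isEmpty()) {
            return false;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Any parse failure means the input is not well-formed XML.
            return false;
        }
    }
}
```

The input is deliberately not trimmed before parsing: an XML declaration is only allowed at the very start of a document, so leading whitespace in front of `<?xml ...?>` is rejected by the parser.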

1 Answer

Well, if I'm not wrong, this should work:

((<(\\S(.*?))(\\s.*?)?>(.*?)<\\/\\3>)|(<\\S(.*?)(.*?)(\\/>)))

I just tested it on https://regex101.com/ (handy for further tests ;) ) and added the Java escape backslashes.

I basically just escaped the forward slash in the closing tag for the regex and grouped the whole first part of the tag, so that the backreference (\3 in the final pattern above) refers to the whole tag name. If something doesn't work, just let me know :)

Edit: changed it so that it also checks tags with attributes

Edit: after all the editing it got quite messy. It's probably possible to make this look better, but it works this way as far as I can tell.
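
For anyone trying these patterns from Java, a small sketch may help show what they do and do not accept. The class and method names (`RegexXmlDemo`, `matchesWhole`, `findsSomewhere`) and the shortened sample strings are made up for illustration. One thing to keep in mind: regex101 looks for matches anywhere in the input, which corresponds to `Matcher.find()`, while the question's code calls `Matcher.matches()`, which requires the whole string to match.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: the names are illustrative only.
final class RegexXmlDemo {

    // Pattern from the question, with the same flags.
    static final Pattern QUESTION = Pattern.compile(
            "<(\\S+?)(.*?)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL | Pattern.MULTILINE);

    // Pattern from this answer; DOTALL is assumed so '.' can cross newlines.
    static final Pattern ANSWER = Pattern.compile(
            "((<(\\S(.*?))(\\s.*?)?>(.*?)<\\/\\3>)|(<\\S(.*?)(.*?)(\\/>)))",
            Pattern.DOTALL);

    // True only if the entire input matches the pattern.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // True if any substring of the input matches the pattern.
    static boolean findsSomewhere(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String body = "<molecules>\n  <atom id=\"1\" symbol=\"C\"/>\n</molecules>";
        String withDecl = "<?xml version=\"1.0\"?>\n" + body;
        String broken = "<a><b></a>"; // <b> is never closed

        System.out.println(matchesWhole(QUESTION, body));     // true
        System.out.println(matchesWhole(QUESTION, withDecl)); // false: input starts with <?xml
        System.out.println(matchesWhole(ANSWER, withDecl));   // false for the same reason
        System.out.println(findsSomewhere(ANSWER, withDecl)); // true: a partial match exists
        System.out.println(matchesWhole(ANSWER, broken));     // true, although the XML is malformed
    }
}
```

In this sketch, the `<?xml ...?>` declaration at the start of the input is what makes `matches()` return false, because the backreference would have to repeat a name beginning with `?` in the final closing tag. The last line shows the limitation raised in the comments: both patterns only compare the first tag name with a closing tag and do not track nesting.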

Patrick Malik