My question: What's a good way to parse the information below?

I have a Java program that gets its input from XML. One feature sends an error email if anything goes wrong during processing. Because parsing the XML could itself be the problem, I want a fallback that uses a regex to pull the email addresses out of the XML (if parsing failed, I can't get the error emails out of the XML the normal way).

Requirements:

  • I want to be able to parse the to, cc, and bcc attributes separately
  • There are other elements which have to, cc, and bcc attributes
  • Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
  • The order of the attributes does not matter.

Here's an example of the XML:

    <error_options
      to="your_email@your_server.com"
      cc="cc_error@your_server.com"
      bcc="bcc_error@your_server.com"
      reply_to="someone_else@their_server.com"
      from="bo_error@some_server.org"
      subject="Error running System at @@TIMESTAMP@@"
      force_send="false"
      max_email_size="10485760"
      oversized_email_action="zip;split_all"
    >

I tried `error_options.{0,100}?to="(.*?)"`, but that matched all the way down to `reply_to`. That made me think there are probably other cases I would miss, which is why I'm posting this as a question.

kentcdodds
  • Do not use a regex to parse XML/HTML; parse it properly and just extract the attribute/value pairs you care about. – Anya Shenanigans Jul 03 '12 at 15:21
  • Well, like I said, one of the features is sending an e-mail to the user if their xml does *not* parse properly. – kentcdodds Jul 03 '12 at 15:23
  • What do you mean by "does not parse properly"? That the XML parser is unable to continue reading it? – Konrad Reiche Jul 03 '12 at 15:24
  • I mean if it's a poorly formatted xml document. For example, they leave off a closing `/>`. – kentcdodds Jul 03 '12 at 15:26
  • If the XML is messed up in an unknown fashion, there's likely no way to reliably extract the information you want with a regex. You could try something like `"error_options.*?\\sreply_to=\"(.+?)\""` and hope for the best. – Jacob Raihle Jul 03 '12 at 15:33

3 Answers

This question is similar to "RegEx match open tags except XHTML self-contained tags". Never parse XML or HTML with regular expressions. There are many XML parser implementations in Java that do this task properly. Read the document and pull out the attributes one by one.

Don't worry if the user's XML is not well-formed; lenient parsers can handle a lot of sloppiness.
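
To make the parser route concrete, here is a minimal sketch using the DOM parser that ships with the JDK. The class name `ErrorOptionsDom`, the helper `attributeOrNull`, and the `<config>` wrapper element are made up for illustration; only `error_options` and its attributes come from the question. Note that this particular parser is strict: it rejects a document with an unclosed tag, so the helper returns null in that case and some fallback is still needed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ErrorOptionsDom {
    /**
     * Returns the given attribute of the first error_options element.
     * Returns "" if the attribute is absent, and null if the document
     * is not well-formed or contains no error_options element.
     */
    static String attributeOrNull(String xml, String name) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("error_options");
            if (nodes.getLength() == 0) {
                return null;
            }
            return ((Element) nodes.item(0)).getAttribute(name);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed input ends up here (as a SAXParseException).
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical documents; <config> is an assumed wrapper element.
        String good = "<config><error_options to=\"a@x.com\" cc=\"b@x.com\"/></config>";
        String bad = "<config><error_options to=\"a@x.com\"></config>";
        System.out.println(attributeOrNull(good, "to")); // prints a@x.com
        System.out.println(attributeOrNull(bad, "to"));  // prints null
    }
}
```

A null result is the signal to switch to whatever fallback extraction is available.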

Konrad Reiche
    /<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
    /<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
    /<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
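
For use from Java, the same idea can be sketched as follows. This is not a literal translation of the patterns above: it swaps the newline lookbehind for a whitespace lookbehind, so an attribute is found whether it sits on its own line or on the same line as the tag name, while `to` still cannot match the tail of `reply_to`. The class and method names are hypothetical, and the sketch assumes double-quoted values and no `>` inside any attribute value that precedes the one being sought.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorOptionsRegex {
    /**
     * Returns the value of the named attribute inside the first
     * <error_options ...> tag, or null if no such attribute is found.
     */
    static String attribute(String xml, String name) {
        Pattern p = Pattern.compile(
                "<error_options\\s"          // tag name followed by whitespace
                + "[^>]*?"                   // stay inside the opening tag
                + "(?<=\\s)"                 // attribute name must follow whitespace
                + Pattern.quote(name)
                + "\\s*=\\s*\"([^\"]*)\"");  // capture the double-quoted value
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<error_options reply_to=\"r@x.org\" bcc=\"b@x.org\" to=\"t@x.org\">";
        System.out.println(attribute(xml, "to"));  // prints t@x.org
        System.out.println(attribute(xml, "bcc")); // prints b@x.org
        System.out.println(attribute(xml, "cc"));  // prints null
    }
}
```

Because `[^>]` is a negated character class, it already matches newlines, so no `DOTALL` flag is needed here.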
Ωmega

This snippet puts every attribute from your string `s = "<error_options..."` into a map:

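    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Group 1: attribute name; group 2: double-quoted value (may be empty).
    Pattern p = Pattern.compile("\\s([\\w:.-]+)\\s*=\\s*\"([^\"]*)\"");
    Map<String, String> a = new HashMap<>();
    Matcher m = p.matcher(s);
    while (m.find()) {
        String key = m.group(1);
        String val = m.group(2).trim();
        a.put(key, val);
    }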

...then you can extract the values that you're interested in from that map.
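
Since other elements also carry to, cc, and bcc attributes, one option is to isolate the `error_options` tag first and collect attributes only from it. A rough sketch, with hypothetical class and method names, assuming double-quoted values and no `>` inside a value:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorOptionsAttributes {
    // Everything between "<error_options" plus one whitespace char and the next '>'.
    private static final Pattern TAG = Pattern.compile("<error_options\\s([^>]*)");
    // name="value" pairs; the value may be empty.
    private static final Pattern ATTR =
            Pattern.compile("([\\w:.-]+)\\s*=\\s*\"([^\"]*)\"");

    /** Returns the attributes of the first error_options tag, in document order. */
    static Map<String, String> parse(String xml) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                attrs.put(attr.group(1), attr.group(2));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        // <job> is a made-up sibling element that also has a "to" attribute.
        String xml = "<job to=\"other@x.com\"/>"
                + "<error_options to=\"me@x.com\" reply_to=\"r@x.com\">";
        Map<String, String> attrs = parse(xml);
        System.out.println(attrs.get("to"));       // prints me@x.com
        System.out.println(attrs.get("reply_to")); // prints r@x.com
    }
}
```

Looking up `to`, `cc`, and `bcc` in the returned map then gives each address list separately, and a missing attribute simply yields null.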

mazaneicha