I need to parse XML that is embedded in an email body, with extra text before and after it.

I've tried the HTML Agility Pack, but it does not remove the non-XML text.

So how do I clean up a string that contains a complete XML document mixed with other text around it?

var bodyXmlPart= @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
"<ac_application>" +
"    <primary_applicant_data>" +
"       <first_name>Ross</first_name>" +
"       <middle_name></middle_name>" +
"       <last_name>Geller</last_name>" +
"       <ssn>123456789</ssn>" +
"    </primary_applicant_data>" +
"</ac_application> thank you, \n john ";

// How do I clean up the body XML part before loading it as XML?
// This will fail:
var xDoc = XDocument.Parse(bodyXmlPart);
james
  • Are you certain that any instances of < and > that aren't part of the XML will be escaped? If parts of the text that aren't XML contain those characters then it will be very difficult. Otherwise, just trim everything before the first < and everything after the last >. – Scott Hannen Dec 21 '17 at 01:24
  • @ScottHannen it's an email body coming from an unknown source, so yes, it's very likely that ">" will occur at some point. Why should it be difficult can't we use regex? – james Dec 21 '17 at 01:31
  • Thinking really hard for a good answer to that... I've got nothing. I rarely use regex and it didn't even occur to me. – Scott Hannen Dec 21 '17 at 01:42
  • @james: *Why should it be difficult can't we use regex?* Because (1) [regex is fundamentally the wrong way to parse XML](https://stackoverflow.com/q/6751105/290085) and (2) you can't extract a variable structure from within an undefined context -- since you have no ability to restrict or even define what could be before or after your "XML", and you have no way of specifying how your "XML" differs from its context, you fundamentally cannot write a parser to extract the "XML". – kjhughes Dec 21 '17 at 02:24
  • The best you can do here is to treat your string as containing bad / not well-formed "XML", which is a tough problem to solve. See duplicate link for an explanation and multiple options, including several for .NET such as `XmlReader.ReadToFollowing()` or `Microsoft.Language.Xml.XMLParser`. – kjhughes Dec 21 '17 at 04:01
  • @kjhughes - I don't really see how this is a duplicate since that question asks how to parse not-well-formed XML in Java whereas OP wants to scan forward to find embedded, well-formed XML in c#. Anyway, was about to answer the question, here's a working fiddle: https://dotnetfiddle.net/GHxgWc – dbc Dec 21 '17 at 04:05
  • @james - Does the extension method `XDocumentExtensions.ParseEmbeddedDocument()` in https://dotnetfiddle.net/GHxgWc meet your needs? Should I try to make it an answer? – dbc Dec 21 '17 at 04:10
  • @dbc, haven't checked it yet but yes please include it as an answer. – james Dec 21 '17 at 05:24
  • @james - I'll need to reopen the question as a non-duplicate; OK? – dbc Dec 21 '17 at 06:37
  • Re-opened for @dbc to provide a specific answer for OP. Thanks, dbc. – kjhughes Dec 21 '17 at 13:07
  • (It can be a duplicate in that the problem of finding "XML" within a string that may also contain markup characters can be solved by treating the entire string as non-well-formed markup. However, all would be best served to have your full, detailed answer associated with this question directly.) – kjhughes Dec 21 '17 at 13:10
  • @james - Having thought some more, I'm not sure my answer would deal with scenarios where the preface text happens to contain valid embedded XML/HTML, for instance `Important! Please see below client! ....`. Do you know that you are looking for the `<ac_application>` tag? – dbc Dec 21 '17 at 16:38
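
The scan-forward idea discussed in these comments can be sketched roughly as follows. This is not the code from the linked fiddle, which is not reproduced on this page; `EmbeddedXml`, `TryExtract` and the `expectedRoot` parameter are hypothetical names chosen for illustration. The sketch tries each `<` as a possible start and each `>` as a possible end, and lets `XDocument.Parse` decide whether the candidate is well-formed. It is quadratic in the worst case, which is probably acceptable for an email body but not for large inputs.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EmbeddedXml
{
    // Hypothetical helper: returns the first well-formed XML document found
    // inside text, or null if none is found. If expectedRoot is given, only a
    // document whose root element has that local name is accepted.
    public static XDocument TryExtract(string text, string expectedRoot = null)
    {
        // Try every '<' as a possible start of the document.
        for (int start = text.IndexOf('<'); start >= 0; start = text.IndexOf('<', start + 1))
        {
            // Try every '>' after it as a possible end, longest candidate first.
            for (int end = text.LastIndexOf('>'); end > start; end = text.LastIndexOf('>', end - 1))
            {
                try
                {
                    var doc = XDocument.Parse(text.Substring(start, end - start + 1));
                    if (expectedRoot == null || doc.Root.Name.LocalName == expectedRoot)
                        return doc;
                }
                catch (XmlException)
                {
                    // Not well-formed; try a shorter candidate.
                }
            }
        }
        return null;
    }
}
```

Passing the expected root name, for example `EmbeddedXml.TryExtract(bodyXmlPart, "ac_application")`, is one way to skip well-formed markup that happens to appear in the preface text. Without it, the first well-formed fragment wins, whatever it is.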

2 Answers

If you mean that the body can contain any XML, not just ac_application, you can use the following code:

var bodyXmlPart = @"Hi please see below client " +
                  "<ac_application>" +
                  "    <primary_applicant_data>" +
                  "       <first_name>Ross</first_name>" +
                  "       <middle_name></middle_name>" +
                  "       <last_name>Geller</last_name>" +
                  "       <ssn>123456789</ssn>" +
                  "    </primary_applicant_data>" +
                  "</ac_application> thank you, \n john ";

// Requires: using System.Text; using System.Text.RegularExpressions;
StringBuilder pattern = new StringBuilder();

// If the body contains an XML declaration, start the match there.
Regex regex = new Regex(@"<\?xml.*\?>", RegexOptions.Singleline);
var match = regex.Match(bodyXmlPart);
if (match.Success) // There is an XML declaration
{
    pattern.Append(@"<\?xml.*");
}

// Find the first opening tag without attributes and treat it as the root.
Regex regexFirstTag = new Regex(@"\s*<(\w+:)?(\w+)>", RegexOptions.Singleline);
var match1 = regexFirstTag.Match(bodyXmlPart);
if (match1.Success) // The XML has a body and we found the first tag
{
    string firstTag = match1.Value.Trim();
    pattern.Append(firstTag.Replace(">", @"\>" + ".*"));
    // Singleline lets '.' match newlines, so XML spanning several lines is still found.
    Regex regexFullXmlBody = new Regex(pattern.ToString() + @"<\/" + firstTag.Trim('<', '>') + @"\>", RegexOptions.Singleline);
    var matchBody = regexFullXmlBody.Match(bodyXmlPart);
    if (matchBody.Success)
    {
        string xml = matchBody.Value; // the extracted XML, ready for XDocument.Parse(xml)
    }
}

This code can extract any XML and not just ac_application.

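The assumption is that the body contains an XML declaration. The code looks for the declaration, then finds the first opening tag in the body and treats it as the root tag when extracting the entire XML. Note that the first-tag pattern only matches a tag without attributes, and it searches the whole body, so a matching tag in the preface text would be taken as the root.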

Sunil

I'd probably do something like this...

using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Test {

    class Program {
        static void Main(string[] args) {
            var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
            "<ac_application>" +
            "    <primary_applicant_data>" +
            "       <first_name>Ross</first_name>" +
            "       <middle_name></middle_name>" +
            "       <last_name>Geller</last_name>" +
            "       <ssn>123456789</ssn>" +
            "    </primary_applicant_data>" +
            "</ac_application> thank you, \n john ";

            Regex regex = new Regex(@"(?<pre>.*)(?<xml>\<\?xml.*</ac_application\>)(?<post>.*)", RegexOptions.Singleline);
            var match = regex.Match(bodyXmlPart);
            if (match.Success) {
                Debug.WriteLine($"pre={match.Groups["pre"].Value}");
                Debug.WriteLine($"xml={match.Groups["xml"].Value}");
                Debug.WriteLine($"post={match.Groups["post"].Value}");
            }
        }
    }
}

This outputs...

pre=Hi please see below client 
xml=<?xml version="1.0" encoding="UTF-8"?><ac_application>    <primary_applicant_data>       <first_name>Ross</first_name>       <middle_name></middle_name>       <last_name>Geller</last_name>       <ssn>123456789</ssn>    </primary_applicant_data></ac_application>
post= thank you, 
 john 
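
Since the question's goal is to load the result into an `XDocument`, the extracted `xml` group can be handed to `XDocument.Parse`. The following is a minimal sketch under the same assumptions as the regex above (an XML declaration is present and the root element is `ac_application`); `BodyXml` and `Load` are hypothetical names, and the pattern is simplified to the `xml` group only, so it starts at the first `<?xml` rather than the last one.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BodyXml
{
    // Hypothetical helper: extracts the XML with a named group and parses it.
    // Assumes an XML declaration is present and the root element is ac_application.
    public static XDocument Load(string body)
    {
        var match = Regex.Match(body, @"(?<xml><\?xml.*</ac_application>)", RegexOptions.Singleline);
        if (!match.Success)
            throw new FormatException("No embedded ac_application document found.");

        // XDocument.Parse throws XmlException if the extracted text is not well-formed.
        return XDocument.Parse(match.Groups["xml"].Value);
    }
}
```

With the sample body, `BodyXml.Load(bodyXmlPart).Root.Element("primary_applicant_data").Element("first_name").Value` should give `Ross`. If the surrounding text itself contains `<?xml` or `</ac_application>`, the greedy match will include it and the parse will fail, so callers may want to catch `XmlException`.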
K Johnson