Parsing invalid XML

Question

I need to parse some log files that resemble the block below.

25 Nov 2010 01:11:13 DEBUG [MSMQListenerService] 
Processing Recipient with Email : email@internet.com - 
<Envelope>
<Body>
<AddRecipient>
<LIST_ID>123456</LIST_ID>
<CREATED_FROM>1</CREATED_FROM>
<UPDATE_IF_FOUND>true</UPDATE_IF_FOUND>
<ALLOW_HTML>true</ALLOW_HTML>
<COLUMN><NAME>EMAIL</NAME><VALUE>email@internet.com</VALUE></COLUMN>
<COLUMN><NAME>AUM</NAME><VALUE>100</VALUE></COLUMN>
<COLUMN><NAME>CITY</NAME><VALUE>New York</VALUE></COLUMN>
<COLUMN><NAME>COMPANY_PROFILE</NAME><VALUE>Building</VALUE></COLUMN>
<COLUMN><NAME>COMPANY_NAME</NAME><VALUE>Company Name</VALUE></COLUMN>
<COLUMN><NAME>COUNTRY_CODE</NAME><VALUE>US</VALUE></COLUMN>
<COLUMN><NAME>FIRST_NAME</NAME><VALUE>My First Name</VALUE></COLUMN>
<COLUMN><NAME>JOB_FUNCTION</NAME><VALUE>My Job</VALUE></COLUMN>
<COLUMN><NAME>LAST_NAME</NAME><VALUE>My Last Name</VALUE></COLUMN>
<COLUMN><NAME>Plan to Buy</NAME><VALUE>Yes</VALUE></COLUMN>
<COLUMN><NAME>STATE</NAME><VALUE></VALUE>NY</COLUMN>
<COLUMN><NAME>Code VALUE</NAME><VALUE>ABCDE_000000_00_00</VALUE></COLUMN>
<COLUMN><NAME>Code Title</NAME><VALUE><![CDATA[Word%3a+Word+Word+to+Word+Words]]></VALUE></COLUMN>
<COLUMN><NAME>ZIP_CODE</NAME><VALUE>11101</VALUE></COLUMN>
<COLUMN><NAME>Form Date</NAME><VALUE>12%2f01%2f2011</VALUE></COLUMN>
</AddRecipient>
</Body>
</Envelope>

But because of the miscellaneous text I can't simply apply XSL or load it as an XML document. I'm thinking regex is going to be the best solution, but I'm pretty shaky on my regex skills. Basically I just need what is in the Envelope. Is regex the best approach here? I also have .NET, if there is anything in the framework that could help here.

Thanks!

*[You will not parse XML with regex.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)* However, if it's only a fixed-width header, why not just strip it and parse the XML? (If the header varies a bit, regex is a solution, but then we need to know the details.) – Feb 07 '11 at 20:26

score 5 · Answer 1 · answered Feb 07 '11 at 20:22

This looks like a normal, well formed XML file with a couple of lines of header data. Trim off the header then parse the rest as XML as normal.

Quentin

Thanks for the quick response! It is mostly well formed, but the text that is outside of the XML envelope occurs above every envelope in the document, and is slightly different on each occurrence. – hardba11 Feb 07 '11 at 20:29
Presumably you can discard every line until you reach one that matches `<Envelope>`? – Quentin Feb 07 '11 at 21:00
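The line-discarding idea from the comments could look roughly like the Python sketch below. It assumes that `<Envelope>` starts a line and `</Envelope>` ends one, as in the sample log; `iter_envelopes` and the shortened `sample` text are made up for illustration.

```python
import xml.etree.ElementTree as ET

def iter_envelopes(lines):
    """Yield each <Envelope>...</Envelope> block as one string, skipping other lines."""
    block = None  # None while we are outside an envelope
    for line in lines:
        stripped = line.strip()
        if block is None:
            if not stripped.startswith("<Envelope>"):
                continue  # header text: discard it
            block = [stripped]
        else:
            block.append(stripped)
        if stripped.endswith("</Envelope>"):
            yield "\n".join(block)
            block = None

# Shortened, hypothetical version of the log from the question.
sample = """25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]
Processing Recipient with Email : email@internet.com -
<Envelope>
<Body>
<AddRecipient>
<LIST_ID>123456</LIST_ID>
<COLUMN><NAME>CITY</NAME><VALUE>New York</VALUE></COLUMN>
</AddRecipient>
</Body>
</Envelope>
"""

for xml_text in iter_envelopes(sample.splitlines()):
    root = ET.fromstring(xml_text)  # each block is a well-formed document on its own
    print(root.find("./Body/AddRecipient/LIST_ID").text)
```

Because each envelope is parsed separately, it does not matter that the header text differs from one occurrence to the next.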

score 1 · Accepted Answer · 2011-02-07T22:03:08.377

/^.*?(<Envelope>.*<\/Envelope>)/s

Or, if there are many unnested envelopes in the same document, loop (or collect the matches in an array). The /s modifier lets . match newlines, which is needed when an envelope spans several lines:

while ( $text =~ /(<Envelope>.*?<\/Envelope>)/gs ) {
    # parse $1 as XML
}

or

@envelopes = $text =~ /(<Envelope>.*?<\/Envelope>)/gs;
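For comparison, here is a rough Python sketch of the same lazy, per-envelope match, followed by a real XML parse of each hit. It assumes envelopes are never nested and that no value contains the literal text `</Envelope>`; `extract_columns` and the two-record `log_text` are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# DOTALL lets '.' match newlines; '.*?' stops at the first closing tag.
ENVELOPE_RE = re.compile(r"<Envelope>.*?</Envelope>", re.DOTALL)

def extract_columns(log_text):
    """Return one {NAME: VALUE} dict per envelope found in log_text."""
    records = []
    for match in ENVELOPE_RE.finditer(log_text):
        root = ET.fromstring(match.group(0))
        record = {}
        for column in root.iter("COLUMN"):
            record[column.findtext("NAME")] = column.findtext("VALUE")
        records.append(record)
    return records

# Hypothetical log with two records and differing header text.
log_text = """25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]
Processing Recipient with Email : a@example.com -
<Envelope><Body><AddRecipient>
<COLUMN><NAME>CITY</NAME><VALUE>New York</VALUE></COLUMN>
</AddRecipient></Body></Envelope>
25 Nov 2010 01:11:14 DEBUG [MSMQListenerService]
Processing Recipient with Email : b@example.com -
<Envelope><Body><AddRecipient>
<COLUMN><NAME>CITY</NAME><VALUE>Boston</VALUE></COLUMN>
</AddRecipient></Body></Envelope>
"""

print(extract_columns(log_text))
```

The regex is only used to find where each envelope starts and ends; reading the elements is left to the XML parser.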

edited Feb 07 '11 at 22:03

answered Feb 07 '11 at 20:31

this was what I needed to get started. I posted my entire function below. Thank you! – hardba11 Feb 08 '11 at 01:25
Glad it got you started. Remember to remove the `^.*?` part (or just '^') if you expect to match multiple envelope blocks in the same document. – Feb 08 '11 at 02:10

score 1 · Answer 3 · answered Feb 07 '11 at 20:36

If I understand you correctly, every document contains several envelopes. In that case you would run into trouble even if you managed to strip off the extra text, because the result would have more than one root element. One way to work around that is to put a new start element at the top of the file and a new end element at the bottom. That way the extra leading text is treated as textual content in a mixed-content model. You can then easily process the file with any of your favorite XML tools. (I would advise you to download xsltproc for Windows, or look for a Windows copy of xmlstarlet.)
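A minimal Python sketch of the wrapping idea follows. The root tag name `log`, the helper `parse_wrapped`, and the sample text are all assumptions for illustration. It only works while the header text contains no `&` or `<`; if it does, the parser raises a `ParseError`.

```python
import xml.etree.ElementTree as ET

def parse_wrapped(log_text, root_tag="log"):
    """Wrap the whole log in one root element and parse it as mixed content."""
    return ET.fromstring("<{0}>{1}</{0}>".format(root_tag, log_text))

# Hypothetical log with two records.
log_text = """25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]
Processing Recipient with Email : a@example.com -
<Envelope><Body><AddRecipient><LIST_ID>1</LIST_ID></AddRecipient></Body></Envelope>
25 Nov 2010 01:11:14 DEBUG [MSMQListenerService]
Processing Recipient with Email : b@example.com -
<Envelope><Body><AddRecipient><LIST_ID>2</LIST_ID></AddRecipient></Body></Envelope>
"""

root = parse_wrapped(log_text)
for envelope in root.findall("Envelope"):
    print(envelope.find("./Body/AddRecipient/LIST_ID").text)

# Text before the first envelope lands in root.text; text after an
# envelope lands in that envelope's .tail.
print(root.text.splitlines()[0])
```

An XSLT stylesheet run against the wrapped file would see the same structure: header lines as text nodes between sibling `Envelope` elements.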

Wilfred Springer

Surrounding the text with an element is no guarantee of well-formedness. So at least a regex solution won't fail validation, even if it's not 100 percent correct. – Feb 07 '11 at 21:12
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Wilfred Springer Feb 08 '11 at 06:33

score 0 · Answer 4 · answered Feb 08 '11 at 01:24

I used @sln's suggestion above and came up with this. It output a valid XML doc for me. I'm marking his answer as correct but thought I should show the entire usage. Thanks

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Lazy match, so each <Envelope>...</Envelope> block is a separate match.
        const string regxPattern = @"<Envelope>.*?</Envelope>";

        var stringContent = File.ReadAllText(@"C:\pathtolog\file.log");

        // Singleline lets '.' match newlines, so envelopes that span
        // several lines are matched as well.
        var r = new Regex(regxPattern, RegexOptions.Singleline);

        // The using block flushes and closes the writer, even on an exception.
        using (TextWriter tw = new StreamWriter(@"C:\pathtolog\output.txt"))
        {
            // Write each envelope once, skipping the header text between them.
            for (Match m = r.Match(stringContent); m.Success; m = m.NextMatch())
            {
                tw.WriteLine(m.Value);
            }
        }

        Console.WriteLine("Hit Any Key to Close...");
        Console.ReadLine();
    }
}
