-1

I need to parse some log files that resemble the block below.

25 Nov 2010 01:11:13 DEBUG [MSMQListenerService] 
Processing Recipient with Email : email@internet.com - 
<Envelope>
<Body>
<AddRecipient>
<LIST_ID>123456</LIST_ID>
<CREATED_FROM>1</CREATED_FROM>
<UPDATE_IF_FOUND>true</UPDATE_IF_FOUND>
<ALLOW_HTML>true</ALLOW_HTML>
<COLUMN><NAME>EMAIL</NAME><VALUE>email@internet.com</VALUE></COLUMN>
<COLUMN><NAME>AUM</NAME><VALUE>100</VALUE></COLUMN>
<COLUMN><NAME>CITY</NAME><VALUE>New York</VALUE></COLUMN>
<COLUMN><NAME>COMPANY_PROFILE</NAME><VALUE>Building</VALUE></COLUMN>
<COLUMN><NAME>COMPANY_NAME</NAME><VALUE>Company Name</VALUE></COLUMN>
<COLUMN><NAME>COUNTRY_CODE</NAME><VALUE>US</VALUE></COLUMN>
<COLUMN><NAME>FIRST_NAME</NAME><VALUE>My First Name</VALUE></COLUMN>
<COLUMN><NAME>JOB_FUNCTION</NAME><VALUE>My Job</VALUE></COLUMN>
<COLUMN><NAME>LAST_NAME</NAME><VALUE>My Last Name</VALUE></COLUMN>
<COLUMN><NAME>Plan to Buy</NAME><VALUE>Yes</VALUE></COLUMN>
<COLUMN><NAME>STATE</NAME><VALUE></VALUE>NY</COLUMN>
<COLUMN><NAME>Code VALUE</NAME><VALUE>ABCDE_000000_00_00</VALUE></COLUMN>
<COLUMN><NAME>Code Title</NAME><VALUE><![CDATA[Word%3a+Word+Word+to+Word+Words]]></VALUE></COLUMN>
<COLUMN><NAME>ZIP_CODE</NAME><VALUE>11101</VALUE></COLUMN>
<COLUMN><NAME>Form Date</NAME><VALUE>12%2f01%2f2011</VALUE></COLUMN>
</AddRecipient>
</Body>
</Envelope>

But because of the miscellaneous text before each envelope, I can't simply apply XSL or load the file as an XML document. I'm thinking regex is going to be the best solution, but I'm pretty shaky on my regex skills. Basically, I just need what is in the Envelope. Is regex the best approach here? I also have .NET available, if there is anything in the framework that could help.

Thanks!

Dimitre Novatchev
hardba11
  • *[You will not parse XML with regex.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)* However, if it's only a fixed-width header, why not just strip it and parse the XML? (If the header varies a bit, regex is a solution, but then we need to know the details.) –  Feb 07 '11 at 20:26

4 Answers

5

This looks like a normal, well-formed XML file with a couple of lines of header data. Trim off the header, then parse the rest as XML as normal.
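For a log that holds a single envelope, the trimming step might look like the following C# sketch. `HeaderTrimmer` and `ExtractEnvelope` are hypothetical names, the sample log text is made up, and the code assumes the XML starts at the first literal `<Envelope>` tag and that only whitespace follows the closing tag.

```csharp
using System;
using System.Xml.Linq;

static class HeaderTrimmer
{
    // Hypothetical helper: drops everything before the first <Envelope> tag
    // and parses the remainder as XML.
    public static XElement ExtractEnvelope(string logText)
    {
        int start = logText.IndexOf("<Envelope>", StringComparison.Ordinal);
        if (start < 0)
            throw new FormatException("No <Envelope> element found.");

        // Assumption: nothing but whitespace follows the closing </Envelope>.
        return XElement.Parse(logText.Substring(start));
    }

    static void Main()
    {
        // Made-up sample input modelled on the log format in the question.
        string log =
            "25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]\n" +
            "Processing Recipient with Email : email@internet.com - \n" +
            "<Envelope><Body><AddRecipient><LIST_ID>123456</LIST_ID></AddRecipient></Body></Envelope>\n";

        XElement envelope = ExtractEnvelope(log);
        Console.WriteLine(envelope.Element("Body").Element("AddRecipient").Element("LIST_ID").Value);
    }
}
```

If the file contains more than one envelope, parsing the remainder as a single document fails, because the text after the first closing tag is no longer just whitespace.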

Quentin
  • Thanks for the quick response! It is mostly well-formed, but the text outside the XML envelope occurs above every envelope in the document and is slightly different on each occurrence. – hardba11 Feb 07 '11 at 20:29
  • Presumably you can discard every line until you reach one that matches `<Envelope>`? – Quentin Feb 07 '11 at 21:00
1

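/^.*?(<Envelope>.*<\/Envelope>)/s

(The /s modifier lets . match newlines, which is needed because each envelope spans several lines.)

Or, if the document contains many unnested envelopes, loop over the matches (or collect them in an array):

while ( $text =~ /(<Envelope>.*?<\/Envelope>)/gs ) {
    # parse $1 as XML
}

or @envelopes = $text =~ /(<Envelope>.*?<\/Envelope>)/gs;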
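Since the question mentions .NET, the same idea might be sketched in C# as follows. `EnvelopeExtractor` and `FindEnvelopes` are hypothetical names and the sample log is made up. `RegexOptions.Singleline` plays the role of Perl's `/s` modifier, so `.` also matches newlines, and the sketch assumes envelopes are never nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class EnvelopeExtractor
{
    // Non-greedy match so each envelope is captured on its own.
    // Singleline makes '.' match newlines, because envelopes span several lines.
    static readonly Regex EnvelopePattern =
        new Regex("<Envelope>.*?</Envelope>", RegexOptions.Singleline);

    // Hypothetical helper: returns every envelope in the log as a parsed element.
    public static List<XElement> FindEnvelopes(string logText)
    {
        var envelopes = new List<XElement>();
        foreach (Match m in EnvelopePattern.Matches(logText))
            envelopes.Add(XElement.Parse(m.Value));
        return envelopes;
    }

    static void Main()
    {
        // Made-up sample input with two envelopes, each preceded by header text.
        string log =
            "25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]\n" +
            "Processing Recipient with Email : a@example.com - \n" +
            "<Envelope>\n<Body><LIST_ID>1</LIST_ID></Body>\n</Envelope>\n" +
            "25 Nov 2010 01:11:14 DEBUG [MSMQListenerService]\n" +
            "<Envelope>\n<Body><LIST_ID>2</LIST_ID></Body>\n</Envelope>\n";

        foreach (XElement envelope in FindEnvelopes(log))
            Console.WriteLine(envelope.Element("Body").Element("LIST_ID").Value);
    }
}
```

The regex is only used to locate the envelope boundaries; the content of each block is still handed to a real XML parser.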

  • this was what I needed to get started. I posted my entire function below. Thank you! – hardba11 Feb 08 '11 at 01:25
  • Glad it got you started. Remember to remove the `^.*?` part (or just '^') if you expect to match multiple envelope blocks in the same document. –  Feb 08 '11 at 02:10
1

If I understand you correctly, every document contains several envelopes. In that case, you would run into trouble even if you were able to strip off the extra text, because an XML document can have only one root element. One way to work around this might be to put a new start tag at the top of the file and a matching end tag at the bottom. That way the extra leading text is treated as text content in a mixed-content model, and you can process the file with any of your favorite XML tools. (I would advise downloading xsltproc for Windows, or looking for a Windows build of xmlstarlet.)
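A minimal C# sketch of the wrapping idea, assuming the header lines contain no `<` or `&` characters (otherwise the wrapped text is not well-formed and parsing throws). `LogWrapper`, `ParseWrapped`, and the `Log` root element name are made up for illustration, as is the sample log.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class LogWrapper
{
    // Hypothetical helper: wraps the whole log in an artificial root element so
    // the header lines become text nodes between the <Envelope> children.
    public static XElement ParseWrapped(string logText)
    {
        return XElement.Parse("<Log>" + logText + "</Log>");
    }

    static void Main()
    {
        // Made-up sample input with two envelopes and their header lines.
        string log =
            "25 Nov 2010 01:11:13 DEBUG [MSMQListenerService]\n" +
            "<Envelope><Body><LIST_ID>1</LIST_ID></Body></Envelope>\n" +
            "25 Nov 2010 01:11:14 DEBUG [MSMQListenerService]\n" +
            "<Envelope><Body><LIST_ID>2</LIST_ID></Body></Envelope>\n";

        XElement root = ParseWrapped(log);

        // Elements() skips the text nodes, so only the envelopes are visited.
        foreach (XElement envelope in root.Elements("Envelope"))
            Console.WriteLine(envelope.Descendants("LIST_ID").First().Value);
    }
}
```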

Wilfred Springer
  • Surrounding the non-XML text with an element is no guarantee of well-formedness. So at least a regex solution won't fail validation, even if it's not 100 percent correct. –  Feb 07 '11 at 21:12
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Wilfred Springer Feb 08 '11 at 06:33
0

I used @sln's suggestion above and came up with this. It output a valid XML doc for me. I'm marking his answer as correct but thought I should show the entire usage. Thanks

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Non-greedy, and without the leading ^.*?, so that every
        // <Envelope>...</Envelope> block in the file is matched separately.
        const string regxPattern = @"<Envelope>.*?</Envelope>";

        // Singleline (not Multiline) lets '.' match newlines; each envelope
        // spans several lines, so the pattern cannot match without it.
        Regex r = new Regex(regxPattern, RegexOptions.Singleline);

        string stringContent = File.ReadAllText(@"C:\pathtolog\file.log");

        // The using block closes the writer even if an exception is thrown.
        using (TextWriter tw = new StreamWriter(@"C:\pathtolog\output.txt"))
        {
            int matchCount = 0;
            for (Match m = r.Match(stringContent); m.Success; m = m.NextMatch())
            {
                matchCount++;
                // Write each envelope once; the header text before it is skipped.
                tw.WriteLine(m.Value);
            }
            Console.WriteLine("Envelopes written: " + matchCount);
        }

        Console.WriteLine("Hit Any Key to Close...");
        Console.ReadLine();
    }
}
hardba11