I need to extract only the XML from the text below, and I'm trying to figure out the most efficient way to do it. I'm not sure how to approach removing the first 3 lines and getting at the XML. My first thought was to skip everything until the XML opens with '<', but I wasn't sure whether I should read the whole file into one giant string first. Honestly, anything would be helpful. I'm sure it's not as hard as I'm making it out to be, but I'm stuck. Thank you very much!

EDIT: This is a .log file

This is the sample Text Document:

VCS (1.0.11.111): [10/9/2015 12:00:02 AM]
POST https://ex.sample.com/samp/x/sample
Content-Type: application/x-www-form-urlencoded
<?xml version="1.0" encoding="UTF-8"?>
    <command name="sample name_" signature="some stuff" address="sample.com">
    <param name="CurrentVersion">1111</param>
    <param name="MotherboardName">Dell Inc. PowerEdge R420</param>
</command>
HTTP/1.1 200 OK
  • It looks like you are using the wrong property from your HTML document. You are probably using the OuterXML instead of the Body InnerText. The first 3 lines of text are the HTTP request header and the last line is the HTTP status. – jdweng Oct 13 '15 at 13:45

2 Answers

The easiest way would be to find the first index of `<` and the last index of `>`, take the substring between them, and let the .NET XML parser do its work.

But I am not sure if it is the fastest way.
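
As a rough sketch of that idea, assuming the text holds exactly one log entry like the sample in the question: the class and method names (`LogXml`, `ExtractXml`, `ParseEntry`) are made up for illustration, and `XDocument.Parse` stands in for "the .NET XML parser".

```csharp
using System;
using System.Xml.Linq;

public static class LogXml
{
    // Returns the text from the first '<' through the last '>' (inclusive),
    // or null when the entry contains no such pair.
    public static string ExtractXml(string entry)
    {
        int start = entry.IndexOf('<');
        int end = entry.LastIndexOf('>');
        if (start < 0 || end < start)
        {
            return null;
        }
        return entry.Substring(start, end - start + 1);
    }

    // Parses the extracted text; XDocument.Parse throws an XmlException
    // if the substring is not well-formed XML.
    public static XDocument ParseEntry(string entry)
    {
        string xml = ExtractXml(entry);
        return xml == null ? null : XDocument.Parse(xml);
    }
}
```

For the sample entry, `LogXml.ParseEntry(entryText).Root.Name` would be `command`. If the file contains several entries, the first `<` and the last `>` belong to different documents, so the text would have to be split per entry before calling this.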

XML Parsing to class has been answered here

JoeJoe87577
  • Just a thought: maybe it would be faster to split the file into lines and check each line with StartsWith('<'). This way you won't iterate through all chars in the text file. – Y.S Oct 13 '15 at 07:32
  • @Y.S Yeah, you're right. But if he knows that there is always an `HTTP Response Code` at the bottom of the file, he could iterate the file from the end to find the last index, and maybe this would be faster. – JoeJoe87577 Oct 13 '15 at 07:41
  • @JoeJoe87577 I was thinking about doing it this way, or the way Y.S suggested. It's just a matter of speed, so I will try both ways and let you know how it goes! – donwoncruton Oct 13 '15 at 07:49
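
The line-by-line idea from these comments could look roughly like the sketch below. It assumes every entry follows the sample's layout (the XML starts on a line beginning with `<?xml` and is followed by a line beginning with `HTTP/`); `LogLineScanner` and `ExtractXmlBlocks` are hypothetical names, not an existing API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LogLineScanner
{
    // Yields one XML block per log entry. Collection starts at a line that
    // begins with "<?xml" and stops at the next line that begins with "HTTP/".
    public static IEnumerable<string> ExtractXmlBlocks(IEnumerable<string> lines)
    {
        StringBuilder current = null;
        foreach (string line in lines)
        {
            string trimmed = line.TrimStart();
            if (trimmed.StartsWith("<?xml", StringComparison.Ordinal))
            {
                // A new declaration while a block is still open: emit the old block first.
                if (current != null)
                {
                    yield return current.ToString();
                }
                current = new StringBuilder();
                current.AppendLine(trimmed);
            }
            else if (current != null)
            {
                if (trimmed.StartsWith("HTTP/", StringComparison.Ordinal))
                {
                    // The status line closes the entry and is not part of the XML.
                    yield return current.ToString();
                    current = null;
                }
                else
                {
                    current.AppendLine(line);
                }
            }
        }

        // Emit a trailing block that was not followed by a status line.
        if (current != null)
        {
            yield return current.ToString();
        }
    }
}
```

Passing `File.ReadLines(path)` as the argument would stream the log instead of loading it into one big string. The sketch would cut a block short if a line inside the XML itself started with `HTTP/`, so it only fits logs shaped like the sample.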

What about using a regular expression? Try this:

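        // Requires: using System.Text.RegularExpressions;
        // Singleline lets '.' match newlines, so the pattern can span several lines.
        var regex = new Regex(@"<\?xml.*\?>(?<Xml>.*)HTTP/", RegexOptions.Singleline);

        var match = regex.Match(inputString);

        if (match.Success)
        {
            // Text between the XML declaration and "HTTP/".
            var xmlResult = match.Groups["Xml"].Value;
        }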

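The XML that follows the declaration ends up in the variable xmlResult. The `<?xml ... ?>` declaration itself is matched outside the group, so it is not part of the captured value.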

Fischermaen
  • This worked successfully, could you explain what the string stands for at the beginning? I think I understand it, but I was hoping for an explanation behind it so I can implement this through the whole file. – donwoncruton Oct 13 '15 at 08:26
  • '<\?xml.*\?>' forces regex to look for a string beginning with '<?xml' and ending with '?>'. '(?<Xml>.*)' marks a group with the name "Xml" in which any characters are allowed. 'HTTP/' forces regex to find that as a stop to put characters in the group "Xml". Hope that explanation helped. – Fischermaen Oct 14 '15 at 10:36
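
For running the pattern over a whole file with many entries, one possible variation (a sketch, not the answerer's code) is to make both `.*` quantifiers lazy and use `Regex.Matches`. With the greedy `.*`, a single match would run from the first declaration to the last `HTTP/` in the input. `LogRegex` and `ExtractAll` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LogRegex
{
    // ".*?" is lazy: the declaration ends at the first "?>" and the group
    // ends at the first "HTTP/" that follows, so entries are not merged.
    private static readonly Regex EntryPattern = new Regex(
        @"<\?xml.*?\?>(?<Xml>.*?)HTTP/",
        RegexOptions.Singleline);

    // Returns the captured XML body of every entry, without the declaration
    // and with surrounding whitespace trimmed.
    public static List<string> ExtractAll(string logText)
    {
        var results = new List<string>();
        foreach (Match match in EntryPattern.Matches(logText))
        {
            results.Add(match.Groups["Xml"].Value.Trim());
        }
        return results;
    }
}
```

This assumes the literal text `HTTP/` never appears inside the XML itself; if it can, the group would be cut short at that point.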