
My problem: I am parsing a large set of XML-based logs (which I have little control over) into MySQL statements in order to migrate from an XML-based database to MySQL. This part has me stumped.

If I look at the IEnumerable<XElement> that contains the string I'm interested in, I can see the embedded XML element. However, if I take the value of that element, the embedded XML element disappears. For example:

IEnumerable (<PowerFail /> is visible):

<StepDetails>Set input voltage to 2.80V WDT should allow CPU power.  CPU should detect PowerFail signal and output a<PowerFail /> tag to the serial line.  WDT should reset every 1.6 seconds</StepDetails>

And taking the value, the <PowerFail /> tag is missing from the string:

Set input voltage to 2.80V WDT should allow CPU power.  CPU should detect PowerFail signal and output a tag to the serial line.  WDT should reset every 1.6 seconds

I get the same thing if I do a .ToString()

Procedure: If you paste the following into LinqPad as C# Statements, you can see what I mean. The XML tag <PowerFail /> disappears. I noticed it also disappears in here unless I place back ticks around it. I've included the LinqPad tag because that's how I'm parsing these files (there are tens of thousands of log files going back years) using a series of LinqPad scripts to process the logs into MySQL and insert them to create the new database.

My question: I realize I can get the string out with a regex or a substring, but it seems like I should be able to get the whole string, tags and all, from the IEnumerable. How do I do that? I'm also curious why the tag is swallowed, just for my edification.

I have roughly three dozen variants of these types of log anomalies affecting the tens of thousands of logs (last one I fixed yesterday applied to 1500+ logs alone) across seven years or so of data, so I'd like to find a (more) generic solution instead of an XML tag specific regex, substring or whatever for each of them. I can't change the logs, and I don't want to lose data while transferring to the new database.

To view the problem firsthand: cut and paste the following into LinqPad as C# Statements. (Is there an online way to do this, similar to JSFiddle for JavaScript?) I've added a regex solution at the bottom in case someone comes looking for something like that, but I'm still interested in a better way to do it.

string xml = @"<StepResults>
<TestStep Name='2.8V OPERATION' Result='Pass'>
    <OperatorComment/>
    <StepDetails>Set input voltage to 2.80V WDT should allow CPU power.  CPU should detect PowerFail signal and output a<PowerFail/> tag to the serial line.  WDT should reset every 1.6 seconds</StepDetails>
    <Measurements NumberOfMeasurements='1'>
        <Measurement Name='BATTERY VOLTAGE: VOLTS'>
            <MeasuredValue>2.794608</MeasuredValue>
            <Min>2.785000</Min>
            <Max>2.800000</Max>
        </Measurement>
    </Measurements>
</TestStep>
</StepResults>";
var xd = XDocument.Parse(xml);
Console.WriteLine(xd);

var xe = 
    from e in xd.Descendants("StepDetails")
    select e;
Console.WriteLine(xe);
Console.WriteLine(xe.First().Value);

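// New code below to show a working regex solution.
// ToString() returns the element's outer XML, so strip the opening and closing <StepDetails> tags.

string stepDetail = xe.First().ToString();
Regex matchFrontTag = new Regex("^<[^>]*>");   // the opening tag at the start of the string
Regex matchRearTag = new Regex("<[^>]*>$");    // the closing tag at the end of the string

stepDetail = matchFrontTag.Replace(stepDetail, string.Empty);
stepDetail = matchRearTag.Replace(stepDetail, string.Empty);

Console.WriteLine(stepDetail);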
Frank van Puffelen
delliottg

1 Answer


As the MSDN documentation for XElement.Value says:

Gets or sets the concatenated text contents of this element.

So XElement.Value returns only text nodes. With mixed content it skips the element markup itself (such as the empty <PowerFail /> element), but it still includes any text nodes contained inside child elements.
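To make that behaviour concrete, here is a minimal sketch written as a standalone C# program rather than LinqPad statements. The class name ValueDemo and the helper TextOf are made up for illustration, and the XML fragments are shortened versions of the one in the question.

```csharp
using System;
using System.Xml.Linq;

public static class ValueDemo
{
    // Hypothetical helper: parses a fragment and returns XElement.Value,
    // i.e. the concatenated text nodes without any element markup.
    public static string TextOf(string xml) => XElement.Parse(xml).Value;

    public static void Main()
    {
        // The empty <PowerFail/> element contains no text, so nothing of it survives.
        Console.WriteLine(TextOf("<StepDetails>output a<PowerFail/> tag</StepDetails>"));
        // prints "output a tag"

        // A child element's text is kept; only its tags are dropped.
        Console.WriteLine(TextOf("<StepDetails>output a <b>bold</b> tag</StepDetails>"));
        // prints "output a bold tag"
    }
}
```

So the tag is not so much swallowed as never part of the value in the first place: Value only ever reports text.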

You're looking for the inner XML of the XElement, which you can get using an XmlReader.

// this writes only the (concatenated) text nodes
Console.WriteLine(xe.First().Value);

// this writes the inner XML, including elements
var reader = xe.First().CreateReader();
reader.MoveToContent();
Console.WriteLine(reader.ReadInnerXml());

If you'd prefer to stay in LINQ, you can simply join the string representation of all child nodes:

Console.WriteLine(
  xe.First().Nodes().Aggregate("", (result, node) => result + node.ToString())
);

Or

string.Join("", xe.First().Nodes().Select(n => n.ToString())).Dump();

But as the linked question says, these are a lot slower than using a reader.
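Since the question asks for something generic across many log variants, the reader approach could be wrapped in an extension method. This is only a sketch under assumptions: the name InnerXml and the class XElementExtensions are hypothetical (LINQ to XML does not ship such a method), and it is written as a standalone program, so in LinqPad it would go into a "C# Program" query rather than "C# Statements".

```csharp
using System;
using System.Xml.Linq;

public static class XElementExtensions
{
    // Hypothetical helper, not part of LINQ to XML: returns everything between
    // the element's start and end tags, with child element markup included.
    public static string InnerXml(this XElement element)
    {
        using (var reader = element.CreateReader())
        {
            reader.MoveToContent();        // position the reader on the element itself
            return reader.ReadInnerXml();  // serialize its children as a string
        }
    }

    public static void Main()
    {
        var doc = XDocument.Parse(
            "<StepResults><StepDetails>output a<PowerFail/> tag</StepDetails></StepResults>");

        // Works for any embedded element, not just <PowerFail/>.
        foreach (var step in doc.Descendants("StepDetails"))
            Console.WriteLine(step.InnerXml());
    }
}
```

Because the helper does not mention any particular tag name, the same call should cover other embedded elements in the logs without a per-tag regex. The exact whitespace inside a re-serialized empty tag may differ from the original file.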

Frank van Puffelen
  • Thanks for the help Frank, I'm using your first LINQ one-liner for the time being. If that proves to be a time hit I'll investigate the reader instead. – delliottg Aug 12 '14 at 19:41