Dealing with awkward XML layout in C# using XmlTextReader

Question

I have an XML document that I'm trying to import using XmlTextReader in C#. My code works well except for one case: when the opening tag is not on the same line as the actual text content, for example with product_name:

    <product> 
        <sku>27939</sku> 
        <product_name>
            Sof-Therm Warm-Up Jacket
        </product_name> 
        <supplier_number>ALNN1064</supplier_number> 
    </product>

My code to parse the XML document is as follows:

    while (reader.Read())
    {
        switch (reader.Name)
        {
            case "sku":
                newEle = new XMLElement();
                newEle.SKU = reader.ReadString();
                break;
            case "product_name":
                newEle.ProductName = reader.ReadString();
                break;
            case "supplier_number":
                newEle.SupplierNumber = reader.ReadString();
                products.Add(newEle);
                break;
        }
    }

I have tried almost everything I found in the XmlTextReader documentation:

reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();

and a couple of others that made less sense, but none of them seems to deal with this issue consistently. Obviously I could fix this one case, but then it would break the regular cases. So my question is: after I find the "product_name" tag, is there a way to move to the next line that contains text and extract it?

I should have mentioned, I am outputting it to an HTML table after and the element is coming up blank so I'm fairly certain it is not reading it correctly.

Thanks in advance!

Are you sure you are not getting the correct result with newlines embedded? It could appear empty if you are displaying it somewhere in a UI. — ghord, May 09 '13 at 13:41
Sorry, do you mean that even if it displays empty in the ui it could still be the right value? I have it filling to a table later and it's blank, probably should have mentioned that — noneabove, May 09 '13 at 13:46
It is possible. If it starts with a newline character ('\n'), it could be shown as empty in a table in some UIs. — ghord, May 09 '13 at 13:47

score 2 · Accepted Answer · answered May 09 '13 at 13:40

I think you will find Linq To Xml easier to use

var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);

int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");

You can also convert your xml to dictionary

var dict = xDoc.Root.Elements()
           .ToDictionary(e => e.Name.LocalName, e => (string)e);

Console.WriteLine(dict["sku"]);
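The comments below point out that Linq to Xml still returns the surrounding newlines and indentation, so the value needs a `Trim()` after it is read. A minimal sketch of that, assuming the products sit under a `<products>` root element (the question does not show the root) and using a hypothetical `Product` class and `ProductXml` helper that are not part of the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container type, named here only for illustration.
public class Product
{
    public string Sku { get; set; }
    public string Name { get; set; }
    public string Supplier { get; set; }
}

public static class ProductXml
{
    // Reads a child element's text and trims it; a missing element yields "".
    private static string Text(XElement parent, string name)
    {
        return ((string)parent.Element(name) ?? string.Empty).Trim();
    }

    // Builds one Product per <product> element found anywhere in the document.
    public static List<Product> Parse(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("product")
            .Select(p => new Product
            {
                Sku = Text(p, "sku"),
                Name = Text(p, "product_name"),
                Supplier = Text(p, "supplier_number")
            })
            .ToList();
    }

    public static void Main()
    {
        // Assumed sample input: the <products> wrapper is a guess at the real root.
        string xml = @"<products>
  <product>
    <sku>27939</sku>
    <product_name>
      Sof-Therm Warm-Up Jacket
    </product_name>
    <supplier_number>ALNN1064</supplier_number>
  </product>
</products>";

        foreach (Product p in Parse(xml))
        {
            // Brackets make any leftover whitespace visible.
            Console.WriteLine("[" + p.Sku + "] [" + p.Name + "] [" + p.Supplier + "]");
        }
    }
}
```

Trimming after parsing keeps the XML parser in charge of the markup and only touches the extracted text.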

answered May 09 '13 at 13:40

I4V


I'll have a look at how to set up Linq and how easily I can learn it. I'd rather use XmlTextReader, but that's definitely an option I will consider, thanks – noneabove May 09 '13 at 13:49
@noneabove No need to set up Linq. It is already available in .NET framework since `3.5`. Just add `using System.Xml.Linq` – I4V May 09 '13 at 13:50

Ryan · Answer 2 · 2013-05-09T16:53:51.273

1

It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have

    <!-- 1. Original example -->
    <product_name>
        Sof-Therm Warm-Up Jacket
    </product_name>

    <!-- 2. It should probably be. If possible correct the XML generator. -->
    <product_name>Sof-Therm Warm-Up Jacket</product_name>

    <!-- 3a. If white space is important, then preserve it -->
    <product_name xml:space='preserve'>
        Sof-Therm Warm-Up Jacket
    </product_name>

    <!-- 3b. If white space is important, use CDATA -->
    <product_name><![CDATA[
        Sof-Therm Warm-Up Jacket
    ]]></product_name>

The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still included the returns and indentation:

reader.WhitespaceHandling = WhitespaceHandling.None;
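A likely explanation, as far as I understand the setting: `WhitespaceHandling.None` only suppresses nodes that consist entirely of white space (the indentation between tags). White space that shares a text node with real characters is part of the text and is returned as-is. The sketch below, in which `ReaderDemo` and `ReadProductName` are hypothetical names used only for illustration, keeps the reader-based approach and trims the value where it is read:

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReaderDemo
{
    // Returns the trimmed text of the first <product_name> element, or null if none exists.
    public static string ReadProductName(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            // Drops whitespace-only nodes; text nodes keep their own padding.
            reader.WhitespaceHandling = WhitespaceHandling.None;

            while (reader.Read())
            {
                // Checking NodeType avoids reacting to anything but start tags.
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "product_name")
                {
                    // Trim() removes the newline and indentation around the text.
                    return reader.ReadString().Trim();
                }
            }
        }
        return null;
    }

    public static void Main()
    {
        string xml = "<product>\n  <product_name>\n    Sof-Therm Warm-Up Jacket\n  </product_name>\n</product>";
        Console.WriteLine("[" + ReadProductName(xml) + "]");
    }
}
```

Because `Trim()` leaves already clean values unchanged, the same call is safe for the regular single-line elements too.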

An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:

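// Requires: using System.Text.RegularExpressions;
string TrimCrLf(string value)
{
    // Strip CR, LF, tab, and space from both ends of the value
    return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}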

    // Then in your loop...
    case "product_name":
       // Trim the contents of the 'product_name' element to remove extra returns
       newEle.ProductName = TrimCrLf(reader.ReadString());
       break;

You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:

public static class StringExtensions
{
    public static string TrimCrLf(this string value)
    {
        return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
    }
}

// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
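For comparison, the built-in `string.Trim` overload that takes a list of characters should give the same result as the regex for this kind of input, without needing `System.Text.RegularExpressions`. A small sketch that puts the two side by side; the `TrimComparison` class and `TrimBuiltIn` name are made up for this illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrimComparison
{
    // The regex-based helper from the answer above.
    public static string TrimCrLf(string value)
    {
        return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
    }

    // Same character set, using string.Trim(params char[]).
    public static string TrimBuiltIn(string value)
    {
        return value.Trim('\r', '\n', '\t', ' ');
    }

    public static void Main()
    {
        string raw = "\r\n\t    Sof-Therm Warm-Up Jacket\r\n    ";

        // Brackets make any leftover whitespace visible.
        Console.WriteLine("[" + TrimCrLf(raw) + "]");
        Console.WriteLine("[" + TrimBuiltIn(raw) + "]");
        Console.WriteLine(TrimCrLf(raw) == TrimBuiltIn(raw));
    }
}
```

Calling `Trim()` with no arguments is broader still: it removes every leading and trailing white-space character, which is usually what is wanted for element text.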

Regular expression explanation:

^ = Beginning of field
$ = End of field
[]+= Match 1 or more of any of the contained characters
\r = carriage return (0x0D / 13)
\n = line feed (0x0A / 10)
\t = tab (0x09 / 9)
' '= space (0x20 / 32)

edited May 09 '13 at 16:53

answered May 09 '13 at 14:18

Ryan


I don't think using regex would be a good idea when real xml parsers exist. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – I4V May 09 '13 at 14:24
@I4V the issue is the text inside an xml element has extra white space. I am not suggesting to use regex to parse xml, just to trim the text of an element. I also tested the Linq to Xml version, and it still returned the extra white space. I like Linq to Xml better myself, but the problem here (if I understand correctly) is the white space. – Ryan May 09 '13 at 14:28
Ryan, first use Linq2Xml, then you can `Trim` the value. Not before parsing the xml. – I4V May 09 '13 at 14:30
Problem is (as far as I know) XML doesn't deal with platform-specific newline characters, so dealing with these to normalize the _text_ input before parsing as XML makes sense. Why would you refer to such a bizarre question/response page anyway? :) – SteveM May 09 '13 at 14:30
@SteveM the problem is the creator of the XML put extra characters (CR/LF/tab/space/etc) in an element. The best solution is to fix the XML creation. Assuming that isn't possible, then you trim the element text. You can parse the XML with XmlTextReader, XmlDocument, or XDocument as necessary. I prefer XDocument, but XmlTextReader is a light-weight solution requiring fewer resources. You would have the same issue if you put CR/LF before/after text that you saved in a database. – Ryan May 09 '13 at 14:40
Hmm...I'm not sure whitespace in the element is the problem, it certainly shouldn't be. I'm guessing it's the content of the whitespace, specifically reversing \r\n to \n\r. – SteveM May 09 '13 at 14:43
@SteveM in the example the OP gave, the text/value for product_name is on a new line and indented. I am assuming the OP just wants the text without the new lines and indentation. Most of the time, the white space before and after the text is unnecessary and causes issues. – Ryan May 09 '13 at 14:54

SteveM · Answer 3 · 2013-05-09T14:45:31.327

0

I have run into a similar problem before when dealing with text that originated on a Mac platform due to reversed \r\n in newlines. Suggest you try Ryan's regex solution, but with the following regex:

         "^[\r\n]+|[\r\n]+$"
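One thing to keep in mind with this narrower pattern: it only strips CR and LF at the very start and very end of the string, so indentation spaces are left alone, and a trailing newline that is followed by indentation is not at the end and stays as well. A quick sketch to check this against the sample value from the question; `NarrowTrimDemo` and `TrimNewlinesOnly` are hypothetical names, and the pattern is written as a verbatim string, which should behave the same as the regular string literal above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NarrowTrimDemo
{
    // Applies the CR/LF-only pattern suggested in this answer.
    public static string TrimNewlinesOnly(string value)
    {
        return Regex.Replace(value, @"^[\r\n]+|[\r\n]+$", "");
    }

    public static void Main()
    {
        // Element text as it appears in the question: newline, indentation, text, newline, indentation.
        string raw = "\n            Sof-Therm Warm-Up Jacket\n        ";
        string result = TrimNewlinesOnly(raw);

        // Brackets show what is left around the text.
        Console.WriteLine("[" + result + "]");
        Console.WriteLine(result == raw.Trim());
    }
}
```

If the goal is the bare product name, trimming spaces and tabs as well (as in Ryan's pattern, or with `string.Trim()`) is probably the safer choice.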

edited May 09 '13 at 14:45

answered May 09 '13 at 14:26

SteveM

