I am working on a school project where I have to parse an XML file whose complexity can vary. All I know up front are the elements and attributes I am interested in. These attributes may not always be present, so null checking is a must. From my research, most people recommend deserializing a complex XML file into predefined classes. Below are two examples of the XML file, along with the elements and attributes of interest. What I am looking for is an example of how to parse this file, extract the attribute values, edit them, and write them back to the same file. I have also included the code I have so far.

Example XML file (1) :

SampleXML1 - Simple XML File to be parsed

Example XML file (2) :

SampleXML2 - Not so simple XML file to be parsed

The interesting elements are any elements which have attributes like :

  • w:rsidR
  • w:rsidRDefault
  • w:rsidP
  • w:rsidRPr
  • w:rsidTr

I currently have the following method, which parses the files, reads the attribute values, and lets me edit them in memory. However, I can't figure out how best to write the edited values back to the attributes in the file. Hence my research and subsequent question about XML serialization. I truly appreciate your help and input, as always. Thank you!

My Code Snippet :

public static void shaqfu(string strMsg)
    {
        string strFile = @"C:\SourceFolder\SampleXML\document-test.xml";
        //string strFile = @"C:\SourceFolder\SampleXML\document (2).xml";
        //string strFile = @"C:\SourceFolder\SampleXML\document (3).xml";

        int index = 0;
        var i = 0;
        using (XmlReader reader = XmlReader.Create(strFile))
        {
            while (reader.Read())
            {
                if (reader.IsStartElement())
                {
                    List<string> rlist = new List<string>();

                    switch (reader.Name)
                    {
                        case "w:p":

                            string wp_rsidRAttrib = reader.GetAttribute("w:rsidR");
                            string wp_rsidRDefaultAttrib = reader.GetAttribute("w:rsidRDefault");
                            string wp_rsidPAttrib = reader.GetAttribute("w:rsidP");
                            string wp_rsidRPrAttrib = reader.GetAttribute("w:rsidRPr");
                            string wp_rsidTrAttrib = reader.GetAttribute("w:rsidTr");

                            if (wp_rsidRAttrib != null)
                            {
                                rlist.Add(wp_rsidRAttrib);
                                index++;
                            }
                            if (wp_rsidRPrAttrib != null)
                            {
                                rlist.Add(wp_rsidRPrAttrib);
                                index++;
                            }
                            if (wp_rsidRDefaultAttrib != null)
                            {
                                rlist.Add(wp_rsidRDefaultAttrib);
                                index++;
                            }
                            if (wp_rsidPAttrib != null)
                            {
                                rlist.Add(wp_rsidPAttrib);
                                index++;
                            }
                            if (wp_rsidTrAttrib != null)
                            {
                                rlist.Add(wp_rsidTrAttrib);
                                index++;
                            }
                            break;

                        case "w:r":

                            string wr_rsidRAttrib = reader.GetAttribute("w:rsidR");
                            string wr_rsidRDefaultAttrib = reader.GetAttribute("w:rsidRDefault");
                            string wr_rsidPAttrib = reader.GetAttribute("w:rsidP");
                            string wr_rsidRPrAttrib = reader.GetAttribute("w:rsidRPr");
                            string wr_rsidTrAttrib = reader.GetAttribute("w:rsidTr");

                            if (wr_rsidRAttrib != null)
                            {
                                rlist.Add(wr_rsidRAttrib);
                                index++;
                            }
                            if (wr_rsidRPrAttrib != null)
                            {
                                rlist.Add(wr_rsidRPrAttrib);
                                index++;
                            }
                            if (wr_rsidRDefaultAttrib != null)
                            {
                                rlist.Add(wr_rsidRDefaultAttrib);
                                index++;
                            }
                            if (wr_rsidPAttrib != null)
                            {
                                rlist.Add(wr_rsidPAttrib);
                                index++;
                            }
                            if (wr_rsidTrAttrib != null)
                            {
                                rlist.Add(wr_rsidTrAttrib);
                                index++;
                            }
                            break;

                        case "w:tr":

                            string wtr_rsidRAttrib = reader.GetAttribute("w:rsidR");
                            string wtr_rsidRDefaultAttrib = reader.GetAttribute("w:rsidRDefault");
                            string wtr_rsidPAttrib = reader.GetAttribute("w:rsidP");
                            string wtr_rsidRPrAttrib = reader.GetAttribute("w:rsidRPr");
                            string wtr_rsidTrAttrib = reader.GetAttribute("w:rsidTr");

                            if (wtr_rsidRAttrib != null)
                            {
                                rlist.Add(wtr_rsidRAttrib);
                                index++;
                            }
                            if (wtr_rsidRPrAttrib != null)
                            {
                                rlist.Add(wtr_rsidRPrAttrib);
                                index++;
                            }
                            if (wtr_rsidRDefaultAttrib != null)
                            {
                                rlist.Add(wtr_rsidRDefaultAttrib);
                                index++;
                            }
                            if (wtr_rsidPAttrib != null)
                            {
                                rlist.Add(wtr_rsidPAttrib);
                                index++;
                            }
                            if (wtr_rsidTrAttrib != null)
                            {
                                rlist.Add(wtr_rsidTrAttrib);
                                index++;
                            }
                            break;

                        case "w:sectPr":

                            string wsPr_rsidRAttrib = reader.GetAttribute("w:rsidR");
                            string wsPr_rsidRDefaultAttrib = reader.GetAttribute("w:rsidRDefault");
                            string wsPr_rsidPAttrib = reader.GetAttribute("w:rsidP");
                            string wsPr_rsidRPrAttrib = reader.GetAttribute("w:rsidRPr");
                            string wsPr_rsidTrAttrib = reader.GetAttribute("w:rsidTr");

                            if (wsPr_rsidRAttrib != null)
                            {
                                rlist.Add(wsPr_rsidRAttrib);
                                index++;
                            }
                            if (wsPr_rsidRPrAttrib != null)
                            {
                                rlist.Add(wsPr_rsidRPrAttrib);
                                index++;
                            }
                            if (wsPr_rsidRDefaultAttrib != null)
                            {
                                rlist.Add(wsPr_rsidRDefaultAttrib);
                                index++;
                            }
                            if (wsPr_rsidPAttrib != null)
                            {
                                rlist.Add(wsPr_rsidPAttrib);
                                index++;
                            }
                            if (wsPr_rsidTrAttrib != null)
                            {
                                rlist.Add(wsPr_rsidTrAttrib);
                                index++;
                            }
                            break;
                    }

                    foreach (string r in rlist)
                    {
                        var rValCharArray = r.ToCharArray();
                        for (var x = 2; x < rValCharArray.Length && i < strMsg.Length; x++) rValCharArray[x] = strMsg[i++];
                        Console.WriteLine(rValCharArray);
                    }
                }
            }
        }

        Console.WriteLine("Number of rsids found : {0}",index);
    }

Example XML File (1) - Actual Text

<?xml version="1.0" encoding="UTF-16" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
    <w:p w14:paraId="2CBBB1B4" w14:textId="77777777" w:rsidR="00D9548A" w:rsidRDefault="00D9548A" w:rsidP="00ED7A0B"></w:p>
    <w:p w14:paraId="2CBBB1B5" w14:textId="77777777" w:rsidR="00D9548A" w:rsidRPr="00ED77B9" w:rsidRDefault="00C706DD" w:rsidP="00D9548A"></w:p>
    <w:pPr>
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"></w:rFonts>
            <w:b></w:b>
            <w:sz w:val="40"></w:sz>
            <w:szCs w:val="40"></w:szCs>
        </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00EC456F"></w:r>
    <w:tr w:rsidR="0029258E" w14:paraId="2CBBB242" w14:textId="77777777" w:rsidTr="0029258E"></w:tr>
</w:body>
</w:document>

Example XML file (2) - Actual Text :

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
    <w:p w:rsidR="00661DE2" w:rsidRDefault="00B31FC7">
        <w:r>
            <w:t>This is a single editing session. 9:49AM</w:t>
        </w:r>
        <w:r w:rsidR="00251096">
            <w:t xml:space="preserve"> – adding more content to the first line 10:46AM</w:t>
        </w:r>
        <w:r w:rsidR="00A06ADC">
            <w:t xml:space="preserve"> – adding some more content to the original sentence. 10:49AM</w:t>
        </w:r>
        <w:bookmarkStart w:id="0" w:name="_GoBack"></w:bookmarkStart>
        <w:bookmarkEnd w:id="0"></w:bookmarkEnd>
    </w:p>
    <w:p w:rsidR="00481AA7" w:rsidRDefault="00481AA7">
        <w:r>
            <w:t>This is a second editing session. 9:56AM</w:t>
        </w:r>
    </w:p>
    <w:p w:rsidR="005C6856" w:rsidRDefault="005C6856">
        <w:r>
            <w:t>This is a third editing session. 9:58AM</w:t>
        </w:r>
    </w:p>
    <w:sectPr w:rsidR="005C6856">
        <w:pgSz w:w="12240" w:h="15840"></w:pgSz>
        <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"></w:pgMar>
        <w:cols w:space="720"></w:cols>
        <w:docGrid w:linePitch="360"></w:docGrid>
    </w:sectPr>
</w:body>
</w:document>

  • If your file is not all too big, you can still use ``System.Xml.XmlDocument`` to load the whole thing and then query and update it instead of using ``XmlReader``. – BitTickler Jul 22 '16 at 17:50
  • Have a look at [this](http://stackoverflow.com/questions/670563/linq-to-read-xml). Parsing XML-Files by hand is so yesterday ;) – lokusking Jul 22 '16 at 21:56
  • Your file is small and you don't need to serialize to change values. I wouldn't use your approach with XmlReader because it doesn't allow modification to values. I would use Xml Linq. If you post the xml text instead of the png images I will help. – jdweng Jul 23 '16 at 00:04
  • @jdweng - Thank you for the note about XmlReader not allowing modification. I simply wasn't aware... I will post the XML text of an example file. But, these files can get larger. I only posted two examples. – Gabriel Alicea Jul 25 '16 at 15:04
  • @jdweng - Post modified and actual text added for the XML files. I appreciate your help. Whatever code you produce, could you annotate? I want to learn what is being done as well so that I can get better. – Gabriel Alicea Jul 25 '16 at 15:15
  • @BitTickler - thanks for the comment. However, I have some XML files that are > 25MB and larger. I just posted the above as an example of smaller files to illustrate the problem I am working to solve... Nonetheless, as others commented, I wasn't aware that I couldn't use XmlReader to change values. I reckon I should've discerned that from the name alone. SMH... Newb here... – Gabriel Alicea Jul 25 '16 at 15:18
  • @lokusking Yeah, I've read the same thing in several posts. I need to spend time with Linq. I guess it feels/looks a little more abstract to me. Procedural code reads easier to me, even though it is antiquated. Thanks for your tip. I am looking forward to a code implementation of your suggestion. – Gabriel Alicea Jul 25 '16 at 15:20

1 Answer

Try the code below. I used a combination of XmlReader and XML Linq. You need XmlReader because of the large file size. In the XML Linq parts I used more than one technique, to show several ways of parsing XML.

I ran into a few issues that took time to resolve:

  1. You had encoding="UTF-16" in the XML declaration, so I used a StreamReader and skipped that first line to get XmlReader to work.
  2. You had lots of different namespaces, so I ignored all of them and used Name.LocalName, which is the tag (or attribute) name without the namespace.
  3. Your attributes have namespace prefixes, which attributes normally don't.
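
As an aside on point 2, before the full listing: the namespaces do not have to be ignored, because LINQ to XML can match qualified names through an ``XNamespace``. The following is a minimal sketch, assuming the ``w`` prefix maps to the WordprocessingML URI shown in the sample files. The class name and the inline fragment are made up for illustration.

```csharp
using System;
using System.Xml.Linq;

class NamespaceLookupSketch
{
    // URI copied from the xmlns:w declaration in the sample files.
    static readonly XNamespace W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static void Main()
    {
        // Small inline fragment standing in for the real document (illustration only).
        string xml =
            "<w:body xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
            "<w:p w:rsidR=\"00661DE2\" w:rsidRDefault=\"00B31FC7\" />" +
            "<w:sectPr w:rsidR=\"005C6856\" />" +
            "</w:body>";

        XElement body = XElement.Parse(xml);

        foreach (XElement p in body.Elements(W + "p"))
        {
            // Casting a missing XAttribute to string yields null instead of throwing.
            string rsidR = (string)p.Attribute(W + "rsidR");
            string rsidP = (string)p.Attribute(W + "rsidP");
            Console.WriteLine("rsidR={0}, rsidP={1}", rsidR, rsidP ?? "(none)");
        }
        // prints: rsidR=00661DE2, rsidP=(none)
    }
}
```

Matching on the full name avoids accidentally picking up a same-named attribute from another namespace, at the cost of having to know the URI in advance.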

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.IO;

namespace ConsoleApplication4
{
    class Program
    {
        const string FILENAME = @"c:\temp\test2.xml";
        static void Main(string[] args)
        {
            StreamReader sReader = new StreamReader(FILENAME);
            // skip the first line (the XML declaration), which may declare encoding="UTF-16"
            sReader.ReadLine();

            XmlReader reader = XmlReader.Create(sReader);
            reader.Read();

            string ns = reader.NamespaceURI;
            reader.ReadToFollowing("body", ns);
            reader.ReadStartElement("body", ns);
            string name = "";

            while (!reader.EOF && (reader.NodeType != XmlNodeType.EndElement))
            {
                if (reader.Name == "") reader.Read();
                name = reader.Name;

                if (!reader.EOF && (reader.NodeType != XmlNodeType.EndElement))
                {
                    XElement node = (XElement)XElement.ReadFrom(reader);
                    switch (node.Name.LocalName)
                    {
                        case "p":
                            string paraId = (string)node.Attributes().Where(x => x.Name.LocalName == "paraId").FirstOrDefault();
                            string textId = (string)node.Attributes().Where(x => x.Name.LocalName == "textId").FirstOrDefault();
                            string rsidR = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidR").FirstOrDefault();
                            string rsidRDefault = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidRDefault").FirstOrDefault();
                            string rsidP = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidP").FirstOrDefault();

                            var rS = node.Descendants().Where(x => x.Name.LocalName == "r").Select(x => new {
                                rsidR = (string)x.Attributes().Where(y => y.Name.LocalName == "rsidR").FirstOrDefault(),
                                rsidRDefault = (string)x.Attributes().Where(y => y.Name.LocalName == "rsidRDefault").FirstOrDefault(),
                                t = (string)x.Descendants().Where(y => y.Name.LocalName == "t").FirstOrDefault()
                            }).ToList();

                            // (int?) yields null when the attribute is missing; (int) would throw ArgumentNullException
                            var bookMarkStart = node.Descendants().Where(x => x.Name.LocalName == "bookmarkStart").Select(x => new
                            {
                                id = (int?)x.Attributes().Where(y => y.Name.LocalName == "id").FirstOrDefault(),
                                name = (string)x.Attributes().Where(y => y.Name.LocalName == "name").FirstOrDefault()
                            }).FirstOrDefault();

                            var bookMarkEnd = node.Descendants().Where(x => x.Name.LocalName == "bookmarkEnd").Select(x => new
                            {
                                id = (int?)x.Attributes().Where(y => y.Name.LocalName == "id").FirstOrDefault(),
                                name = (string)x.Attributes().Where(y => y.Name.LocalName == "name").FirstOrDefault()
                            }).FirstOrDefault();

                            break;

                        case "pPr":
                            XElement rFonts = node.Descendants().Where(x => x.Name.LocalName == "rFonts").FirstOrDefault();
                            if (rFonts != null)
                            {
                                string ascii = (string)rFonts.Attributes().Where(x => x.Name.LocalName == "ascii").FirstOrDefault();
                                string hAnsi = (string)rFonts.Attributes().Where(x => x.Name.LocalName == "hAnsi").FirstOrDefault();
                                string cs = (string)rFonts.Attributes().Where(x => x.Name.LocalName == "cs").FirstOrDefault();

                            }
                            XElement b = node.Descendants().Where(x => x.Name.LocalName == "b").FirstOrDefault();
                            if (b != null)
                            {
                                string bVal = (string)b.Attributes().Where(x => x.Name.LocalName == "val").FirstOrDefault();

                            }
                            XElement sz = node.Descendants().Where(x => x.Name.LocalName == "sz").FirstOrDefault();
                            if (sz != null)
                            {
                                // (int?) returns null if "val" is absent instead of throwing
                                int? szVal = (int?)sz.Attributes().Where(x => x.Name.LocalName == "val").FirstOrDefault();
                            }
                            XElement szCs = node.Descendants().Where(x => x.Name.LocalName == "szCs").FirstOrDefault();
                            if (szCs != null)
                            {
                                int? szCsVal = (int?)szCs.Attributes().Where(x => x.Name.LocalName == "val").FirstOrDefault();
                            }
                            break;

                        case "r":
                            string rsidRPr = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidRPr").FirstOrDefault();
                            break;

                        case "sectPr" :
                            string sectRsidR = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidR").FirstOrDefault();
                            // nullable casts: a missing attribute gives null rather than an ArgumentNullException
                            var pgSz = node.Descendants().Where(x => x.Name.LocalName == "pgSz").Select(x => new
                            {
                                w = (int?)x.Attributes().Where(y => y.Name.LocalName == "w").FirstOrDefault(),
                                h = (int?)x.Attributes().Where(y => y.Name.LocalName == "h").FirstOrDefault()
                            }).FirstOrDefault();

                            var pgMar = node.Descendants().Where(x => x.Name.LocalName == "pgMar").Select(x => new
                            {
                                top = (int?)x.Attributes().Where(y => y.Name.LocalName == "top").FirstOrDefault(),
                                right = (int?)x.Attributes().Where(y => y.Name.LocalName == "right").FirstOrDefault(),
                                bottom = (int?)x.Attributes().Where(y => y.Name.LocalName == "bottom").FirstOrDefault(),
                                left = (int?)x.Attributes().Where(y => y.Name.LocalName == "left").FirstOrDefault(),
                                header = (int?)x.Attributes().Where(y => y.Name.LocalName == "header").FirstOrDefault(),
                                footer = (int?)x.Attributes().Where(y => y.Name.LocalName == "footer").FirstOrDefault(),
                                gutter = (int?)x.Attributes().Where(y => y.Name.LocalName == "gutter").FirstOrDefault()
                            }).FirstOrDefault();

                            var cols = node.Descendants().Where(x => x.Name.LocalName == "cols").Select(x => new
                            {
                                space = (int?)x.Attributes().Where(y => y.Name.LocalName == "space").FirstOrDefault()
                            }).FirstOrDefault();

                            var docGrid = node.Descendants().Where(x => x.Name.LocalName == "docGrid").Select(x => new
                            {
                                linePitch = (int?)x.Attributes().Where(y => y.Name.LocalName == "linePitch").FirstOrDefault()
                            }).FirstOrDefault();

                            break;

                        case "tr":
                            string trRsidR = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidR").FirstOrDefault();
                            string trParaId = (string)node.Attributes().Where(x => x.Name.LocalName == "paraId").FirstOrDefault();
                            string trTextId = (string)node.Attributes().Where(x => x.Name.LocalName == "textId").FirstOrDefault();
                            string tRrsidTr = (string)node.Attributes().Where(x => x.Name.LocalName == "rsidTr").FirstOrDefault();
                            break;

                        default:
                            //should not get here
                            break;
                    }
                }
            }
        }
    }
}
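
The listing above only reads values. One possible way to also write edits back while still streaming is to pair the ``XmlReader`` with an ``XmlWriter`` and copy node by node, changing the rsid attribute values on the way through. This is a rough sketch rather than a tested solution for real Word files: the class name, the ``CopyWithEdits`` helper, the ``edit`` callback, and the replacement value ``00ABCDEF`` are all hypothetical, and the inline XML stands in for the file.

```csharp
using System;
using System.IO;
using System.Xml;

class StreamingRsidRewriteSketch
{
    // URI copied from the xmlns:w declaration in the sample files.
    const string WordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static readonly string[] RsidNames = { "rsidR", "rsidRDefault", "rsidP", "rsidRPr", "rsidTr" };

    // Copies every node from reader to writer, passing rsid attribute values through edit.
    static void CopyWithEdits(XmlReader reader, XmlWriter writer, Func<string, string> edit)
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    bool isEmpty = reader.IsEmptyElement; // capture before moving onto attributes
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    while (reader.MoveToNextAttribute())
                    {
                        bool isRsid = reader.NamespaceURI == WordNs
                                      && Array.IndexOf(RsidNames, reader.LocalName) >= 0;
                        string value = isRsid ? edit(reader.Value) : reader.Value;
                        writer.WriteAttributeString(reader.Prefix, reader.LocalName, reader.NamespaceURI, value);
                    }
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.EndElement:
                    writer.WriteFullEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    writer.WriteProcessingInstruction(reader.Name, reader.Value);
                    break;
                // The XML declaration is skipped on purpose; the writer emits its own.
            }
        }
    }

    static void Main()
    {
        // Inline stand-in for the document (illustration only).
        string input =
            "<w:document xmlns:w=\"" + WordNs + "\"><w:body>" +
            "<w:p w:rsidR=\"00661DE2\" w:rsidRDefault=\"00B31FC7\"><w:r><w:t>Hello</w:t></w:r></w:p>" +
            "<w:sectPr w:rsidR=\"005C6856\"></w:sectPr>" +
            "</w:body></w:document>";

        var output = new StringWriter();
        using (XmlReader reader = XmlReader.Create(new StringReader(input)))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            // Hypothetical edit: replace every rsid value with a fixed placeholder.
            CopyWithEdits(reader, writer, old => "00ABCDEF");
        }
        Console.WriteLine(output.ToString());
    }
}
```

With real files, the reader and writer would be created over two different paths (for example, a temporary output file that replaces the original afterwards), since the input should not be overwritten while it is still being read.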
jdweng
  • Thank you for taking the time to help me. I will test this code shortly. I really have to learn Linq and lambda expressions. Still foggy on these. – Gabriel Alicea Jul 26 '16 at 17:10
  • @GabrielAlicea Fun story for you: When I first ran across Linq I wondered what the buzz was, then dug deeper and ended up learning functional programming. Now I hardly ever touch C# and have migrated to F# as my main programming language. From that perspective, Linq starts to look a bit like a hack. – BitTickler Jul 26 '16 at 19:22
  • @BitTickler Thanks for the anecdote! I'm afraid the notion of functional programming is still lost on me... from what I've read it seems to be very abstract and at a much higher level than what I'm ready for. I'm really a beginner. But I do like code, and I enjoy seeing it work after I've struggled with it for a while. For the life of me, though, I couldn't figure out an elegant way to parse these XML files for the purpose of altering them... – Gabriel Alicea Jul 26 '16 at 20:51
  • @GabrielAlicea I hinted it above. Do it the straightforward way. Use ``XmlDocument``, load it and modify it. Even if you had 100MB documents - on a contemporary PC this is not a problem. – BitTickler Jul 26 '16 at 22:03
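
For reference, the ``XmlDocument`` route suggested in the last comment might look roughly like the sketch below. It assumes the whole document fits in memory. The class name, the inline XML, and the replacement value ``00ABCDEF`` are placeholders for illustration, and the ``Save`` call is left as a comment because no output path is given in the question.

```csharp
using System;
using System.Linq;
using System.Xml;

class XmlDocumentEditSketch
{
    static void Main()
    {
        // URI copied from the xmlns:w declaration in the sample files.
        const string wordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        var doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep the original layout when saving
        // With a real file this would be doc.Load(path); inline XML is used for illustration.
        doc.LoadXml(
            "<w:document xmlns:w=\"" + wordNs + "\"><w:body>" +
            "<w:p w:rsidR=\"00661DE2\" w:rsidRDefault=\"00B31FC7\" />" +
            "<w:tr w:rsidR=\"0029258E\" w:rsidTr=\"0029258E\" />" +
            "</w:body></w:document>");

        // XPath needs the prefix registered explicitly.
        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("w", wordNs);

        // Select every rsid attribute anywhere in the document; copy to a list before editing.
        var rsidAttributes = doc.SelectNodes(
                "//@w:rsidR | //@w:rsidRDefault | //@w:rsidP | //@w:rsidRPr | //@w:rsidTr", nsmgr)
            .Cast<XmlNode>()
            .ToList();

        foreach (XmlNode attribute in rsidAttributes)
        {
            attribute.Value = "00ABCDEF"; // hypothetical replacement value
        }

        Console.WriteLine("Edited {0} attributes", rsidAttributes.Count);
        // prints: Edited 4 attributes
        Console.WriteLine(doc.OuterXml);

        // doc.Save(path); // would write the edited document back to disk
    }
}
```

Attributes that are absent simply do not appear in the selection, so no separate null check is needed in this variant.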