0

I am trying to build a data pipeline in .NET. I have been given an xsd and have used the XML Schema Definition Tool to generate C# classes that represent the object model. In order to load this data into my data store I need to transform the data coming out of the XML into structure that matches my application schema and collect/dedupe elements. To do this I have a parser class that will read a file and load the contents into local collections which will then be loaded into my database. I see two options to do this -

  1. Manually loop through the XML with an XmlReader and pull out the data I need, loading it into the local collections in stream. This is not desirable because it does not take advantage of the strongly typed/strict xsd that I was given and requires a lot of hard coding things like while (reader.Read()), check for specific XML nodes, and then `reader.GetAttribute("HardCodedString").
  2. Use XmlSerializer to deserialize the whole file at once and then loop through the deserialized collections and insert into my local collections. This is not desirable because the files could be very large and this method forces me to loop through all of the data twice (once to deserialize and once to extract the data to my local collections).

Ideally I would like some way to register a delegate to be executed as each object is deserialized to insert into my local collections. Is there something in the framework that allows me to do this? Requirements are as follows:

  1. Performant - Only loop through the data once.
  2. Functional - Data is inserted into the local collections during deserialization.
  3. Maintainable - Utilize strongly typed classes that were generated via the xsd.

I have created a minimal example to illustrate my point.

Example XML File:

<Hierarchy xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.example.com/example">
    <Children>
        <Child ChildId="1" ChildName="First">
            <Parents>
                <Parent ParentId="1" ParentName="First" RelationshipStart="1900-01-01T00:00:00"/>
                <Parent ParentId="2" ParentName="Second" RelationshipStart="2000-01-01T00:00:00"/>
            </Parents>
        </Child>
        <Child ChildId="2" ChildName="Second">
            <Parents>
                <Parent ParentId="2" ParentName="Second" RelationshipStart="1900-01-01T00:00:00"/>
                <Parent ParentId="3" ParentName="Third" RelationshipStart="2000-01-01T00:00:00"/>
            </Parents>
        </Child>
    </Children>
</Hierarchy>

Local collections I am trying to load:

public Dictionary<int, string> Parents { get; }
public Dictionary<int, string> Children { get; }
public List<Relationship> Relationships { get; }

Manual version (not maintainable and doesn't use xsd):

public void ParseFileManually(string fileName)
{
    using (var reader = XmlReader.Create(fileName))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Hierarchy")
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                    {
                        int childId = int.Parse(reader.GetAttribute("ChildId"));
                        string childName = reader.GetAttribute("ChildName");
                        Children[childId] = childName;

                        while (reader.Read())
                        {
                            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Parent")
                            {
                                int parentId = int.Parse(reader.GetAttribute("ParentId"));
                                string parentName = reader.GetAttribute("ParentName");
                                DateTime relationshipStart = DateTime.Parse(reader.GetAttribute("RelationshipStart"));
                                
                                Parents[parentId] = parentName;
                                Relationships.Add(
                                    new Relationship{
                                        ParentId = parentId,
                                        ChildId = childId,
                                        Start = relationshipStart
                                    });
                            }
                            else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "Child")
                            {
                                break;
                            }
                        }
                    }
                }
            }
        }
    }
}

Deserialize version (loops through the data twice):

public void ParseFileWithDeserialize(string fileName)
{
    var serializer = new XmlSerializer(typeof(Hierarchy));
    using (var fileStream = new FileStream(fileName, FileMode.Open))
    {
        var fileData = (Hierarchy) serializer.Deserialize(fileStream);
        foreach (var child in fileData.Children)
        {
            Children[child.ChildId] = child.ChildName;

            foreach (var parent in child.Parents)
            {
                Parents[parent.ParentId] = parent.ParentName;
                Relationships.Add(
                    new Relationship
                    {
                        ParentId = parent.ParentId,
                        ChildId = child.ChildId,
                        Start = parent.RelationshipStart
                    });
            }
        }
    }
}
Robin Zimmerman
  • 593
  • 1
  • 6
  • 17
  • This looks like a duplicated of https://stackoverflow.com/questions/27365029/deserializing-xml-with-namespace-and-multiple-nested-elements – Stuart Smith Oct 27 '20 at 23:41
  • @StuartSmith why do you think this is a duplicate of that? The linked question seems to be a problem with the deserialization code not working at all due to a bug in the OP's code. In my question the code I have provided works fine, rather I'm trying to understand if there is an efficient way to accomplish what I am trying to do. – Robin Zimmerman Oct 28 '20 at 01:16
  • I think you are taking the wrong approach. First you do not have a schema so to use the xsd.exe you would need to create a schema. Second you real objective is to load a database so the first step is to design the tables in the database. Once you design the tables then you should parse the xml based on the structure of the database. Tables in a database are two dimensional and an xml file has more than two layers. When you design a database there are two objectives 1) Speed 2) Size. And there is a tradeoff between the two. – jdweng Oct 28 '20 at 01:21
  • You can always create a database with one table but then you end up duplicating data. For example if you have a parent with 4 children. One column is the parent name which would be duplicated for each of the 4 children. So instead if you a parent table and child table and then parent name is only entered once. But then you would need to join the parent and child data to get relationships which take more time to exact the data. – jdweng Oct 28 '20 at 01:25
  • @jdweng as mentioned in the question I do have an xsd, I just didn't include it in the question for brevity. In addition, the database and schema is already created. The reason I need to load into the specific local collections specified is that my parser class is a child class of a generic data loader that will take those collections and perform the necessary business logic before loading into the database. – Robin Zimmerman Oct 28 '20 at 02:19
  • 1
    Give database structure. It is extra work to do in two steps. If you use serialization the you first create the c# classes that are organized in the xml structure and then have to convert to the database structure. It is more efficient to do in one step and parse xml so it looks like the database structure. – jdweng Oct 28 '20 at 09:50
  • @jdweng For the purposes of the question assume that the database contains a table for parents, a table for children and a table for relationships each matching the classes above. The reason I ask the question in the way I did is because this fits into a generic data pipeline. The class that parses the xml into the local collections is the child of a class that will take those collections and load them to the database (applying necessary intermediate business logic). The parsing should be independent of the database. – Robin Zimmerman Oct 31 '20 at 00:44

1 Answers1

-1

You should use some annotations to get the data from the correct field in the XML, if you use these definitions;

public class Hierarchy
{ 
    public Hierarchy()
    {
        Children = new List<Child>();
    }
    public List<Child> Children { get; set; }
}
public class Child
{
    public Child()
    {
        Parents = new List<Parent>();
    }
    [XmlAttribute("ChildId")]
    public int ChildId { get; set; }
    [XmlAttribute("ChildName")]
    public string ChildName { get; set; }
    public List<Parent> Parents { get; set; }
}
public class Parent
{
    [XmlAttribute("ParentId")]
    public int ParentId { get; set; }
    [XmlAttribute("ParentName")]
    public string ParentName { get; set; }
    [XmlAttribute("RelationshipStart")]
    public DateTime RelationshipStart { get; set; }
}

Then you should be able to simplify your code to;

    public static Hierarchy Deserialize(string fileName)
    {
        using (var fileStream = new StreamReader(fileName, Encoding.UTF8))
        {
            XmlSerializer ser = new XmlSerializer(typeof(Hierarchy));
            return (Hierarchy)ser.Deserialize(fileStream);
        }
    }

To test it out you can create a sample data set and serialize it to a file, then use the above code to read it back

    public static void Serialize(Hierarchy h, string fileName)
    {
        System.Xml.Serialization.XmlSerializer ser = new System.Xml.Serialization.XmlSerializer(typeof(Hierarchy));
        StreamWriter sw = new StreamWriter(fileName, false, Encoding.UTF8);
        ser.Serialize(sw, h);
    }

Test Code

    static void Test()
    {
        Hierarchy h = new Hierarchy();
        Parent p1 = new Parent() { ParentId = 1, ParentName = "First", RelationshipStart = DateTime.Now };
        Parent p2 = new Parent() { ParentId = 2, ParentName = "Second", RelationshipStart = DateTime.Now };
        Parent p3 = new Parent() { ParentId = 3, ParentName = "Third", RelationshipStart = DateTime.Now };
        Child c1 = new Child() { ChildId = 1, ChildName = "First" };
        c1.Parents.Add(p1);
        c1.Parents.Add(p2);
        Child c2 = new Child() { ChildId = 2, ChildName = "Second" };
        c2.Parents.Add(p2);
        c2.Parents.Add(p3);
        h.Children.Add(c1);
        h.Children.Add(c2);
        Serialize(h, AppContext.BaseDirectory + "Text.xml");
        Hierarchy hReadBack = Deserialize(AppContext.BaseDirectory + "Text.xml");
    }

Edit : To answer your question

Use these classes

public class Hierarchy
{ 
    public Hierarchy()
    {
        Children = new List<Child>();
    }
    public List<Child> Children { get; set; }

    private Dictionary<int, string> _parents;
    private Dictionary<int, string> _childrenList;
    private List<Relationship> _relationships;
    private void CalcuateLists()
    {
        _parents = new Dictionary<int, string>();
        _childrenList = new Dictionary<int, string>();
        _relationships = new List<Relationship>();
        foreach (Child c in this.Children)
        {
            if (!_childrenList.ContainsKey(c.ChildId))
            {
                _childrenList.Add(c.ChildId, c.ChildName);
            }
            foreach (Parent p in c.Parents)
            {
                if (!_parents.ContainsKey(p.ParentId))
                {
                    _parents.Add(p.ParentId, p.ParentName);
                }
                if (_relationships.FirstOrDefault(dat => dat.ParentId == p.ParentId && dat.ChildId == c.ChildId) == null)
                {
                    _relationships.Add(new Relationship() { ChildId = c.ChildId, ParentId = p.ParentId, Start = p.RelationshipStart });
                }
            }
        }
    }
    public Dictionary<int, string> Parents { 
        get
        {
            if (_parents == null)
                CalcuateLists();
            return _parents;
        }
    }
    public Dictionary<int, string> ChildrenList {
        get
        {
            if (_childrenList == null)
                CalcuateLists();
            return _childrenList;
        }
    }
    public List<Relationship> Relationships { 
        get
        {
            if (_relationships == null)
                CalcuateLists();
            return _relationships;
        }
    }
}
public class Child
{
    public Child()
    {
        Parents = new List<Parent>();
    }
    [XmlAttribute("ChildId")]
    public int ChildId { get; set; }
    [XmlAttribute("ChildName")]
    public string ChildName { get; set; }
    public List<Parent> Parents { get; set; }
}
public class Parent
{
    [XmlAttribute("ParentId")]
    public int ParentId { get; set; }
    [XmlAttribute("ParentName")]
    public string ParentName { get; set; }
    [XmlAttribute("RelationshipStart")]
    public DateTime RelationshipStart { get; set; }
}

Then your test code becomes

public static void Test()
{
    Hierarchy h = new Hierarchy();
    Parent p1 = new Parent() { ParentId = 1, ParentName = "First", RelationshipStart = DateTime.Now };
    Parent p2 = new Parent() { ParentId = 2, ParentName = "Second", RelationshipStart = DateTime.Now };
    Parent p3 = new Parent() { ParentId = 3, ParentName = "Third", RelationshipStart = DateTime.Now };
    Child c1 = new Child() { ChildId = 1, ChildName = "First" };
    c1.Parents.Add(p1);
    c1.Parents.Add(p2);
    Child c2 = new Child() { ChildId = 2, ChildName = "Second" };
    c2.Parents.Add(p2);
    c2.Parents.Add(p3);
    h.Children.Add(c1);
    h.Children.Add(c2);
    Serialize(h, AppContext.BaseDirectory + "Text.xml");
    Hierarchy hReadBack = Deserialize(AppContext.BaseDirectory + "Text.xml");

    Dictionary<int, string> Parents = hReadBack.Parents;
    Dictionary<int, string> Children = hReadBack.ChildrenList;
    List<Relationship> Relationships = hReadBack.Relationships;
}

EDIT

To get the results directly without looping

You will need this class

public class Relationship
{
    public int ParentId { get; set; }
    public int ChildId { get; set; }
    public DateTime Start { get; set; }
}  

And this selection

// Get a list of child ids and names
Dictionary<int, string> Children = (from c in hReadBack.Children select new { ChildId = c.ChildId, Name = c.ChildName}).ToDictionary(dat => dat.ChildId, dat => dat.Name);
// Get a parent ids and names
Dictionary<int, string> Parents = (from p in hReadBack.Children.SelectMany(i => i.Parents) select new { ParentId = p.ParentId, Name = p.ParentName }).Distinct().ToDictionary(dat => dat.ParentId, dat => dat.Name);
// Get the relationships
List<Relationship> Relationship = (from Child c in hReadBack.Children from Parent p in c.Parents select new Relationship() { ChildId = c.ChildId, ParentId = p.ParentId, Start = p.RelationshipStart }).ToList();
Xavier
  • 1,383
  • 7
  • 6
  • This does not answer the question. The code in the question works to deserialize into a `Hierarchy` object, the question was about how to efficiently deserialize into the local derived dictionaries. – Robin Zimmerman Oct 28 '20 at 01:13
  • You could make if more efficient by calculating all three lists in one go an putting the resulting list into local variables in the Hierarchy class, then have the gets return the local variables. – Xavier Oct 29 '20 at 01:51
  • this still does not meet the requirements posed in the question. ForEach child in this.Children still involves looping through the data twice which violates requirement 1 from the question. – Robin Zimmerman Oct 31 '20 at 00:36