I am trying to build a data pipeline in .NET. I have been given an xsd
and have used the XML Schema Definition Tool to generate C# classes that represent the object model. In order to load this data into my data store I need to transform the data coming out of the XML into structure that matches my application schema and collect/dedupe elements. To do this I have a parser class that will read a file and load the contents into local collections which will then be loaded into my database. I see two options to do this -
- Manually loop through the XML with an
XmlReader
and pull out the data I need, loading it into the local collections in stream. This is not desirable because it does not take advantage of the strongly typed/strictxsd
that I was given and requires a lot of hard coding things likewhile (reader.Read())
, check for specific XML nodes, and then `reader.GetAttribute("HardCodedString"). - Use
XmlSerializer
to deserialize the whole file at once and then loop through the deserialized collections and insert into my local collections. This is not desirable because the files could be very large and this method forces me to loop through all of the data twice (once to deserialize and once to extract the data to my local collections).
Ideally I would like some way to register a delegate to be executed as each object is deserialized to insert into my local collections. Is there something in the framework that allows me to do this? Requirements are as follows:
- Performant - Only loop through the data once.
- Functional - Data is inserted into the local collections during deserialization.
- Maintainable - Utilize strongly typed classes that were generated via the
xsd
.
I have created a minimal example to illustrate my point.
Example XML File:
<Hierarchy xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.example.com/example">
<Children>
<Child ChildId="1" ChildName="First">
<Parents>
<Parent ParentId="1" ParentName="First" RelationshipStart="1900-01-01T00:00:00"/>
<Parent ParentId="2" ParentName="Second" RelationshipStart="2000-01-01T00:00:00"/>
</Parents>
</Child>
<Child ChildId="2" ChildName="Second">
<Parents>
<Parent ParentId="2" ParentName="Second" RelationshipStart="1900-01-01T00:00:00"/>
<Parent ParentId="3" ParentName="Third" RelationshipStart="2000-01-01T00:00:00"/>
</Parents>
</Child>
</Children>
</Hierarchy>
Local collections I am trying to load:
public Dictionary<int, string> Parents { get; }
public Dictionary<int, string> Children { get; }
public List<Relationship> Relationships { get; }
Manual version (not maintainable and doesn't use xsd
):
public void ParseFileManually(string fileName)
{
using (var reader = XmlReader.Create(fileName))
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element && reader.Name == "Hierarchy")
{
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
{
int childId = int.Parse(reader.GetAttribute("ChildId"));
string childName = reader.GetAttribute("ChildName");
Children[childId] = childName;
while (reader.Read())
{
if (reader.NodeType == XmlNodeType.Element && reader.Name == "Parent")
{
int parentId = int.Parse(reader.GetAttribute("ParentId"));
string parentName = reader.GetAttribute("ParentName");
DateTime relationshipStart = DateTime.Parse(reader.GetAttribute("RelationshipStart"));
Parents[parentId] = parentName;
Relationships.Add(
new Relationship{
ParentId = parentId,
ChildId = childId,
Start = relationshipStart
});
}
else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "Child")
{
break;
}
}
}
}
}
}
}
}
Deserialize version (loops through the data twice):
public void ParseFileWithDeserialize(string fileName)
{
var serializer = new XmlSerializer(typeof(Hierarchy));
using (var fileStream = new FileStream(fileName, FileMode.Open))
{
var fileData = (Hierarchy) serializer.Deserialize(fileStream);
foreach (var child in fileData.Children)
{
Children[child.ChildId] = child.ChildName;
foreach (var parent in child.Parents)
{
Parents[parent.ParentId] = parent.ParentName;
Relationships.Add(
new Relationship
{
ParentId = parent.ParentId,
ChildId = child.ChildId,
Start = parent.RelationshipStart
});
}
}
}
}