I am trying to write an XML parser to parse all the courses that a university offers in a given calendar year and semester. In particular, I am trying to get the department acronym (e.g. FIN for Finance), the course number (e.g. for Math 415, the number is 415), the course name, and the number of credit hours the course is worth.
The file I am trying to parse can be found HERE
EDIT AND UPDATE
Upon reading deeper into XML parsing and the best way to optimize it, I stumbled upon this blog POST
Assuming the results of the tests run in that article are both honest and accurate, it seems that XmlReader far outperforms both XDocument and XmlDocument, which confirms what is said in the great answers below. With that in mind, I rewrote my parser class to use XmlReader and limited the number of readers used in a single method.
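For anyone unfamiliar with the forward-only style of XmlReader, here is a minimal sketch of the idea, separate from my actual parser. The sample XML, the `subject` element name, and the `example.invalid` URLs are assumptions made up for illustration; the real feed may be laid out differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubjectListSketch
{
    // Hypothetical sample document; not the real university feed.
    public const string SampleXml =
        "<term><subjects>" +
        "<subject id=\"FIN\" href=\"http://example.invalid/FIN.xml\">Finance</subject>" +
        "<subject id=\"MATH\" href=\"http://example.invalid/MATH.xml\">Mathematics</subject>" +
        "</subjects></term>";

    // Streams through the XML once and collects id -> href for every <subject> element.
    public static Dictionary<string, string> ReadSubjects(string xml)
    {
        var result = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ReadToFollowing advances to the next matching element and
            // returns false once the end of the document is reached.
            while (reader.ReadToFollowing("subject"))
            {
                string id = reader.GetAttribute("id");
                string href = reader.GetAttribute("href");
                if (!string.IsNullOrEmpty(id))
                {
                    result[id] = href;
                }
            }
        }
        return result;
    }
}
```

Calling `SubjectListSketch.ReadSubjects(SubjectListSketch.SampleXml)` should give a dictionary with the two sample entries. Because the reader never builds a tree in memory, only the current node is held at any time, which is where the speed advantage over XDocument and XmlDocument is generally said to come from.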
Here is the new parser class:
public void ParseDepartments()
{
    // Create a reader for the XML file of the given calendar year and semester
    using (XmlReader reader = XmlReader.Create(xmlPath)) {
        reader.ReadToFollowing("subjects"); // Navigate to the 'subjects' element
        while (!reader.EOF) {
            string pth = reader.GetAttribute("href"); // Department's XML path
            string acro = reader.GetAttribute("id"); // Department's acronym
            reader.Read(); // Advance to the next node so that every element is visited
            if (!string.IsNullOrEmpty(acro)) { // Only add departments with a valid acronym
                deps.AddDepartment(acro, pth);
            }
        }
    }
}
public void ParseDepCourses()
{
    // Loop through all the departments and visit their respective XML files
    foreach (KeyValuePair<string, string> department in deps.DepartmentPaths) {
        try {
            using (XmlReader reader = XmlReader.Create(department.Value)) {
                reader.ReadToFollowing("courses"); // Navigate to the 'courses' element
                while (!reader.EOF) {
                    string pth = reader.GetAttribute("href"); // Course's XML path
                    string num = reader.GetAttribute("id"); // Course number
                    reader.Read(); // Advance to the next node (the element's text, if any)
                    if (!string.IsNullOrEmpty(num)) {
                        string crseName = reader.Value; // After Read(), reader.Value is the element's text, i.e. <elementTag>Value</elementTag>
                        deps[department.Key].Add(new CourseObject(num, crseName, termID, pth)); // Add the course to the department's course list
                    }
                }
            }
        } catch (WebException) { } // Thrown (error 404) when no XML file is found, i.e. the department has no courses
    }
}
public void ParseCourseInformation()
{
    // Regular expression used to decide whether a section is a 'Lecture' section;
    // if it is, that section's XML file is visited and its instructor is added
    Regex expr = new Regex(@"^\S(L*)\d\b|^\S(L*)\b|^\S\d\b|^\S\b", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    foreach (KeyValuePair<string, Collection<CourseObject>> pair in deps) {
        foreach (CourseObject crse in pair.Value) {
            try {
                using (XmlReader reader = XmlReader.Create(crse.XmlPath)) {
                    reader.ReadToFollowing("creditHours"); // Navigate to the 'creditHours' element
                    reader.Read(); // Move from the element node to its text node; an element node's own Value is empty
                    crse.ParseCreditHours(reader.Value); // Class method that parses the string and grabs the correct integer values
                    reader.ReadToFollowing("sections"); // Navigate to the 'sections' element
                    while (!reader.EOF) {
                        string pth = reader.GetAttribute("href"); // Section's XML path
                        string crn = reader.GetAttribute("id"); // Section's CRN
                        reader.Read(); // Advance to the next node (the element's text, if any)
                        if (!string.IsNullOrEmpty(crn)) {
                            string sction = reader.Value;
                            if (expr.IsMatch(sction)) { // Check whether sction is a 'Lecture' section
                                using (XmlReader reader2 = XmlReader.Create(pth)) { // Open the section's XML file
                                    reader2.ReadToFollowing("instructors"); // Navigate to the 'instructors' element
                                    while (!reader2.EOF) {
                                        string firstName = reader2.GetAttribute("firstName");
                                        string lastName = reader2.GetAttribute("lastName");
                                        reader2.Read();
                                        if (!string.IsNullOrEmpty(firstName) && !string.IsNullOrEmpty(lastName)) { // Make sure it is a valid name
                                            string instr = firstName + ". " + lastName; // Concatenate into the full name
                                            crse.AddSection(pth, sction, crn, instr); // Add the section to the course
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            } catch (WebException) { } // No course/section information found
        }
    }
}
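One XmlReader detail that is easy to trip over when reading element text, as the methods above do: right after `ReadToFollowing`, the reader is positioned on the element node itself, and `Value` is an empty string there. The text has to be read explicitly. Below is a small sketch of that behaviour; the sample XML and the class and method names are assumptions for illustration only, not the real course file layout.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ElementTextSketch
{
    // Hypothetical sample document; the real course files may look different.
    public const string SampleXml =
        "<course id=\"415\"><label>Sample Course</label>" +
        "<creditHours>3 hours.</creditHours></course>";

    // Returns reader.Value immediately after navigating to the element,
    // which is the empty string because the current node is an Element.
    public static string ValueRightAfterNavigate(string xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing(elementName);
            return reader.Value;
        }
    }

    // Returns the text inside the first matching element, or null if it is absent.
    public static string ReadElementText(string xml, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!reader.ReadToFollowing(elementName))
            {
                return null; // Element not present in the document
            }
            // Reads the element's simple text content and moves past its end tag.
            return reader.ReadElementContentAsString();
        }
    }
}
```

With the sample above, `ReadElementText(SampleXml, "creditHours")` should return the text "3 hours.", while `ValueRightAfterNavigate` returns an empty string. Note that `ReadElementContentAsString` only works for elements without child elements, so calling `Read()` once and then using `Value` is a reasonable alternative when the reader should stay inside the element.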
Although running the full parser takes quite some time (anywhere from 10 to 30 minutes), that is expected given the large amount of data being parsed. Thanks to everyone who posted answers; it was much appreciated. I hope this helps anyone else who has similar problems or questions.
Thanks,
David