Overview
I am currently trying to write a parser for the site found at this page.
I already tried XPath (which I am pretty good at) and could not get the expected results, so I have been trying regular expressions since yesterday.
My Goal
My goal is to split this HTML into fragments, each containing the data of a single course.
E.g. "AF - Bacharelado em Artes Visuais" is the course name, and its subjects can be found in the blue tables up to "08º Semestre: 24 Créditos".
After that, you can see "AG - Licenciatura em Artes - Artes Visuais", which is the start of a new course, and so on.
This page only has two courses, but this one can have more than two.
Regular Expressions issue
A friend of mine gave me a hand and figured out that the following pattern and options work for finding the names of the courses. Here is some code:
// Creating Regular Expression to find name of courses
Regex regex = new Regex ("<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>", RegexOptions.Singleline);
int startIndex = 0;
while (regex.IsMatch (auxHtml, startIndex))
{
    // Checking name of the course and saving its offset
    Match match = regex.Match (auxHtml, startIndex);
    int index = match.Groups[1].Index;
    string courseName = match.Groups[1].Value;
    // Advancing past this match so the loop terminates
    startIndex = match.Index + match.Length;
}
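As a minimal sketch, the same lookup can also be written with Regex.Matches, which walks every match without tracking startIndex by hand. Only the pattern comes from the code above; the class and method names (CourseNameFinder, FindCourseNames) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CourseNameFinder
{
    // Same pattern as above, written as a verbatim string ("" escapes a quote).
    private static readonly Regex CourseRegex = new Regex(
        @"<p><br><b><font face=""Arial,Helvetica""><font color=""#000099""><font size=-1>(.+?)</font></font></font></b>",
        RegexOptions.Singleline);

    // Hypothetical helper: returns (course name, offset of the name in html)
    // for every match, in document order.
    public static List<KeyValuePair<string, int>> FindCourseNames(string html)
    {
        var result = new List<KeyValuePair<string, int>>();
        foreach (Match match in CourseRegex.Matches(html))
        {
            Group name = match.Groups[1];
            // name.Index is the zero-based position of the captured name in html.
            result.Add(new KeyValuePair<string, int>(name.Value, name.Index));
        }
        return result;
    }
}
```

Calling CourseNameFinder.FindCourseNames(auxHtml) would then give one entry per course heading, assuming every heading uses exactly this markup.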
Problem
Since I can get the name of a course and its offset (Index), I should, in theory, be able to split the HTML into pieces, each containing just the data related to a single course.
Here is the code I am using to try it.
- htmlPages is a list of Strings
- auxHtml is the HtmlPage retrieved by the WebRequest
Code
// Creating Regular Expression to find name of courses
Regex regex = new Regex ("<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>", RegexOptions.Singleline);
int startIndex = 0;
while (regex.IsMatch (auxHtml, startIndex))
{
    // Checking name of the course and saving its offset
    int index = regex.Match(auxHtml, startIndex).Groups[1].Index;
    string courseName = regex.Match(auxHtml, startIndex).Groups[1].Value;
    // Adding name of the course and offset to dictionary
    courseIndex.Add (courseName,index);
    startIndex = regex.Match(auxHtml, startIndex).Groups[1].Index;
    // Splitting HTML Page
    if (regex.IsMatch(auxHtml, startIndex))
    {
        int endIndex = regex.Match (auxHtml, startIndex).Groups[1].Index;
        endIndex = endIndex - startIndex;
        htmlPiece = auxHtml.Remove(startIndex, endIndex);
    }
    htmlPages.Add(auxHtml);
}
I don't know why, but the index seems to be off.
The index of the second course name is 8022, but if I try:
auxHtml.Substring(0,8022)
it gives me a part of the HTML that ends well before the name of the next course.
What am I missing here?
Isn't the Index property of a Group the position in the HTML page where that match starts?
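For comparison, here is a minimal sketch of the split I am expecting, assuming Index is a zero-based character offset into the input string. The heading pattern below is a simplified, hypothetical stand-in for the real one, and CourseSplitter and Split are made-up names; this is not the real page's markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CourseSplitter
{
    // Hypothetical, simplified stand-in for the real course heading pattern.
    private static readonly Regex Heading = new Regex(
        @"<b class=""course"">(.+?)</b>", RegexOptions.Singleline);

    // Returns one piece per course: from the start of its heading
    // up to the start of the next heading (or the end of the page).
    public static List<string> Split(string html)
    {
        var pieces = new List<string>();
        MatchCollection matches = Heading.Matches(html);
        for (int i = 0; i < matches.Count; i++)
        {
            // Match.Index is the start of the whole heading, not of group 1.
            int start = matches[i].Index;
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : html.Length;
            // Substring keeps the characters in [start, end).
            pieces.Add(html.Substring(start, end - start));
        }
        return pieces;
    }
}
```

Note that this sketch uses Substring, which returns the characters in the given range, whereas String.Remove returns the string with that range deleted.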