Overview

I am currently trying to write a parser for the site found on this page.

I already tried XPath (which I am pretty good at) and failed miserably to get the expected results, so I have been trying regular expressions since yesterday.

My Goal

My goal is to split this HTML into fragments, each containing the data of a single course.

E.g., "AF - Bacharelado em Artes Visuais" is the course name, and its subjects can be found in the blue tables up to "08º Semestre: 24 Créditos".

After that, you can see "AG - Licenciatura em Artes - Artes Visuais", which is the start of a new course, and so on.

This page only has two courses, but this one can have more than two.

Regular Expressions issue

A friend of mine gave me a hand and figured out that the following pattern and options work for reaching the course names. Here is some code:

// Creating Regular Expression to find name of courses
Regex regex = new Regex ("<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>", RegexOptions.Singleline);

int startIndex = 0;
while (regex.IsMatch (auxHtml, startIndex))
{
    // Checking name of the course and saving its offset
    int index         = regex.Match(auxHtml, startIndex).Groups[1].Index;
    string courseName = regex.Match(auxHtml, startIndex).Groups[1].Value;
}

Problem

Since I can reach the name of a course and its offset (Index), I should theoretically be able to split the HTML into pieces, each containing just the data related to a single course.

Here is the code I am using to try it.

  • htmlPages is a list of Strings
  • auxHtml is the HtmlPage retrieved by the WebRequest

Code

// Creating Regular Expression to find name of courses
Regex regex = new Regex ("<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>", RegexOptions.Singleline);

int startIndex = 0;
while (regex.IsMatch (auxHtml, startIndex))
{
    // Checking name of the course and saving its offset
    int index         = regex.Match(auxHtml, startIndex).Groups[1].Index;
    string courseName = regex.Match(auxHtml, startIndex).Groups[1].Value;

    // Adding name of the course and offset to dictionary
    courseIndex.Add (courseName,index);
    startIndex        = regex.Match(auxHtml, startIndex).Groups[1].Index;

    // Splitting HTML Page
    if (regex.IsMatch(auxHtml, startIndex))
    {
        int endIndex = regex.Match (auxHtml, startIndex).Groups[1].Index;
        endIndex  = endIndex - startIndex;
        htmlPiece = auxHtml.Remove(startIndex, endIndex);
    }

    htmlPages.Add(auxHtml);
}

I don't know why, but the index seems to be off.

The index of the second course name is 8022, but, if I try:

auxHtml.Substring(0,8022) 

it gives me a part of the HTML that ends well before the name of the next course.

What am I missing here?

Isn't the Index property of a Group the index of the start of the pattern in the HTML page?

Brad Rem
Marcello Grechi Lins
  • Fixed the links, sorry guys. Chrome messed them up for some reason and I ended up copying the same link twice, and neither of them was right. – Marcello Grechi Lins Jul 27 '12 at 14:11
  • Uh oh, somebody just asked how to parse html with regex. http://www.helloloser.com/wp-content/uploads/2012/06/Mj-thriller-popcorn-o.gif – Hans Z Jul 27 '12 at 14:11
  • You should not [parse (X)HTML with regex](http://stackoverflow.com/a/1732454/451590). HTML is not regular, and as such is a bad candidate for regular expressions. Use a full-fledged HTML parser. – David B Jul 27 '12 at 14:11
  • Don't use regexp to parse HTML. http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not HTML is more complicated than a simple regular expression can parse. Plus, why reinvent the wheel when thousands of good HTML parsers exist that don't use regex? – hsanders Jul 27 '12 at 14:11
  • @hsanders because the HTML is not well formed and there is no relation between a course name and its subjects. Take a look at it in Chrome and you will see what I am talking about. – Marcello Grechi Lins Jul 27 '12 at 14:19

4 Answers

While you might almost be able to achieve what you're looking for using regular expressions, it's bound to be difficult.

Regular expressions are not the right tool for this job. You will be much better off using an XML parser to parse the HTML, because HTML (like XML markup in general) is not a regular language, so regular expressions are not very useful here.

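You should look at the System.Xml.XmlDocument class. Note that it only loads well-formed XML, so a page with unclosed tags or unquoted attributes has to be cleaned up first or handled by a tolerant HTML parser.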
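
Whether XmlDocument is usable depends on how well-formed the page is. The following is a minimal sketch for checking that, not part of the original answer: the class name XmlParseCheck, the helper IsWellFormedXml, and both sample fragments are made up for illustration, with the first fragment written in the style of the markup quoted in the question.

```csharp
using System;
using System.Xml;

static class XmlParseCheck
{
    // Returns true when the markup loads as well-formed XML, false otherwise.
    // (Hypothetical helper, named here only for this example.)
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // LoadXml throws XmlException on unquoted attributes, unclosed tags, etc.
            return false;
        }
    }

    static void Main()
    {
        // Fragment in the style of the question's page: unquoted attribute, unclosed <p> and <br>.
        string legacy = "<p><br><b><font size=-1>AF - Bacharelado em Artes Visuais</font></b>";

        // Hypothetical well-formed equivalent of the same fragment.
        string clean = "<p><b><font size=\"-1\">AF - Bacharelado em Artes Visuais</font></b></p>";

        Console.WriteLine(IsWellFormedXml(legacy));
        Console.WriteLine(IsWellFormedXml(clean));
    }
}
```

If the check fails for the real page, as the asker's comment about malformed HTML suggests it would, a tolerant HTML parser is the more realistic route.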

Mike Dinescu

You should not use regular expressions to parse HTML. True regular expressions are incapable of it, and extended regular expressions are unsuited to it. You should use an existing parsing library to process HTML, and if you must do the processing yourself, you should base your solution on context-free languages, rather than regular languages.

Thom Smith

Don't use regex for HTML; use the Html Agility Pack, which lets you use XPath on HTML instead.

The problem is that HTML is not a well-behaved language: there are too many exceptions to the rules for a regex to parse it. Libraries like the Html Agility Pack were made specifically to solve this issue.

Scott Chamberlain

Even though regular expressions were not recommended for this case, I used them and managed to solve my problem.

I won't copy any code, because the code is huge, but I will explain what I did.

I used this regular expression to find the course names:

Regex regex = new Regex ("<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>", RegexOptions.Singleline);

After that, I found the offset of each course name.

Once I had the offset of each course name, I split the HTML into segments. Each segment starts at the offset right before a course name and ends at the offset right before the NEXT course name, or at the end of the file if the course is the last one in the HTML.
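
The splitting step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the linked implementation: the class name CourseSplitter, the method Split, and the miniature test page are assumptions made for the example. It uses Match.Index (the start of the whole header markup) as the cut point, whereas Groups[1].Index would point at the course name inside the header.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CourseSplitter
{
    // Pattern taken from the answer; Singleline lets '.' match newlines too.
    static readonly Regex CourseHeader = new Regex(
        "<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>(.+?)</font></font></font></b>",
        RegexOptions.Singleline);

    // Returns one (course name, HTML fragment) pair per course header found.
    // (Hypothetical helper, named here only for this example.)
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        MatchCollection matches = CourseHeader.Matches(html);

        for (int i = 0; i < matches.Count; i++)
        {
            // A fragment runs from this header to the next header,
            // or to the end of the page for the last course.
            int start = matches[i].Index;
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : html.Length;

            string name = matches[i].Groups[1].Value.Trim();
            result.Add(new KeyValuePair<string, string>(name, html.Substring(start, end - start)));
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical miniature page; the real one is far larger.
        const string header = "<p><br><b><font face=\"Arial,Helvetica\"><font color=\"#000099\"><font size=-1>";
        const string footer = "</font></font></font></b>";
        string html = "<html>" + header + "AF - Course A" + footer + "<table>1</table>"
                    + header + "AG - Course B" + footer + "<table>2</table></html>";

        foreach (var piece in Split(html))
            Console.WriteLine(piece.Key + " -> " + piece.Value.Length + " chars");
    }
}
```

A list of pairs is used instead of a dictionary so that a page repeating a course name does not throw on a duplicate key. All indices come from a single pass over the unmodified string, so they stay valid for Substring.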

For those who are interested, here is the code for my implementation.

I hope this helps people like me who are trying to parse malformed HTML.

Now please, for those who said that regex is incapable of performing this task, take some time to read my code; it might change your mind.

Marcello Grechi Lins