Having trouble with regular expression

Question

I am a total noob at regular expressions and need to parse some HTML. I am looking for individual categories. This is what the HTML looks like:

<p>Categories: 
        <a href="/some/URL/That/I/dont/need">Category1</a>  | 
        <a href="/could/be/another/URL/That/I/dont/need">Category2</a> 
</p>

There could be 1-5 categories. What I need is the category names themselves: "Category1", "Category2", etc.

This project is in C# using Visual Studio 2010. Currently what I have is this:

private static readonly Regex _categoriesRegex = new Regex("(<p>Categories:)((/w/.?<Categories>.*?).*?)(</p>)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

I know I am probably way off, but I am wondering if anyone could at least point me in the right direction.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Yuriy Faktorovich, Dec 04 '10 at 21:05
@Yuriy, that's precisely what I was going to post... you were faster by a minute ;) – Thomas Levesque, Dec 04 '10 at 21:07
The application I am using is a simple web parser, and I am just looking for something simple that can have false negatives or positives. I just need something that works 95% of the time. I am using regex for simple expressions elsewhere in the code, but I am stuck on this because it is more complicated than looking for a specific tag and end tag. – bvandrunen, Dec 04 '10 at 21:11
@Thomas, I figured I'd get it in without any more sarcasm heading his way. – Yuriy Faktorovich, Dec 04 '10 at 22:07

score 6 · Answer 1 · edited Nov 27 '17 at 15:07

Don't use regex for this kind of task, use a dedicated tool instead. Your best option is probably to use HTML Agility Pack.

EDIT: here's an example using HTML Agility Pack (written in LINQPad):

void Main()
{
    var doc = new HtmlDocument();
    doc.Load(@"D:\tmp\foobar.html");
    var query =
        from p in doc.DocumentNode.Descendants("p")
        where p.InnerText.StartsWith("Categories:")
        from a in p.Elements("a")
        select a.InnerText;

    query.Dump();
}

It returns:

Category1
Category2

I should note that it was the first time I actually tried to use HAP, and I'm pleasantly surprised by how easy it is (writing the code above took about 3 minutes). The API is very similar to Linq to XML, which makes it very intuitive if you're comfortable with Linq.
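
The example above reads the page from a file. If the HTML is already held in a string, a minimal sketch along the same lines might use `LoadHtml` instead of `Load`. The class and method names here (`CategoryScraper`, `GetCategories`) are hypothetical, and the `TrimStart()`/`Trim()` calls are an assumption added to tolerate leading or trailing whitespace; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, not part of the .NET base class library

public static class CategoryScraper
{
    // Hypothetical helper: html is the markup already held in memory.
    public static List<string> GetCategories(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of doc.Load(path)

        return (from p in doc.DocumentNode.Descendants("p")
                // Assumption: ignore whitespace before the "Categories:" label
                where p.InnerText.TrimStart().StartsWith("Categories:", StringComparison.Ordinal)
                from a in p.Elements("a") // direct <a> children only
                select a.InnerText.Trim()).ToList();
    }
}
```

Calling `CategoryScraper.GetCategories(html)` with the paragraph from the question should yield the two category names, in document order.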

answered Dec 04 '10 at 21:21 by Thomas Levesque · edited Nov 27 '17 at 15:07 by carla

I was going to post an answer along the same lines, but then I wondered how this task would actually be performed using HTML Agility Pack. It's recommended every single time someone attempts to parse HTML with regex, but not very often is an example given (and the examples at the HTML Agility Pack website are not worth speaking about). – dtb Dec 04 '10 at 21:27
I agree with dtb. This is for a simple web scraper. I already have the HTML coming in as a variable. It is simple and it doesn't have to work 100% of the time. I don't need a huge solution to a small problem. – bvandrunen Dec 04 '10 at 21:32
@bvandrunen: Oh, I strongly believe that regex is the wrong tool here, and the HTML Agility Pack is the right one, even for a small task. I just wish more people would post actual examples of how to accomplish the task with HTML Agility Pack, rather than just posting a link to it. – dtb Dec 04 '10 at 21:40
@bvandrunen: I don't understand what you mean by "it doesn't have to work 100% of the time". I've never heard that said about solving a specific problem or task. – Oscar Mederos Dec 04 '10 at 21:53
@dtb, you're absolutely right. I mentioned HAP because I always see it recommended for this kind of task, but I had never used it myself until today. So I downloaded it and tried to answer the question with it, and it turned out to be pretty easy. @bvandrunen, see my updated answer for an example. – Thomas Levesque Dec 04 '10 at 21:53

score 1 · Answer 2 · edited May 23 '17 at 10:26

The HTML Agility Pack (HAP) is usually suggested for these types of questions, and Thomas' solution is great. However, I'm not 100% for it if you can guarantee that your input is well-formed and your desired result is straightforward. In that case you can usually get by with LINQ to XML instead of introducing HAP to your project. I demonstrate this approach below. I've also included a regex approach, since your request isn't too wild: non-nested input is simple to deal with.

I recommend you stick with the LINQ solution since it's maintainable and easy for others to understand. The regex was added only to demonstrate how to do it and address your original question.

string input = @"<p>Categories: 
        <a href=""/some/URL/That/I/dont/need"">Category1</a>  | 
        <a href=""/could/be/another/URL/That/I/dont/need"">Category2</a> 
</p>";

// LINQ to XML approach for well-formed HTML
// (needs: using System.Linq; using System.Xml.Linq;)
var xml = XElement.Parse(input);
var query = xml.Elements("a").Select(e => e.Value);
foreach (var item in query)
{
    Console.WriteLine(item);
}

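// regex solution (needs: using System.Text.RegularExpressions;)
// Group 1 sits inside a repeated group, so each category ends up in Groups[1].Captures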
string pattern = @"Categories:(?:[^<]+<a[^>]+>([^<]+)</a>)+";

Match m = Regex.Match(input, pattern);
if (m.Success)
{
    foreach (Capture c in m.Groups[1].Captures)
    {
        Console.WriteLine(c.Value);    
    }
}
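
To make the pattern above easier to read, here is a sketch of the same expression written with `RegexOptions.IgnorePatternWhitespace`, which lets the pattern span several lines with `#` comments. The wrapper names (`CategoryRegexSketch`, `ExtractCategories`) are hypothetical, and the `Trim()` on each capture is an added assumption rather than part of the original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CategoryRegexSketch
{
    // Same pattern as above, split across lines for readability.
    private const string Pattern = @"
        Categories:      # literal label that starts the paragraph
        (?:              # one repetition per link
            [^<]+        # text before the tag: whitespace and pipe separators
            <a[^>]+>     # opening anchor tag with its attributes
            ([^<]+)      # group 1: the link text, which is the category name
            </a>         # closing anchor tag
        )+               # repeat for every category";

    // Hypothetical helper: returns the category names found in html.
    public static List<string> ExtractCategories(string html)
    {
        var names = new List<string>();
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (m.Success)
        {
            // A repeated group keeps every value it matched in Captures
            foreach (Capture c in m.Groups[1].Captures)
            {
                names.Add(c.Value.Trim());
            }
        }
        return names;
    }
}
```

For the sample paragraph in the question, `ExtractCategories` returns `Category1` and `Category2`. As with any regex over HTML, this relies on the links being flat, with no nested tags inside the `<a>` elements.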

HTML pages are rarely well-formed XML documents (except XHTML), so Linq to XML won't work in most cases — Thomas Levesque, Dec 04 '10 at 22:39
@Thomas I agree, hence my emphasis. Whoever decides to go this route needs to know their input well. — Ahmad Mageed, Dec 04 '10 at 22:42

Oscar Mederos · Answer 3 · 2010-12-04T22:14:03.210

Adding a little bit to @Thomas Levesque's answer (which is the right way to go):

If you want to get the link instead of the text between <a> tags, you just need to do:

    var query =
        from p in doc.DocumentNode.Descendants("p")
        where p.InnerText.StartsWith("Categories:")
        from a in p.Elements("a")
        select a.Attributes["href"].Value;

EDIT: If you're not familiar with LINQ syntax, you could get the same with:

var nodes = doc.DocumentNode.SelectNodes("//p"); // all <p> elements in the document
if (nodes != null)
{
    foreach (var n in nodes)
    {
        if (n.InnerText.StartsWith("Categories:")) // found the <p> we need
        {
            // <a> elements with an href that are direct children of the <p>;
            // SelectNodes returns null when nothing matches
            var links = n.SelectNodes("./a[@href]");
            if (links != null)
            {
                foreach (var a in links)
                {
                    // Prints something like: "Name: Category1    Link: /some/URL/That/I/dont/need"
                    Console.WriteLine("Name: {0} \t Link: {1}", a.InnerText, a.Attributes["href"].Value);
                }
            }
            break;
        }
    }
}
