Simple regex help using C# (Regex pattern included)

Question

I am trying to parse the HTML source of a website. My current Regex is this:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

But it doesn't match the links anymore. I included a sample string here.

Basically I am trying to match these:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**

In the sample I posted there are at least 3 such links; I didn't count the rest.

Also I use RegexHero (online and free) to see my matching interactively before adding it to code.

@Joan Venge For reference: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Sep 25 '11 at 04:40

score 4 · Accepted Answer · answered Sep 25 '11 at 04:43

For completeness, here is how it's done with the Html Agility Pack, which is a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

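using System;                     // Console
using System.Collections.Generic; // IEnumerable<T>
using System.Linq;                // Where
using HtmlAgilityPack;            // HtmlWeb, HtmlDocument, HtmlNode

string linkIdPrefix = "thread_title_";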
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));

That's it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}

score 3 · Answer 2 · Kobi · 2011-09-25T18:19:12.517

This is quite simple. The markup changed, and now the href attribute appears before the id:

<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $1: href
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $2: id
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag

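Note that:

This fragility is the main reason parsing HTML with a regex is a bad idea: a change in attribute order was enough to break the pattern.
The group numbers have changed ($1 is now the href and $2 the id). You can use named groups instead, while you're at it: (?<ID>[^>\s'""]+) instead of ([^>\s'""]+).
The quotes are still escaped as "" for a C# verbatim string (the doubled quote should be harmless inside a character set if you paste the pattern into a tester as-is).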
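
As an illustration of the named-groups suggestion, here is a minimal sketch of the reordered pattern in C#. The group names Link, TopicId and Title are borrowed from the labels in the question, and the class name ThreadLinkDemo is hypothetical; the sample anchor is the one from the question. It only handles markup where href comes before id, so it has the same fragility as the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreadLinkDemo // hypothetical name, for illustration only
{
    // Same pattern as above (href before id), with named groups instead of numbered ones.
    public static readonly Regex Pattern = new Regex(
        @"<a\b                                   # Begin start tag
            [^>]+?                               # Lazily consume up to href attribute
            href\s*=\s*['""]?(?<Link>[^>\s'""]+)['""]?                 # Link
            [^>]+?                               # Lazily consume up to id attribute
            id\s*=\s*['""]?thread_title_(?<TopicId>[^>\s'""]+)['""]?   # TopicId
            [^>]*                                # Consume up to end of open tag
            >                                    # End start tag
            (?<Title>.*?)                        # Title
            </a\s*>                              # Closing tag",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        // Sample anchor taken from the question.
        string html = "<a href=\"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/\" id=\"thread_title_3046631\">How to Get a Travel Visa</a>";

        foreach (Match m in Pattern.Matches(html))
        {
            // Groups are read by name, so reordering them in the pattern does not break this code.
            Console.WriteLine("{0} - {1} - {2}",
                m.Groups["TopicId"].Value, m.Groups["Link"].Value, m.Groups["Title"].Value);
        }
    }
}
```

Because the calling code looks groups up by name, a later change in the order of the groups inside the pattern would not require touching the loop.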

Example on regex hero.

edited Sep 25 '11 at 18:19

answered Sep 25 '11 at 04:12

Kobi

Thanks, in your example link, is it modified? When I open it, it says 0 matches. – Joan Venge Sep 25 '11 at 04:24
@JoanVenge - That is strange... I'll let it be, it already failed me thrice, but I think the idea is clear anyway `:)` Thanks! – Kobi Sep 25 '11 at 04:46
Regex Hero truncates the target string when using the permalink feature if it's longer than 4,000 characters. It occurs to me that I should probably raise the limit. @Joan - If you copy and paste your original html, then Kobi's regular expression should work. – Steve Wortham Sep 25 '11 at 14:31
I raised the limit to 500,000 characters. So this should work... http://regexhero.net/tester/?id=2509fab5-243f-4fa3-aeb2-61658ae38f7b – Steve Wortham Sep 25 '11 at 15:02
@Steve, thanks man, it really works. Btw thanks Kobi for providing 2 great examples. I don't know regex but I tried your Agile sample and it works even better, so switched to that. – Joan Venge Sep 26 '11 at 02:20
@Joan and Kobi - You're welcome. And you're absolutely right in using HTML Agility Pack in this scenario. It's what I would do as well. By the way, I'm working on a new tool called XML Hero which will help with things like this. – Steve Wortham Sep 27 '11 at 20:13
@Steve: Is your tool something like Regex Hero? – Joan Venge Sep 27 '11 at 20:50
@Joan - Yes, the details are being worked out as it's still early. But it'll be a tool to help with XML and HTML parsing via XPATH and the Html Agility Pack. The last feature I implemented is code completion, which is working great. I'll also implement benchmarking and .NET code generation. I'll send an email out to all registered Regex Hero users when it's released. – Steve Wortham Sep 27 '11 at 21:01
@Steve: Awesome man, I didn't know you wrote Regex Hero, had to look at your profile. Certainly great tools for developers. – Joan Venge Sep 27 '11 at 21:04

score 1 · Answer 3 · edited May 23 '17 at 11:47

Don't do that (well, almost, but it's not for everyone). Parsers are meant for that type of thing.

answered Sep 25 '11 at 04:00

Icarus

Thanks but I need a quickfix, not a major change. Besides it's a personal tool no one uses anyway. Also I see many instances of similar practice in production code, so I think even most programmers don't follow these good practices. – Joan Venge Sep 25 '11 at 04:05
