I have some website source stream I am trying to parse. My current Regex is this:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

But it doesn't match the links anymore. I included a sample string here.

Basically I am trying to match these:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**

In the sample I posted, there are at least 3, I didn't count the other ones.

Also I use RegexHero (online and free) to see my matching interactively before adding it to code.

Charles
Joan Venge

3 Answers

For completeness, here's how it's done with the Html Agility Pack, which is a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

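// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
// Keep only the <a> elements whose id starts with the prefix
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));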

That's it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}
Kobi

This is quite simple: the markup changed, and now the href attribute appears before the id:

<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $1: href
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $2: id
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag

Note that:

  • This kind of breakage is mainly why parsing HTML with a regex is a bad idea.
  • The group numbers have changed. You can use named groups instead, while you're at it: (?<ID>[^>\s'""]+) instead of ([^>\s'""]+).
  • The quotes are still escaped (this should be OK in character sets)

Example on regex hero.

Kobi
  • Thanks, in your example link, is it modified? When I open it, it says 0 matches. – Joan Venge Sep 25 '11 at 04:24
  • @JoanVenge - That is strange... I'll let it be, it already failed me thrice, but I think the idea is clear anyway `:)` Thanks! – Kobi Sep 25 '11 at 04:46
  • Regex Hero truncates the target string when using the permalink feature if it's longer than 4,000 characters. It occurs to me that I should probably raise the limit. @Joan - If you copy and paste your original html, then Kobi's regular expression should work. – Steve Wortham Sep 25 '11 at 14:31
  • I raised the limit to 500,000 characters. So this should work... http://regexhero.net/tester/?id=2509fab5-243f-4fa3-aeb2-61658ae38f7b – Steve Wortham Sep 25 '11 at 15:02
  • @Steve, thanks man, it really works. Btw thanks Kobi for providing 2 great examples. I don't know regex but I tried your Agile sample and it works even better, so switched to that. – Joan Venge Sep 26 '11 at 02:20
  • @Joan and Kobi - You're welcome. And you're absolutely right in using HTML Agility Pack in this scenario. It's what I would do as well. By the way, I'm working on a new tool called XML Hero which will help with things like this. – Steve Wortham Sep 27 '11 at 20:13
  • @Steve: Is your tool something like Regex Hero? – Joan Venge Sep 27 '11 at 20:50
  • @Joan - Yes, the details are being worked out as it's still early. But it'll be a tool to help with XML and HTML parsing via XPATH and the Html Agility Pack. The last feature I implemented is code completion, which is working great. I'll also implement benchmarking and .NET code generation. I'll send an email out to all registered Regex Hero users when it's released. – Steve Wortham Sep 27 '11 at 21:01
  • @Steve: Awesome man, I didn't know you wrote Regex Hero, had to look at your profile. Certainly great tools for developers. – Joan Venge Sep 27 '11 at 21:04

Don't do that (well, almost, but it's not for everyone). Parsers are meant for that type of thing.

Icarus
  • Thanks but I need a quickfix, not a major change. Besides it's a personal tool no one uses anyway. Also I see many instances of similar practice in production code, so I think even most programmers don't follow these good practices. – Joan Venge Sep 25 '11 at 04:05