
I want to parse all the link tags from an HTML file. To do that, I wrote the following regular expression:

var pattern = @"<(LINK).*?HREF=(""|')?(?<URL>.*?)(""|')?.*?>";
var regExOptions = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Multiline;

var linkRegEx = new Regex(pattern, regExOptions);

foreach (Match match in linkRegEx.Matches(htmlFile))
{
    var group = match.Groups["URL"];
    var url = group.Value;
} 

The regex does find matches in the HTML file, but the URL capturing group is always blank.

Rajdip Patel
  • Why not use a proper HTML parser? – Jerry Oct 09 '13 at 19:34
  • Because an HTML parser takes a whole HTML file as input, and I don't have the complete file. I only have a chunk of that file's data, so I can't use one. – Rajdip Patel Oct 09 '13 at 19:35
  • If your Html is xhtml, you can use an xml-parser.. would that work for you? – Gaute Løken Oct 09 '13 at 19:39
  • No, actually I don't know that. This is a network application, so the resource can be of any type. – Rajdip Patel Oct 09 '13 at 19:40
  • Alternatively, you could wrap your fragment in an HTML skeleton and use a proper HTML parser as Jerry suggested. – Gaute Løken Oct 09 '13 at 19:40
  • The reason why we're not answering is this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ;) – Gaute Løken Oct 09 '13 at 19:42
  • No, an HTML parser takes more memory, and I have to consider performance at the millisecond level, so I can't use third-party libraries. – Rajdip Patel Oct 09 '13 at 19:42
  • @RajdipPatel, you are worrying about millisecond performance with C#, and not using C++? – gunr2171 Oct 09 '13 at 19:48
  • I'm starting to wonder how you're retrieving your HTML fragments, because reading those from disk or a network stream will completely dwarf the overhead of a proper parser. I'll leave you with either Html Agility Pack, or converting the HTML to XHTML using TidyNet and then using the .NET XML-parsing tools. And finally a quote: http://c2.com/cgi/wiki?PrematureOptimization – Gaute Løken Oct 09 '13 at 19:52
  • Yes, because this is my prototype of a large project. If it succeeds, the whole network project will be started in pure C++. – Rajdip Patel Oct 09 '13 at 19:53
  • I know regular expressions are heavy, but network operations are even heavier than regular expressions. So there is no problem applying them to chunked data. – Rajdip Patel Oct 09 '13 at 19:55

1 Answer


You could try a pattern like this:

var pattern = @"<(LINK).*?HREF=(?:([""'])(?<URL>.*?)\2|(?<URL>[^\s>]*)).*?>";

This will match:

  • a literal <
  • a literal LINK, captured in group 1
  • zero or more of any character, non-greedily
  • either of the following
    • a single " or ', captured in group 2
    • zero or more of any character, non-greedily, captured in group URL.
    • whatever was matched in group 2 (the \2 is a back-reference)
      or
    • zero or more of any character except whitespace or >, greedily, captured in group URL.
  • zero or more of any character, non-greedily
  • a literal >

This will correctly handle inputs like:

  • <LINK HREF="Foo"> produces url = "Foo"
  • <LINK HREF='Bar'> produces url = "Bar"
  • <LINK HREF=Baz> produces url = "Baz"
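
The three cases above can be checked with a small console program. This is a minimal sketch rather than a definitive implementation: the class name `LinkHrefDemo` and the helper `ExtractUrls` are hypothetical names introduced for illustration, and the input is a made-up fragment. It runs both the pattern from the question and the pattern above against the same input. With the question's pattern, the lazy `(?<URL>.*?)` is satisfied by an empty match and the trailing `.*?` consumes the actual value, which would explain the blank group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkHrefDemo
{
    // Pattern from the question: the lazy URL group can match the empty string.
    public const string QuestionPattern =
        @"<(LINK).*?HREF=(""|')?(?<URL>.*?)(""|')?.*?>";

    // Pattern from this answer: the URL must end at the matching quote (\2),
    // or, when unquoted, at the first whitespace or '>'.
    public const string AnswerPattern =
        @"<(LINK).*?HREF=(?:([""'])(?<URL>.*?)\2|(?<URL>[^\s>]*)).*?>";

    // Hypothetical helper: collects the URL group of every match.
    public static List<string> ExtractUrls(string pattern, string html)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var urls = new List<string>();
        foreach (Match match in regex.Matches(html))
        {
            urls.Add(match.Groups["URL"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample fragment covering double-quoted, single-quoted and unquoted values.
        const string html = "<LINK HREF=\"Foo\"><LINK HREF='Bar'><LINK HREF=Baz>";

        // Answer pattern: prints Foo,Bar,Baz
        Console.WriteLine(string.Join(",", ExtractUrls(AnswerPattern, html)));

        // Question pattern: three matches, each with an empty URL group.
        foreach (var url in ExtractUrls(QuestionPattern, html))
        {
            Console.WriteLine("[" + url + "]");
        }
    }
}
```

In .NET, unnamed groups are numbered before named ones, so `\2` refers to the quote group even though `(?<URL>...)` appears twice in the pattern.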
p.s.w.g
  • One issue with this pattern is that it fails when the element has no HREF attribute. In that case it goes on to parse the HREF of subsequent child elements, which is wrong. – Rajdip Patel Oct 09 '13 at 20:28
  • I tried to make the HREF attribute optional by wrapping it in a group, as below: @"<(LINK).*?(?:HREF=(?:([""'])(?<URL>.*?)\3|(?<URL>[^\s>]*)))?.*?>" But then the URL capturing group comes back empty even when an href attribute is present. Should I change the back-reference \3? – Rajdip Patel Oct 09 '13 at 20:30
  • @RajdipPatel Can you post some example inputs and what you want the output to be? – p.s.w.g Oct 09 '13 at 21:49
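
One possible way to handle the missing-HREF case raised in the comments is sketched below. It is an assumed variant, not something posted in the thread: `.*?` is replaced with `[^>]*` so the search cannot leave the current tag, and the whole HREF part sits in an optional group that the engine tries before giving up. In the attempt from the comments, the leading `.*?` is lazy and the group after it is optional, so the engine can succeed by skipping the group before it ever reaches HREF. The names `OptionalHrefDemo` and `ExtractUrls` are hypothetical, and the sketch assumes attribute values do not contain `>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OptionalHrefDemo
{
    // Assumed variant (not from the thread). [^>] keeps every part inside one tag,
    // and the HREF part is optional. The quote group is the only unnamed group,
    // so the back-reference is \1.
    public const string Pattern =
        @"<LINK\b(?:[^>]*?\sHREF=(?:([""'])(?<URL>.*?)\1|(?<URL>[^\s>]*)))?[^>]*>";

    // Hypothetical helper: one entry per LINK tag, null when the tag has no HREF.
    public static List<string> ExtractUrls(string html)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnoreCase);
        var urls = new List<string>();
        foreach (Match match in regex.Matches(html))
        {
            Group url = match.Groups["URL"];
            urls.Add(url.Success ? url.Value : null);
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up fragment: the first LINK has no HREF and is followed by an anchor that has one.
        const string html =
            "<link rel=\"stylesheet\"><a href=\"page.html\">x</a><link rel='icon' href=favicon.ico>";

        foreach (var url in ExtractUrls(html))
        {
            Console.WriteLine(url ?? "(no href)");
        }
    }
}
```

With this fragment, the first LINK tag is reported without a URL instead of borrowing `page.html` from the anchor that follows it.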