1

I am using this regex to match the contents of every href on a page:

(?:href)=[\"|']?(.*?)[\"|'|>]+

It works fine, but I want to match only links that do not point to media files (png, jpg, avi, wav, gif, etc.).

I tried something like adding

((?!png).)

to my regex, but this did not work. I read this question but could not get any working solution.

pila
  • 4
    Regex is almost never a good choice for parsing XML based documents. But once you get the href value, instantiate a URI to do the path dissection. – Stefan H Jan 24 '13 at 19:35
  • I think regex is more performant than using HtmlAgilityPack or some other XML parser. Or is it not? – pila Jan 24 '13 at 19:36
  • 2
    Performance isn't necessarily the issue. The problem is that hrefs can come in many different forms that your code might not cover but a true XML or HTML parser would. – Stefan H Jan 24 '13 at 19:37
  • 1
    I agree with @StefanH - the more I look at what you are trying to do, the more problems I see with it. Your gut feeling that regex can do the job [is correct](http://stackoverflow.com/a/4234491/211627), but you are underestimating just how complicated the task will be. You should strongly consider the HTMLAgilityPack. – JDB Jan 24 '13 at 19:43
  • If you don't like HtmlAgilityPack, there's always MSHTML. Or, if you are parsing XHTML, you can use the System.Xml namespace. – Sam Axe Jan 24 '13 at 20:00

4 Answers

3

I know this question was already answered.

I'd like to offer a different approach, using CsQuery instead of HtmlAgilityPack.

I think the syntax is more compact, and it is very similar to other structures since it's based on LINQ.

//input is your input HTML string
var links = CQ.Create(input).Find("a").Select(x=>x.Cq().Attr("href"));

For example

var links = CQ.Create("<div><a href='blah'></a><a href='blah2'></a></div>").Find("a").Select(x=>x.Cq().Attr("href"));
Console.Write(string.Join(",",links)); //prints blah,blah2

Hope this helps anyone :)

Benjamin Gruenbaum
2
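using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> href = new List<string>();

private void addHREF()
{
    // Put the HTML you want to check here
    string input = "";

    doc.LoadHtml(input);
    // Which file types should be ignored?
    string[] stringArray = { ".png", ".jpg" };
    // "//a[@href]" skips anchors that have no href attribute;
    // SelectNodes returns null when nothing matches
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return;
    foreach (var item in anchors)
    {
        string value = item.Attributes["href"].Value;
        // Keep the link only if it contains none of the ignored extensions
        if (!stringArray.Any(value.Contains))
            href.Add(value);
    }
}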

I tested it with my input and it works well. If you have any problems, let me know.
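One caveat with the `Contains` check above: it also drops links such as `a.png.html`, and it is case-sensitive. As a rough sketch of the idea from the comments (instantiate a URI and inspect its path), the extension can be checked on the path alone. The class name `LinkFilter`, the extension list, and the placeholder base address `http://example.invalid/` are assumptions made for illustration; they are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinkFilter
{
    // Assumed list of media extensions; extend as needed.
    private static readonly HashSet<string> MediaExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".png", ".jpg", ".avi", ".wav", ".gif" };

    // Placeholder base address, used only so relative hrefs can be parsed.
    private static readonly Uri BaseUri = new Uri("http://example.invalid/");

    public static bool IsMediaLink(string href)
    {
        Uri uri;
        if (!Uri.TryCreate(BaseUri, href, out uri))
            return false; // unparsable href: treated as non-media here

        // AbsolutePath leaves out the query string and fragment.
        string extension = Path.GetExtension(uri.AbsolutePath);
        return MediaExtensions.Contains(extension);
    }

    public static void Main()
    {
        var hrefs = new[] { "page.html", "img/pic.PNG?size=2", "http://site.com/a.png.html" };
        foreach (var h in hrefs.Where(h => !IsMediaLink(h)))
            Console.WriteLine(h);
    }
}
```

In the loop above, `IsMediaLink(value)` could stand in for the `Contains` test. With this check, `img/pic.PNG?size=2` is filtered out while `a.png.html` is kept.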

a1204773
  • Thanks, I finally switched from regex to HtmlAgilityPack and it works fine, so this is the accepted answer. The regexes in the other answers are nice too! Thanks, all. – pila Jan 25 '13 at 12:46
1

Even though I recommend against this approach, you may find this regex helpful:

(?<=href\s*=\s*['"]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)

(Based on URL regex from 8 Regular Expressions You Should Know)

Note that this expression will not allow spaces in the URL. This is because an href without quotes would otherwise also match the following attribute (for example, "domain.com/resource.txt title").

EXAMPLE:

// Requires: using System; using System.Text.RegularExpressions;
static void Main( string[] args )
{

    string l_input =
        "<a href=\n" +
        "        \"HTTPS://example.com/page.html\" title=\"match\" />\n" +
        "<a href='http://site.com/pic.png' title='do not match'> <a href=domain.com/resource.txt title=match>\n" +
        " <script src=scripts.com/script.js>";

    foreach ( Match l_match in Regex.Matches( l_input, @"(?<=href\s*=\s*['""]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)", RegexOptions.IgnoreCase ) )
        Console.WriteLine( "'" + l_match.Value + "'" );

    /* 
     * Prints:
     * 
     * 'HTTPS://example.com/page.html'
     * 'domain.com/resource.txt'
     *          
     */

    Console.ReadKey( true );

}
JDB
1

My effort

@"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

It uses a positive lookbehind to locate the start of the href content, and a negative lookahead to make sure the content is not empty and does not end in a dot followed by any of png, jpg, avi, wav or gif, then an optional quote mark and a space or >. It then matches up to an optional quote mark followed by a space or >. The content does not have to be quoted, but it must not contain whitespace.
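A small sketch of how the pattern might be used. The pattern is copied unchanged from above; the class name `HrefRegexDemo`, the helper `Extract`, and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo
{
    // The pattern from this answer, unchanged.
    private static readonly Regex NonMediaHref = new Regex(
        @"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])");

    // Hypothetical helper: returns every href value the pattern accepts.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in NonMediaHref.Matches(html))
            result.Add(m.Value);
        return result;
    }

    public static void Main()
    {
        // Made-up sample: a quoted link, a quoted image link, an unquoted link.
        string html = "<a href=\"page.html\">x</a> <a href='pic.png'>y</a> <a href=doc.txt>z</a>";
        Console.WriteLine(string.Join(",", Extract(html))); // prints page.html,doc.txt
    }
}
```

As written, the pattern is case-sensitive, so an uppercase extension such as `.PNG` would not be excluded; passing `RegexOptions.IgnoreCase` to the `Regex` constructor is one way to handle that.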

MikeM