Regular expression to match href, but no media files

Question

I am using this regex to match the contents of all href attributes on a page:

(?:href)=[\"|']?(.*?)[\"|'|>]+

It works fine, but I want to match only links that do not point to media files (png, jpg, avi, wav, gif, etc.).

I tried something like adding

((?!png).)

to my regex, but this did not work. I read this question but could not get any working solution.

Regex is almost never a good choice for parsing XML-based documents. But once you get the href value, instantiate a URI to do the path dissection. – Stefan H, Jan 24 '13 at 19:35
I think it is more performant than using HtmlAgilityPack or something else for XML parsing. Or is it more performant? – pila, Jan 24 '13 at 19:36
Performance isn't necessarily the issue; it is the fact that hrefs can come in many different forms that your code might not cover but a true XML or HTML parser would. – Stefan H, Jan 24 '13 at 19:37
I agree with @StefanH - the more I look at what you are trying to do, the more problems I see with it. Your gut feeling that regex can do the job [is correct](http://stackoverflow.com/a/4234491/211627), but you are underestimating just how complicated the task will be. You should strongly consider the HtmlAgilityPack. – JDB, Jan 24 '13 at 19:43
If you don't like HtmlAgilityPack there's always MSHTML. Or if you are parsing XHTML then you can use the System.Xml namespace. – Sam Axe, Jan 24 '13 at 20:00

score 3 · Answer 1 · answered Mar 12 '13 at 21:33

I know this question was already answered.

I'd like to offer a different approach, using CsQuery instead of HtmlAgilityPack.

I think the syntax is more compact, and it is very similar to other structures since it's based on LINQ:

//input is your input HTML string
var links = CQ.Create(input).Find("a").Select(x=>x.Cq().Attr("href"));

For example

var links = CQ.Create("<div><a href='blah'></a><a href='blah2'></a></div>").Find("a").Select(x=>x.Cq().Attr("href"));
Console.Write(string.Join(",", links)); //prints blah,blah2

Hope this helps anyone :)

a1204773 · Accepted Answer · 2013-01-26T15:53:28.867

2

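using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
List<string> href = new List<string>();

private void addHREF()
{
    // Put the HTML you want to check here
    string input = "";

    doc.LoadHtml(input);
    // Extensions of the files to ignore
    string[] stringArray = { ".png", ".jpg" };
    // "//a[@href]" skips anchors without an href attribute;
    // SelectNodes returns null when nothing matches, so guard against that
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    if (nodes == null)
        return;
    foreach (var item in nodes)
    {
        string value = item.Attributes["href"].Value;
        // Keep the link only if it contains none of the ignored extensions
        if (!stringArray.Any(value.Contains))
            href.Add(value);
    }
}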

I tested this with my own input and it works well. If you have any problems, let me know.
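The `Contains` check above also rejects URLs that merely contain `.png` somewhere in the middle (for example `a.png.html`) and lets upper-case extensions such as `.PNG` through. Following the comment on the question about instantiating a URI to dissect the path, a stricter check could look roughly like the sketch below. The class name `MediaLinkFilter`, the extension list and the placeholder base address are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class MediaLinkFilter
{
    // Hypothetical list of extensions to treat as media; extend as needed.
    private static readonly HashSet<string> MediaExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { ".png", ".jpg", ".gif", ".avi", ".wav" };

    // Placeholder base address, used only so that relative hrefs can be parsed.
    private static readonly Uri PlaceholderBase = new Uri("http://placeholder.invalid/");

    public static bool IsMediaLink(string href)
    {
        Uri uri;
        if (!Uri.TryCreate(PlaceholderBase, href, out uri))
            return false; // unparsable href: not classified as media

        // AbsolutePath excludes the query string and the fragment.
        string extension = Path.GetExtension(uri.AbsolutePath);
        return MediaExtensions.Contains(extension);
    }

    public static List<string> WithoutMedia(IEnumerable<string> hrefs)
    {
        // Keep only the hrefs whose path does not end in a media extension.
        return hrefs.Where(h => !IsMediaLink(h)).ToList();
    }
}
```

In the loop above, `if (!MediaLinkFilter.IsMediaLink(value)) href.Add(value);` would take the place of the `Contains` test. Comparing only the extension of the path means a query string such as `pic.png?size=2` is still recognised as media.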

edited Jan 26 '13 at 15:53

answered Jan 25 '13 at 01:58

a1204773


Thanks, I finally switched from regex to HtmlAgilityPack and it works fine, so this is the accepted answer. But the other regexes are nice too! Thanks all – pila Jan 25 '13 at 12:46

score 1 · Answer 3 · answered Jan 24 '13 at 20:25

Even though I recommend against this approach, you may find this regex helpful:

(?<=href\s*=\s*['"]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)

(Based on URL regex from 8 Regular Expressions You Should Know)

Note that this expression does not allow spaces in the URL. This is because an unquoted href would otherwise also match the attribute that follows it (for example, "domain.com/resource.txt title").

EXAMPLE:

using System;
using System.Text.RegularExpressions;

static void Main( string[] args )
{

    string l_input =
        "<a href=\n" +
        "        \"HTTPS://example.com/page.html\" title=\"match\" />\n" +
        "<a href='http://site.com/pic.png' title='do not match'> <a href=domain.com/resource.txt title=match>\n" +
        " <script src=scripts.com/script.js>";

    foreach ( Match l_match in Regex.Matches( l_input, @"(?<=href\s*=\s*['""]?)(?>(https?://)?([\da-z\.-]+)\.([a-z\.]{2,6})([/\w\.-]*)*/?)(?<!png|gif|etc)", RegexOptions.IgnoreCase ) )
        Console.WriteLine( "'" + l_match.Value + "'" );

    /*
     * Prints (each match is wrapped in single quotes by the WriteLine above):
     *
     * 'HTTPS://example.com/page.html'
     * 'domain.com/resource.txt'
     *
     */

    Console.ReadKey( true );

}

MikeM · Answer 4 · 2013-01-24T23:51:31.420

1

My effort

@"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

It uses a positive lookbehind to locate the content, and a negative lookahead to make sure the content does not end in a dot followed by one of png, jpg, avi, wav or gif, followed by an optional quote mark and then whitespace or >. It then matches up to an optional quote mark followed by whitespace or >. The content does not have to be quoted, but it must not contain whitespace.
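As a rough illustration of how the pattern behaves, the sketch below runs it over a small HTML fragment. The class name `HrefRegexDemo` and the method `NonMediaHrefs` are made up for this example; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexDemo
{
    // The pattern from this answer, unchanged.
    private const string Pattern =
        @"(?<=\shref\s*=\s*[""']?)(?![""']|\S+\.(?:png|jpg|avi|wav|gif)[""']?[\s>])\S+?(?=[""']?[\s>])";

    // Returns every href value the pattern accepts, in document order.
    public static List<string> NonMediaHrefs(string html)
    {
        var results = new List<string>();
        foreach (Match match in Regex.Matches(html, Pattern))
            results.Add(match.Value);
        return results;
    }
}
```

For the input `<a href="http://example.com/page.html">x</a> <a href='pic.png'>y</a> <a href = docs/readme.txt title=t>z</a>` this should return `http://example.com/page.html` and `docs/readme.txt`, skipping `pic.png`. As written the pattern is case-sensitive, so an upper-case `.PNG` would not be excluded unless `RegexOptions.IgnoreCase` is passed to `Regex.Matches`.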

edited Jan 24 '13 at 23:51

answered Jan 24 '13 at 20:57

MikeM


This will not match `href = "http..."` (spaces). Otherwise, not bad. – JDB Jan 24 '13 at 22:17
@Cyborgx37. Updated to allow space around the `=`, and to allow non-quoted content. – MikeM Jan 24 '13 at 23:54
