5

Please help me find the favicon URL in the sample HTML below using a regular expression. It should also check for the file extension ".ico". I am developing a personal bookmarking site and I want to save the favicons of the links I bookmark. I have already written the C# code to convert the icon to GIF and save it, but I have very limited knowledge about regex, so I am unable to select this tag because the ending tags differ between sites. Examples of ending tags: "/>" and "/link>".

My programming language is C#

<meta name="description" content="Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service." />
<meta name="robots" content="index, follow" />
<meta name="verify-v1" content="x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->

Solution: one more way to do this is to download HtmlAgilityPack and add a reference to its DLL. Thanks for helping me. I really love this site :)

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(readcontent);

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        if (att.Value.EndsWith(".ico"))
        {
            faviconurl = att.Value;
        }
    }
}
ziaasp
  • Next time, format the code properly so that people have a slight chance of helping you. If we cannot read the code, how are we supposed to help you? Which language are you using to parse this content? In any case, don't use regular expressions, use an HTML parser. – Felix Kling Jul 02 '11 at 09:26
  • Sorry. How am I supposed to format code? I don't understand. I left 4 white spaces in front, but I don't get it – ziaasp Jul 02 '11 at 10:11
  • @ziiee: [You did not indent that line with four spaces](http://stackoverflow.com/revisions/7cbc44d7-ebfa-4d95-a301-5513341ade5e/view-source). But even then, you just posted one long string, so it will end up as one line, which is not helpful either. – Felix Kling Jul 02 '11 at 10:14
  • "i have very limited knowledge about regex": Then what *on Earth* makes you think that regular expressions are the right tool for the job? – johnsyweb Jul 02 '11 at 10:34
  • I thought regex was the only way, sir. Now I know there are simpler ways – ziaasp Jul 02 '11 at 16:41
  • @ziaasp What is readcontent? I know it's the HTML file. Where does it come from? – Ali Vojdanian Nov 10 '12 at 17:45

4 Answers

1

I had a go at this a while back, so here is something pretty simple. First it attempts to find the /favicon.ico file. If that fails, I load the page using Html Agility Pack and then use XPath to find any link tags. I loop through the link tags to see if they have a rel attribute containing "icon". If they do, I grab the href attribute, if it exists, and expand it into an absolute URL for that site.

Please feel free to play around with this and offer any improvements.

private static Uri GetFaviconUrl(string siteUrl)
{
    // try looking for a /favicon.ico first
    var url = new Uri(siteUrl);
    var faviconUrl = new Uri(string.Format("{0}://{1}/favicon.ico", url.Scheme, url.Host));
    try
    {
        using (var httpWebResponse = WebRequest.Create(faviconUrl).GetResponse() as HttpWebResponse)
        {
            if (httpWebResponse != null && httpWebResponse.StatusCode == HttpStatusCode.OK)
            {
                // Log("Found a /favicon.ico file for {0}", url);
                return faviconUrl;
            }
        }
    }
    catch (WebException)
    {
    }

    // otherwise parse the html and look for <link rel='icon' href='' /> using html agility pack
    var htmlDocument = new HtmlWeb().Load(url.ToString());
    var links = htmlDocument.DocumentNode.SelectNodes("//link");
    if (links != null)
    {
        foreach (var linkTag in links)
        {
            var rel = GetAttr(linkTag, "rel");
            if (rel == null)
                continue;

            // IndexOf returns 0 for rel="icon", so test >= 0 rather than > 0.
            if (rel.Value.IndexOf("icon", StringComparison.InvariantCultureIgnoreCase) >= 0)
            {
                var href = GetAttr(linkTag, "href");
                if (href == null)
                    continue;

                Uri absoluteUrl;
                if (Uri.TryCreate(href.Value, UriKind.Absolute, out absoluteUrl))
                {
                    // Log("Found an absolute favicon url {0}", absoluteUrl);
                    return absoluteUrl;
                }

                // Resolve relative hrefs (with or without a leading slash) against the page URL.
                var expandedUrl = new Uri(url, href.Value);
                //Log("Found a relative favicon url for {0} and expanded it to {1}", url, expandedUrl);
                return expandedUrl;
            }
        }
    }

    // Log("Could not find a favicon for {0}", url);
    return null;
}

public static HtmlAttribute GetAttr(HtmlNode linkTag, string attr)
{
    return linkTag.Attributes.FirstOrDefault(x => x.Name.Equals(attr, StringComparison.InvariantCultureIgnoreCase));
}
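
On the "expand into an absolute URL" step: here is a minimal sketch, assuming the built-in `Uri.TryCreate(Uri, string, out Uri)` overload is acceptable for resolving an href against the URL of the page it came from. `FaviconUrlResolver` and `Resolve` are hypothetical names used only for illustration; they are not part of the code above.

```csharp
using System;

public static class FaviconUrlResolver
{
    // Resolves an href taken from a <link> tag against the page URL.
    // Absolute hrefs are returned unchanged; relative ones are combined
    // with the page URL. Returns null when the href cannot be resolved.
    // Note: new Uri(pageUrl) throws if pageUrl itself is not a valid absolute URL.
    public static Uri Resolve(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        Uri resolved;
        return Uri.TryCreate(baseUri, href, out resolved) ? resolved : null;
    }
}
```

For example, `Resolve("http://example.com/blog/post.html", "img/fav.ico")` should give `http://example.com/blog/img/fav.ico`, while an href of `/favicon.ico` resolves against the site root.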
superlogical
1
<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?

Maybe... it is not robust, but it could work. (I used Perl regex syntax.)
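
One way to make a regex approach less sensitive to attribute order is to match each link tag first and then read its attributes separately. The following is only a sketch, assuming attribute values are always quoted and never contain `>`. `FaviconRegexSketch` and `FindIconHrefs` are hypothetical names, and an HTML parser remains the more robust choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FaviconRegexSketch
{
    // Matches one <link ...> start tag; how the tag is closed (">", "/>", "</link>") does not matter.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // Matches one name="value" or name='value' attribute inside a tag.
    private static readonly Regex Attribute =
        new Regex(@"([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)')");

    // Returns the href of every <link> whose rel value contains "icon".
    public static List<string> FindIconHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            string rel = null, href = null;
            foreach (Match attr in Attribute.Matches(tag.Value))
            {
                string name = attr.Groups[1].Value.ToLowerInvariant();
                string value = attr.Groups[2].Success ? attr.Groups[2].Value : attr.Groups[3].Value;
                if (name == "rel") rel = value;
                else if (name == "href") href = value;
            }
            if (rel != null && href != null &&
                rel.IndexOf("icon", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                result.Add(href);
            }
        }
        return result;
    }
}
```

Because rel and href are read independently, a type attribute placed between them, or href appearing before rel, makes no difference to the result.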

ShinTakezou
  • If you put type before href and after rel, it won't work... As already suggested, the more robust way to do it is through an HTML parser... – ShinTakezou Jul 02 '11 at 09:37
  • @ziiee: The regexp is basically the same for Perl and C#. – Zano Jul 02 '11 at 10:18
  • Luckily, PCRE-style regexes are rather widespread, and in general many regex syntaxes are similar. ... I am also almost sure there is something rather standard in C# to parse HTML... – ShinTakezou Jul 02 '11 at 12:33
1

This should match the whole link tag that contains href=http://3dbin.com/favicon.ico

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />

Correction based on your comment:

I see you have a C# solution. Excellent! But in case you are still wondering whether it can be done with regular expressions, the following expression does what you want. Group 1 of the match holds only the URL.

 <link .*? href="(.*?\.ico)"

Simple C# snippet that makes use of it:

// This is the snippet from your example with an extra link item in the form <link ... href="...ico" > ... </link>,
// just to make sure it is picked up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

// The dot before "ico" is escaped so that it matches a literal '.' only.
foreach (Match match in Regex.Matches(htmlText, "<link .*? href=\"(.*?\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

which prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico
Rob
  • I don't want to match the whole tag. The HTML I provided above is just a sample; it can vary between websites – ziaasp Jul 02 '11 at 10:09
  • Now that many browsers support png favicons, can you make it so that the regex does not skip over other file type icons? – Hermanboxcar Oct 10 '22 at 21:17
  • @Hermanboxcar Sure, you would add a group with the or operator '|' for the different file extensions you want to pick up. Something like this to pick up .ico, .png or .jpeg: – Rob Oct 22 '22 at 12:10
1

This is not a job for a regular expression, as you'll see if you spend 2 minutes on StackOverflow looking for how to parse HTML.

Use an HTML parser instead!

Here's a trivial example in Python (I'm sure this is equally do-able in C#):

% python
Python 2.7.1 (r271:86832, May 16 2011, 19:49:41) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen('https://stackoverflow.com/')
>>> soup = BeautifulSoup(page)
>>> link = soup.html.head.find(lambda x: x.name == 'link' and x['rel'] == 'shortcut icon')
>>> link['href']
u'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
>>> link['href'].endswith('.ico')
True
johnsyweb