
I am using HtmlAgilityPack to get the meta and other descriptions of a page. The code works fine for simpler websites such as Tumblr, Twitter, and Stack Overflow.

But when I try to load major sites such as Google, it gives me only the title "Google" and no description tag. Similarly, for Facebook it gives me no description, and the title it returns is "Update your browser | Facebook".

I am new to this package. I downloaded the latest version from NuGet in MS WebMatrix. The code I am using is:

@using HtmlAgilityPack;
@{
  Layout = "~/_SiteLayout.cshtml";
  var Title = "";
  var Description = "";
  using(var client = new WebClient()){
    var html = client.DownloadString("http://www.facebook.com");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var title = doc.DocumentNode.Descendants("title").FirstOrDefault();
    if(title != null){
        Title = title.InnerText;
    }
    var description = doc.DocumentNode.Descendants("meta")
                                      .Where(n => n.GetAttributeValue("name", String.Empty)
                                      .Contains("description")).FirstOrDefault();
    if(description != null){
        Description = description.GetAttributeValue("content", string.Empty);
    }
  }
}

Judging by the title Facebook returns, this looks like an outdated-browser issue. How can I fix this?

Afzaal Ahmad Zeeshan
  • You probably want to send a User Agent string instead of the default one that the .NET framework sends... See: http://stackoverflow.com/a/11841680/736079 As for the description, the HTML sent out by facebook and google doesn't contain the meta-description tag, so there's nothing to find. – jessehouwing Nov 17 '13 at 06:28
  • Oh, that's good, new information for me! :) So what do other websites do? Take Facebook: they show a description for Google too. How do they fetch it? Or do they write it themselves? – Afzaal Ahmad Zeeshan Nov 17 '13 at 06:30
  • Facebook seems to send out a meta-description tag only when you're not logged in... When you fix the UserAgent string, they might actually send you one. – jessehouwing Nov 17 '13 at 06:33
  • Okay, let me try that! :) But they still don't send anything. – Afzaal Ahmad Zeeshan Nov 17 '13 at 06:40

1 Answer


After searching for a long time for this, I got the solution from Mike Brind on ASP.NET Forums.

var Image = "";
using(var client = new WebClient()){
    client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36");
    // No "method" or "version" headers are needed: DownloadString issues a GET,
    // and the HTTP method and version travel in the request line, not in headers.
    client.Headers.Add("accept",
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    var html = client.DownloadString("http://www.google.com");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var title = doc.DocumentNode.Descendants("title").FirstOrDefault();
    if(title != null){
        Title = title.InnerText;
    }
    var description = doc.DocumentNode.Descendants("meta")
                                      .Where(n => n.GetAttributeValue("name", String.Empty)
                                      .Contains("description")).FirstOrDefault();
    if(description != null){
        Description = description.GetAttributeValue("content", string.Empty);
    }
    var image = doc.DocumentNode.Descendants("link")
                                .Where(n => n.GetAttributeValue("rel", String.Empty)
                                .Contains("shortcut icon")).FirstOrDefault();
    if(image != null) {
        Image = image.GetAttributeValue("href", string.Empty);
    }
}

This is the code that was needed. The key point is that when someone requests a page from their computer, the browser sends details about itself to the server in the request headers (such as the user agent), whereas my code was sending none. That is why Facebook and Google were not returning the content I expected. Once I included browser-like headers, they returned the details I needed.
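To make the difference concrete, here is a minimal sketch that compares a fresh WebClient with one that has browser-like headers set, without making any network request. The class name HeaderDemo and the helper CreateBrowserLikeClient are hypothetical names used only for illustration, and the user-agent string is simply the one from the code above; any current browser string should serve the same purpose.

```csharp
using System;
using System.Net;

static class HeaderDemo
{
    // Hypothetical helper (not part of the original answer): returns a WebClient
    // that identifies itself with a browser-like User-Agent and Accept header.
    public static WebClient CreateBrowserLikeClient()
    {
        var client = new WebClient();
        client.Headers.Add("user-agent",
            "Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36");
        client.Headers.Add("accept",
            "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        return client;
    }

    static void Main()
    {
        using (var plain = new WebClient())
        using (var browserLike = CreateBrowserLikeClient())
        {
            // A fresh WebClient has no User-Agent configured, so the lookup yields null.
            Console.WriteLine("default:      " + (plain.Headers["User-Agent"] ?? "(none)"));

            // Header names are case-insensitive, so "user-agent" is found as "User-Agent".
            Console.WriteLine("browser-like: " + browserLike.Headers["User-Agent"]);
        }
    }
}
```

In the Razor page, the same helper could replace the `new WebClient()` call, which keeps the header setup in one place if several pages are fetched.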

After that, it was good to go!

Afzaal Ahmad Zeeshan