18

I'm revisiting some old code of mine and have stumbled upon a method for getting the title of a website from its URL. It's not really what you would call a stable method: it often fails to produce a result, sometimes produces incorrect results, and sometimes drops characters from the title because they are in a different encoding.

Does anyone have suggestions for improvements over this old version?

public static string SuggestTitle(string url, int timeout)
{
    WebResponse response = null;
    string line = string.Empty;

    try
    {
        WebRequest request = WebRequest.Create(url);
        request.Timeout = timeout;

        response = request.GetResponse();
        Stream streamReceive = response.GetResponseStream();
        Encoding encoding = Encoding.UTF8;
        StreamReader streamRead = new StreamReader(streamReceive, encoding);

        while (!streamRead.EndOfStream)
        {
            line = streamRead.ReadLine();
            if (line.Contains("<title>"))
            {
                line = line.Split(new char[] { '<', '>' })[2];
                break;
            }
        }
    }
    catch (Exception) { }
    finally
    {
        if (response != null)
        {
            response.Close();
        }
    }

    return line;
}

One final note - I would like the code to run faster as well, as it currently blocks until the entire page has been fetched. If I could fetch only the head of the page instead of the whole document, that would be great.
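To illustrate the kind of early exit I have in mind (this is only a sketch, not working code I'm committed to; the class and method names are made up): read the response stream in small chunks and stop as soon as `</title>` has been seen, so the rest of the page is never downloaded.

```csharp
using System;
using System.IO;
using System.Text;

static class TitleScanner
{
    // Reads the stream in small chunks, accumulating what has been read,
    // and stops as soon as a closing </title> tag appears in the buffer.
    // Because the search runs over the whole accumulated text, a tag split
    // across a chunk boundary is still found.
    public static string ReadUntilTitleClose(Stream stream)
    {
        var sb = new StringBuilder();
        var buffer = new byte[1024];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            sb.Append(Encoding.UTF8.GetString(buffer, 0, read));
            if (sb.ToString().IndexOf("</title>", StringComparison.OrdinalIgnoreCase) >= 0)
                break; // stop downloading; the title is already in hand
        }
        return sb.ToString();
    }
}
```

The stream passed in would be the one from `response.GetResponseStream()` in the method above; the early `break` is what keeps the download short.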

Morten Christiansen

3 Answers

52

A simpler way to get the content:

WebClient x = new WebClient();
string source = x.DownloadString("http://www.singingeels.com/");

A simpler, more reliable way to get the title:

string title = Regex.Match(source, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
    RegexOptions.IgnoreCase).Groups["Title"].Value;
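Putting the two snippets together, a minimal helper might look like the sketch below. The method names `GetPageTitle` and `ExtractTitle` are my own; note that `WebClient` can mangle non-UTF-8 pages unless you set its `Encoding` (or inspect the charset from the response headers), which also addresses the encoding problem in the question.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class PageTitle
{
    // Downloads the page and extracts the <title> text with the regex above.
    // Returns an empty string when no title is found.
    public static string GetPageTitle(string url)
    {
        using (var client = new WebClient())
        {
            // Assumption: the page is UTF-8; adjust for other charsets.
            client.Encoding = System.Text.Encoding.UTF8;
            string source = client.DownloadString(url);
            return ExtractTitle(source);
        }
    }

    // Pure string-in, string-out helper so the parsing is testable offline.
    public static string ExtractTitle(string source)
    {
        return Regex.Match(source, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
            RegexOptions.IgnoreCase).Groups["Title"].Value;
    }
}
```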
Timothy Khouri
10

Perhaps this suggestion opens up a new world for you. I also had this question and came to the following solution.

Download "Html Agility Pack" from http://html-agility-pack.net/?z=codeplex

Or get it from NuGet: https://www.nuget.org/packages/HtmlAgilityPack/ and add the reference to your project.

Add the following using directive to your code file:

using HtmlAgilityPack;

Write the following code in your method:

var webGet = new HtmlWeb();
var document = webGet.Load(url);    
var title = document.DocumentNode.SelectSingleNode("html/head/title").InnerText;

Sources:

https://codeshare.co.uk/blog/how-to-scrape-meta-data-from-a-url-using-htmlagilitypack-in-c/
HtmlAgilityPack obtain Title and meta

Roberto B
-1

In order to accomplish this you are going to need to do a couple of things.

  • Make your app threaded, so that you can process multiple requests at a time and maximize the number of HTTP requests being made.
  • During the async request, download only the amount of data you want to pull back; you could parse the data as it comes back, looking for the title tag.
  • You will probably want to use a regex to pull out the title name.

I have done this before with SEO bots and I have been able to handle almost 10,000 requests at once. You just need to make sure that each web request can be self-contained in a thread.
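The bullet points above can be sketched roughly as follows. This is an illustration, not the answerer's actual bot code: it uses the modern `HttpClient` with `ResponseHeadersRead` so the body is streamed rather than buffered, and the made-up `FetchTitleAsync` stops reading as soon as the title regex matches. The parsing lives in a separate `TryExtractTitle` helper so it can be exercised without a network.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class SeoFetcher
{
    static readonly Regex TitleRx =
        new Regex(@"<title\b[^>]*>(?<t>[\s\S]*?)</title>", RegexOptions.IgnoreCase);

    // Pure parsing helper: true (with the title text) once a complete
    // <title>...</title> element is present in the accumulated html.
    public static bool TryExtractTitle(string html, out string title)
    {
        var m = TitleRx.Match(html);
        title = m.Success ? m.Groups["t"].Value : string.Empty;
        return m.Success;
    }

    // Streams the response and stops reading as soon as a title is found,
    // so only the start of the page is ever downloaded.
    public static async Task<string> FetchTitleAsync(HttpClient client, string url)
    {
        using (var resp = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        using (var stream = await resp.Content.ReadAsStreamAsync())
        {
            var sb = new StringBuilder();
            var buf = new byte[2048];
            int n;
            while ((n = await stream.ReadAsync(buf, 0, buf.Length)) > 0)
            {
                sb.Append(Encoding.UTF8.GetString(buf, 0, n)); // assumes UTF-8
                if (TryExtractTitle(sb.ToString(), out var title))
                    return title; // bail out early; rest of page never downloaded
            }
            return string.Empty;
        }
    }
}
```

With this shape, many URLs can be processed concurrently via `Task.WhenAll` over one shared `HttpClient`, without dedicating a thread to each request.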

Nick Berardi
  • You certainly *don't* want to give each request its own thread if you want to handle 10,000 requests at a time! (The stacks involved would eat up your memory like crazy.) Using an async API will parallelize the operation *without* costing you a thread per request. – Jon Skeet Nov 30 '08 at 20:34
  • It's a moot point as I only need to perform a single request at a time. The need for speed is because the user is waiting for the reply. – Morten Christiansen Nov 30 '08 at 20:51
  • @Jon, well like I said mine was an SEO bot that analyzes and obviously you want to put limits on the number of requests at a time per analysis to keep the memory reasonable. However the 10,000 was a stress test scenario. And the async was a suggestion on how to just download the header. – Nick Berardi Dec 01 '08 at 14:02
  • @Morten, I was just going off the very basic details you gave me. You said you wanted it to run faster, and that you only wanted to download the header. The async request is the best way to limit the size that is downloaded, because you can stop the process when you have found your answer. – Nick Berardi Dec 01 '08 at 14:04
  • @Jon, you are using a pretty definite statement in that you don't want a thread for each request, that may be true but you are forgetting about the analysis that goes along with each request. There would be a horrible queue build up if the analysis processor was single threaded. – Nick Berardi Dec 01 '08 at 14:06