
I'm looking for a method that will allow me to get the title of a webpage and store it as a string.

However, all the solutions I have found so far involve downloading the source code for the page, which isn't really practical for a large number of webpages.

The only way around this that I can see would be to limit the download, either reading only a set number of characters or stopping once the <title> tag is reached, but even that could still be quite large?

Thanks


2 Answers


As the <title> tag is part of the HTML itself, there is no way to find "just the title" without downloading the file. You should, however, be able to download only a portion of the file, stopping once you've read the <title> tag or the </head> tag, but you'll still need to download at least part of the file.

This can be accomplished with HttpWebRequest/HttpWebResponse and reading in data from the response stream until we've either read in a <title></title> block, or the </head> tag. I added the </head> tag check because, in valid HTML, the title block must appear within the head block - so, with this check we will never parse the entire file in any case (unless there is no head block, of course).

The following should be able to accomplish this task:

string title = "";
try {
    HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
    HttpWebResponse response = (request.GetResponse() as HttpWebResponse);

    using (Stream stream = response.GetResponseStream()) {
        // compiled regex to check for <title></title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0) {
            // convert the byte-array to a string and add it to the rest of the
            // contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success) {
                // we found a <title></title> match =]
                title = m.Groups[1].Value.ToString();
                break;
            } else if (contents.Contains("</head>")) {
                // reached end of head-block; no title found =[
                break;
            }
        }
    }
} catch (Exception e) {
    Console.WriteLine(e);
}

UPDATE: Updated the original source-example to use a compiled Regex and a using statement for the Stream for better efficiency and maintainability.

  • I'd give +2 for the sad face on the last comment but I can't =[ – Charleh Jul 25 '12 at 15:39
  • 2
    This is a great code solution, thanks. FYI - The problem with compiled regex is that it wont really help here, because you compile the regex for each request. It would be better to compile it once at run time then use it in this method. Compilation takes some time and much more memory, but great for massive (100mb+) documents or loops (hundreds of thousands). Uncompiled regex gets cached and for the size of this text wont really have much of an affect. +1 – Piotr Kula Mar 21 '15 at 10:23
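
As a minimal sketch of what that comment suggests (the class name TitleScraper and the helper ExtractTitle are illustrative, not from the original answer), the compiled regex can be hoisted into a static readonly field so it is built once and reused for every request:

using System.Text.RegularExpressions;

static class TitleScraper {
    // compiled once when the type is first used, then reused for every request
    private static readonly Regex TitleCheck =
        new Regex(@"<title>\s*(.+?)\s*</title>",
                  RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // returns the title text if a <title></title> block is present, otherwise ""
    public static string ExtractTitle(string contents) {
        Match m = TitleCheck.Match(contents);
        return m.Success ? m.Groups[1].Value : "";
    }
}

The loop in the answer above could then call TitleScraper.ExtractTitle(contents) instead of constructing the Regex inside the method.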

A simpler way to handle this would be to download the whole page, then split out the part you need:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    // reuse a single HttpClient instance rather than creating one per request
    private static readonly HttpClient hc = new HttpClient();

    private async Task GetSiteAsync(string url)
    {
        HttpResponseMessage response = await hc.GetAsync(new Uri(url, UriKind.Absolute));
        string source = await response.Content.ReadAsStringAsync();

        // process the source here
    }

To process the source, you can use the method described in the article on Getting Content From Between HTML Tags.
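
For the title specifically, a minimal sketch of that processing step, here using a regex rather than the string-splitting the answer mentions, and assuming the source string produced by the method above, might look like this (GetTitle is a hypothetical helper name):

    using System.Text.RegularExpressions;

    // pulls the text between <title> and </title> out of the downloaded source,
    // or returns "" if no title is found
    private static string GetTitle(string source)
    {
        Match m = Regex.Match(source, @"<title>\s*(.+?)\s*</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }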
