0

I have a method that downloads a webpage and extracts the title tag, but depending on the website, the result can be encoded or in the wrong character set. Is there a bulletproof way to get a website's title when pages are encoded differently?

Some URLs that I have tested, with different results:

The method I use:

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = null;

        response = client.GetAsync(uri).Result;

        if (!response.IsSuccessStatusCode)
        {
            string errorMessage = "";

            try
            {
                XmlSerializer xml = new XmlSerializer(typeof(HttpError));
                HttpError error = xml.Deserialize(response.Content.ReadAsStreamAsync().Result) as HttpError;
                errorMessage = error.Message;
            }
            catch (Exception)
            {
                errorMessage = response.ReasonPhrase;
            }

            throw new Exception(errorMessage);
        }

        var html = response.Content.ReadAsStringAsync().Result;
        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }

    if (title == string.Empty)
    {
        title = uri.ToString();
    }

    return title;
}
Alexandre Jobin
  • I had a similar problem. First I used `Utf8Checker.IsUtf8` (found somewhere on the internet). If the content is not UTF-8, I checked the encoding (using HtmlAgilityPack) by checking the *meta* tag's *http-equiv* attribute. I tested it with your URLs and it seems to work. (BTW: the problem is not in your code. Some sites aren't correctly coded/configured to return the correct encoding, so you have to do something closer to what browsers do.) – Eser Apr 22 '16 at 22:11
  • HTML is essentially XML, try using an XML parser and search for the title attribute – Wobbles Apr 22 '16 at 22:47
  • @Wobbles `HTML is essentially XML` absolutely not. You cannot parse an HTML document with an XML parser. And the problem here is detecting the correct **encoding**. The correct way to do it (btw: that is not enough in this case, as I mentioned already) is to use an HTML parser like *HtmlAgilityPack*, not an XML parser or regex. – Eser Apr 22 '16 at 23:06
  • @Eser actually YES you absolutely can. I wrote a script to do this in PHP because I wanted to fetch site title tags, and it is the absolute best working solution. Don't knock it till you've tried it. – Wobbles Apr 23 '16 at 12:00
  • @Eser There is a little trickery in between, but running it through an XML parser was the key step that helped me extract tags, even poorly formatted ones that regex skipped over. I later learned there is a DOM formatter that could perhaps have done it more easily, but nonetheless the XML function is what I still use because it has worked without flaw. – Wobbles Apr 23 '16 at 12:17
  • @Wobbles There are tags in HTML that don't require closing tags, which is not valid XML. – Eser Apr 23 '16 at 18:55
  • @Wobbles just try to load the html with an xml parser **1)** a real xml `var xdoc1 = XDocument.Load("http://rss.cnn.com/rss/edition.rss");` **2)** [this page:](http://stackoverflow.com/questions/36803819/how-get-webpages-title-when-they-are-encoded-differently) `var xdoc2 = XDocument.Load("http://stackoverflow.com/questions/36803819/how-get-webpages-title-when-they-are-encoded-differently");` You'll get exception in the 2nd one... – Eser Apr 23 '16 at 19:02
  • @Wobbles http://stackoverflow.com/questions/32572928/parsing-an-html-document-using-an-xml-parser – Eser Apr 23 '16 at 19:10
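
The point argued in the comments above can be checked without any network access. The following is only a minimal sketch: the class name `XmlVersusHtmlDemo`, the helper `ParsesAsXml`, and both sample strings are made up for illustration and do not come from the thread.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlVersusHtmlDemo
{
    // Returns true when the markup parses as well-formed XML.
    public static bool ParsesAsXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Hypothetical samples: the unclosed <meta> is legal HTML but not well-formed XML.
        string html = "<html><head><meta charset=\"utf-8\"><title>Hi</title></head><body></body></html>";
        string xml = "<rss><channel><title>Hi</title></channel></rss>";

        Console.WriteLine(ParsesAsXml(html));
        Console.WriteLine(ParsesAsXml(xml));
    }
}
```

An XML parser may still work on pages that happen to be well-formed, which would explain why both commenters report different experiences.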

3 Answers

0

The charset is not always present in the header, so we must also check the meta tags; if it is not there either, fall back to UTF-8 (or something else). The title may also be HTML-encoded, so it needs to be decoded.

The code below comes from the GitHub project Abot. I have modified it a little.

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = client.GetAsync(uri).Result;

        if (!response.IsSuccessStatusCode)
        {
            throw new Exception(response.ReasonPhrase);
        }

        var contentStream = response.Content.ReadAsStreamAsync().Result;
        // ContentType is null when the server sends no Content-Type header
        var charset = response.Content.Headers.ContentType?.CharSet ?? GetCharsetFromBody(contentStream);

        Encoding encoding = GetEncodingOrDefaultToUTF8(charset);
        string content = GetContent(contentStream, encoding);

        Match titleMatch = Regex.Match(content, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase);

        if (titleMatch.Success)
        {
            title = titleMatch.Groups["Title"].Value;

            // decode HTML entities in case the title has been encoded
            title = WebUtility.HtmlDecode(title).Trim();
        }
    }

    if (string.IsNullOrWhiteSpace(title))
    {
        title = uri.ToString();
    }

    return title;
}

private string GetContent(Stream contentStream, Encoding encoding)
{
    contentStream.Seek(0, SeekOrigin.Begin);

    using (StreamReader sr = new StreamReader(contentStream, encoding))
    {
        return sr.ReadToEnd();
    }
}

/// <summary>
/// Try getting the charset from the body content.
/// </summary>
/// <param name="contentStream"></param>
/// <returns></returns>
private string GetCharsetFromBody(Stream contentStream)
{
    contentStream.Seek(0, SeekOrigin.Begin);

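    // Not disposed on purpose: disposing the reader would close contentStream, which is read again later
    StreamReader srr = new StreamReader(contentStream, Encoding.ASCII);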
    string body = srr.ReadToEnd();
    string charset = null;

    if (body != null)
    {
        //find expression from : http://stackoverflow.com/questions/3458217/how-to-use-regular-expression-to-match-the-charset-string-in-html
        Match match = Regex.Match(body, @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)", RegexOptions.IgnoreCase);

        if (match.Success)
        {
            charset = string.IsNullOrWhiteSpace(match.Groups[2].Value) ? null : match.Groups[2].Value;
        }
    }

    return charset;
}

/// <summary>
/// Try parsing the charset or fallback to UTF8
/// </summary>
/// <param name="charset"></param>
/// <returns></returns>
private Encoding GetEncodingOrDefaultToUTF8(string charset)
{
    Encoding e = Encoding.UTF8;

    if (charset != null)
    {
        try
        {
            e = Encoding.GetEncoding(charset);
        }
        catch
        {
            // Unknown or unsupported charset name: keep the UTF-8 default
        }
    }

    return e;
}
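
To see what the two less obvious steps do, they can be tried in isolation. This is only a rough sketch: the pattern is copied from `GetCharsetFromBody` above, while the class name `TitleHelpersDemo`, the helper `FindMetaCharset`, and the sample markup are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TitleHelpersDemo
{
    // Same pattern as in GetCharsetFromBody above.
    private const string MetaCharsetPattern =
        @"<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s""']*)?([^>]*?)[\s""';]*charset\s*=[\s""']*([^\s""'/>]*)";

    // Returns the charset named in a <meta> tag, or null when none is found.
    public static string FindMetaCharset(string html)
    {
        Match match = Regex.Match(html, MetaCharsetPattern, RegexOptions.IgnoreCase);
        if (!match.Success || string.IsNullOrWhiteSpace(match.Groups[2].Value))
        {
            return null;
        }
        return match.Groups[2].Value;
    }

    public static void Main()
    {
        // HTML5 short form and the older http-equiv form (hypothetical sample markup).
        Console.WriteLine(FindMetaCharset("<meta charset=\"utf-8\">"));
        Console.WriteLine(FindMetaCharset("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"));

        // Entities in a title are turned back into characters.
        Console.WriteLine(WebUtility.HtmlDecode("Caf&eacute; &amp; Bar"));
    }
}
```

Note that the body is first read as ASCII only to locate the meta tag; the real decoding happens afterwards in `GetContent` with the detected encoding.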
Alexandre Jobin
-1

You can try getting all the bytes and converting them to a string with whatever encoding you want, using the Encoding class. It would be something like this:

private string GetUrlTitle(Uri uri)
{
    string title = "";

    using (HttpClient client = new HttpClient())
    {

        // Download the raw bytes; .Result blocks because this method is not async
        var byteData = client.GetByteArrayAsync(uri).Result;
        string html = Encoding.UTF8.GetString(byteData);

        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }

    return title;
}

I hope it helps; if it does, please mark it as the answer.

Uilque Messias
  • Problem in question is to detect the encoding, but you just assume it `UTF8` (just try your code with *all* urls in question) – Eser Apr 22 '16 at 22:16
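
The objection in the comment above can be reproduced offline. The following is a small sketch under stated assumptions: the class name `WrongEncodingDemo`, the helper `Decode`, and the sample text are invented for illustration, and ISO-8859-1 simply stands in for any page that is not served as UTF-8.

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Decodes the given bytes with the named charset.
    public static string Decode(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }

    public static void Main()
    {
        string original = "Caf\u00E9";
        // Bytes as a server using ISO-8859-1 would send them (hypothetical sample).
        byte[] bytes = Encoding.GetEncoding("iso-8859-1").GetBytes(original);

        // Decoded with an assumed charset, then with the charset actually used.
        Console.WriteLine(Decode(bytes, "utf-8") == original);
        Console.WriteLine(Decode(bytes, "iso-8859-1") == original);
    }
}
```

Only the second decode round-trips the text, which is why the charset has to be detected rather than assumed.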
-3

This may help you out. Use globalization

using System;
using System.Globalization;

public class Example
{
    public static void Main()
    {
        string[] values = { "a tale of two cities", "gROWL to the rescue",
                            "inside the US government", "sports and MLB baseball",
                            "The Return of Sherlock Holmes", "UNICEF and         children" };

        TextInfo ti = CultureInfo.CurrentCulture.TextInfo;
        foreach (var value in values)
            Console.WriteLine("{0} --> {1}", value, ti.ToTitleCase(value));
    }
}

Check this out: https://msdn.microsoft.com/en-us/library/system.globalization.textinfo.totitlecase(v=vs.110).aspx

Lanshore