I have a method that downloads a web page and extracts the title tag, but depending on the website, the result can be encoded or in the wrong character set. Is there a bulletproof way to get a website's title when pages are encoded differently?
Some URLs I have tested, with different results:
- https://fr.wikipedia.org/wiki/Québec returns "Québec — Wikipédia". The result is good.
- http://www.remax-quebec.com/fr/index.rmx returns "Condo, chalet ou maison à vendre avec un courtier immobilier | RE/MAX Québec".
- http://www.restomontreal.ca/ returns "Restaurants Montr�al | RestoMontreal".
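The `�` in the last result is what a decoder typically produces when it meets a byte that is invalid in the encoding it was told to use. The following is a minimal sketch of that effect, assuming (this is not confirmed by the results above) that the page is served in ISO-8859-1 and decoded as UTF-8. It also shows `WebUtility.HtmlDecode`, on the further assumption that "encoded" refers to HTML entities such as `&#233;`. The class name `EncodingDemo` and its `Decode` helper are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

static class EncodingDemo
{
    // Decodes the given bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "Montréal" in ISO-8859-1: the 'é' is the single byte 0xE9.
        byte[] bytes = latin1.GetBytes("Montr\u00E9al");

        // Wrong charset: a lone 0xE9 is not valid UTF-8, so it is
        // replaced by U+FFFD (the replacement character).
        string wrong = Decode(bytes, Encoding.UTF8);
        Console.WriteLine(wrong == "Montr\uFFFDal");

        // Matching charset: the text round-trips unchanged.
        string right = Decode(bytes, latin1);
        Console.WriteLine(right == "Montr\u00E9al");

        // HTML entities are a separate layer and need their own decoding step.
        string decoded = WebUtility.HtmlDecode("Qu&#233;bec");
        Console.WriteLine(decoded == "Qu\u00E9bec");
    }
}
```

The two problems are independent: the character set has to be right before the bytes become a string, and entity decoding happens afterwards on the extracted title.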
The method I use:
```csharp
private string GetUrlTitle(Uri uri)
{
    string title = "";
    using (HttpClient client = new HttpClient())
    {
        HttpResponseMessage response = null;
        response = client.GetAsync(uri).Result;
        if (!response.IsSuccessStatusCode)
        {
            string errorMessage = "";
            try
            {
                XmlSerializer xml = new XmlSerializer(typeof(HttpError));
                HttpError error = xml.Deserialize(response.Content.ReadAsStreamAsync().Result) as HttpError;
                errorMessage = error.Message;
            }
            catch (Exception)
            {
                errorMessage = response.ReasonPhrase;
            }
            throw new Exception(errorMessage);
        }
        var html = response.Content.ReadAsStringAsync().Result;
        title = Regex.Match(html, @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
    }
    if (title == string.Empty)
    {
        title = uri.ToString();
    }
    return title;
}
```
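One possible direction, offered as a sketch and not a bulletproof solution: read the body as raw bytes, pick the encoding explicitly, and HTML-decode the extracted title. The sketch assumes the charset can be found in the `Content-Type` header or in a `<meta>` tag near the top of the page, and falls back to UTF-8 otherwise. `TitleFetcher`, `ExtractTitle`, `TryGetEncoding`, and `GetUrlTitleAsync` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class TitleFetcher
{
    static readonly HttpClient Client = new HttpClient();

    // Finds charset=... inside a <meta> tag (both the HTML5 and http-equiv forms).
    static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]+charset\s*=\s*[""']?\s*(?<cs>[\w-]+)", RegexOptions.IgnoreCase);

    static readonly Regex TitleTag = new Regex(
        @"<title\b[^>]*>\s*(?<Title>[\s\S]*?)</title>", RegexOptions.IgnoreCase);

    // Maps a charset label to an Encoding; returns null if missing or unsupported.
    static Encoding TryGetEncoding(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return null;
        try { return Encoding.GetEncoding(name.Trim().Trim('"', '\'')); }
        catch (ArgumentException) { return null; }
    }

    // Decodes the body with the best available charset and returns the decoded title.
    public static string ExtractTitle(byte[] body, string headerCharset)
    {
        Encoding encoding = TryGetEncoding(headerCharset);
        if (encoding == null)
        {
            // Assumption: the markup itself is ASCII-compatible, so a Latin-1
            // pass over the first few KB is enough to locate the meta tag.
            string head = Encoding.GetEncoding("iso-8859-1")
                .GetString(body, 0, Math.Min(body.Length, 4096));
            encoding = TryGetEncoding(MetaCharset.Match(head).Groups["cs"].Value)
                       ?? Encoding.UTF8;
        }

        string html = encoding.GetString(body);
        string title = TitleTag.Match(html).Groups["Title"].Value;

        // Turns entities such as &#233; or &eacute; into real characters.
        return WebUtility.HtmlDecode(title).Trim();
    }

    public static async Task<string> GetUrlTitleAsync(Uri uri)
    {
        using (HttpResponseMessage response = await Client.GetAsync(uri))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string title = ExtractTitle(body, response.Content.Headers.ContentType?.CharSet);
            return title.Length == 0 ? uri.ToString() : title;
        }
    }
}
```

On .NET Core and later, legacy code pages such as windows-1252 are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`; without that, this sketch silently falls back to UTF-8 for those pages. Pages that declare the wrong charset, or none at all, would still need heuristic detection, which is outside what this sketch covers.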
… etc., which isn't valid XML. – Eser Apr 23 '16 at 18:55