I'm getting content from different websites. Some of them send this Content-Type header:
Content-Type: text/html; charset=utf-8
and others send:
Content-Type: text/html
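To illustrate the difference, the header can also be inspected from PHP like this (just a sketch; the URL is a placeholder):

// Sketch: print the Content-Type header a site sends.
$url = 'http://example.com/'; // placeholder
foreach (get_headers($url) as $header) {
    if (stripos($header, 'Content-Type:') === 0) {
        echo $header, "\n"; // e.g. "Content-Type: text/html; charset=utf-8"
    }
}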
I wrote a Python script that uses the requests library to check the encodings in bulk:
import requests

for site in sites:
    r = requests.get(site)
    print(r.encoding)
It printed UTF-8 for some websites and ISO-8859-1 for the others. (From what I've read, requests just falls back to ISO-8859-1 when the header doesn't declare a charset, so the second result may be a default rather than the page's real encoding.)
I'm storing these results in a MySQL database whose collation is latin1_swedish_ci, which is the default (I'm using XAMPP).
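Here is roughly how the rows get written (a sketch; the credentials and table name are placeholders, and note that I don't set any charset on the connection):

// Sketch of the insert; credentials and table name are placeholders.
// No charset is set on the connection.
$db = new mysqli('localhost', 'root', '', 'articles_db');
$stmt = $db->prepare('INSERT INTO articles (url, body) VALUES (?, ?)');
$stmt->bind_param('ss', $url, $body);
$stmt->execute();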
The issue is that these articles contain special characters like é ë ü ï. For some websites these characters come out as ë when they should be ë, while the others work fine.
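If I understand the pattern correctly, the UTF-8 bytes are being reinterpreted as Latin-1; at least this one-liner reproduces the garbling (mb_convert_encoding is from the mbstring extension):

// Reinterpret the UTF-8 bytes of "ë" (0xC3 0xAB) as ISO-8859-1:
echo mb_convert_encoding("ë", 'UTF-8', 'ISO-8859-1'); // prints "ë"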
What I'm looking for is a way to get the same result in both cases. I searched and found solutions like utf8_decode(), but they don't work in both cases: if the string is already fine, it gets messed up:
$str = "ë";             // already correct
echo utf8_decode($str); // now garbled (the raw Latin-1 byte 0xEB)
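The closest I've come is guarding the conversion so it only runs when the string isn't already valid UTF-8 (to_utf8 is just a helper I made up, and I'm not sure assuming ISO-8859-1 is safe for every site):

// Convert only when needed; falls back to assuming ISO-8859-1,
// which may be wrong for some sites (e.g. Windows-1252).
function to_utf8($str) {
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, leave it alone
    }
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}
echo to_utf8("ë");                  // stays "ë"
echo to_utf8(utf8_decode("ë"));     // mangled input comes back as "ë"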
First, I'm sorry about this question, but I had to post it because I don't know anything about encodings. What can I do to get the same result in both cases?
If it matters, I'm using QueryPath to parse the HTML of these sites, passing array('convert_to_encoding' => 'utf-8') as the options.
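Roughly like this ($html holds the fetched page; the selector is just an example):

// Sketch of my QueryPath call; $html is the fetched page.
$qp = htmlqp($html, 'body', array('convert_to_encoding' => 'utf-8'));
$text = $qp->text();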