
I'm fetching content from different websites. Some of them send this Content-Type header:

Content-Type: text/html; charset=utf-8

and others

Content-Type: text/html

I used a Python script using the requests library to check the encoding in bulk:

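import requests

# `sites` is the list of URLs being checked
for site in sites:
    r = requests.get(site)
    # encoding that requests inferred from the Content-Type header
    print(r.encoding)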

It printed UTF-8 for some websites and ISO-8859-1 for the others. I'm storing these results in a MySQL database whose collation is latin1_swedish_ci, the default (I'm using XAMPP).

The issue is that these articles contain special characters such as é ë ü ï. For some websites these characters come out garbled, e.g. Ã« where it should be ë, while the others work fine.

What I'm looking for is a solution that gives the same result in both cases. I searched and found some solutions, but they don't work in both cases: if the string is already OK, it gets messed up:

$str = "ë";

echo utf8_decode($str);

First, I'm sorry about this question, but I had to post it because I don't know anything about encoding. What can I do to get the same result?

If it matters, I'm using QueryPath to parse the HTML of these sites, and I'm passing array('convert_to_encoding' => 'utf-8') as the options.

Pierre
  • Um, seems pretty simple: Make sure it's UTF-8 between fetching it from the web and inserting it into the database. Apparently you know the encoding when you fetch the site (insofar one can know the encoding of a byte stream in the real world). –  Jan 03 '14 at 14:07
  • @delnan so if my string contains `Ã«`, what should I do before inserting it in the database? Currently it's showing as it is, not as `ë`. Thanks. – Pierre Jan 03 '14 at 14:11
  • Make sure your php.ini has default_charset=utf-8, or set it at the start of your script with `ini_set('default_charset', 'utf-8');` – RaggaMuffin-420 Jan 03 '14 at 14:12
  • Read: http://www.joelonsoftware.com/articles/Unicode.html – Sumurai8 Jan 03 '14 at 14:29
  • @Peter In the Python script, first decode from whatever encoding it's in, then encode in UTF-8. –  Jan 03 '14 at 14:34
  • see this : http://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8 – web-nomad Jan 03 '14 at 14:51

1 Answer


Set your database collation to utf8_unicode_ci (phpMyAdmin > select the DB > Operations > Collation). This collation belongs to the utf8 character set, which can represent a much wider range of "exotic" characters than latin1.

You will probably need to re-insert the content that had dodgy characters.
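If re-fetching is not practical, the damaged strings can sometimes be repaired in place. The following is only a rough sketch, assuming the damage came from UTF-8 bytes being read as ISO-8859-1 (the usual cause of `Ã«` appearing where `ë` was meant). The helper name `repair_mojibake` is made up for this example. Unlike a blind `utf8_decode`, it leaves strings that are already fine untouched.

```python
def repair_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage; return text unchanged if it looks fine."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return text.encode("iso-8859-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside ISO-8859-1, or the recovered
        # bytes are not valid UTF-8. In both cases the string was most likely
        # not damaged this way, so it is returned as-is.
        return text


fixed = repair_mojibake("Ã«")      # damaged input, becomes "ë"
untouched = repair_mojibake("ë")   # already correct, stays "ë"
```

This is a heuristic, not a guarantee: text that went through a different wrong encoding (Windows-1252, for instance) would need a different round trip.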

I've never had dodgy character display problems since using this collation for my databases, combined with using the correct UTF-8 charset meta tag in my HTML documents:

<meta charset="utf-8">

These two actions combined should handle the problem.
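On the fetching side, the comments suggest decoding each page from whatever encoding it is in and then storing UTF-8. Here is a minimal sketch of that idea using only the standard library. The function names `charset_from_header` and `decode_body` are hypothetical, and the fallback order (try UTF-8 first, then ISO-8859-1) is an assumption for pages that send `Content-Type: text/html` with no charset.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type):
    """Decode raw response bytes to text, preferring the declared charset."""
    declared = charset_from_header(content_type)
    candidates = [declared] if declared else []
    # Assumption: an undeclared page is UTF-8 if it decodes cleanly, else ISO-8859-1.
    candidates += ["utf-8", "iso-8859-1"]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Not reached in practice, since ISO-8859-1 can decode any byte sequence.
    return raw.decode("utf-8", errors="replace")
```

With requests this would be called as `decode_body(r.content, r.headers.get("Content-Type", "text/html"))`, and the result encoded with `.encode("utf-8")` before it is stored. Note that the sketch trusts a declared charset, so a site that declares the wrong one will still come out garbled.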

Josh Harrison