15

I am extremely confused and puzzled by how I store strings with unusual characters (to someone who is used to dealing with a UK English character set) in them.

Here is my example.

I have this name: Bientôt l'été

This is how I created my table:

CREATE TABLE MyTable(
    'my_id' INT(10) unsigned NOT NULL,
    'my_name' TEXT CHARACTER SET utf8 NOT NULL,
    PRIMARY KEY(`my_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Using this simplified python script I am trying to insert the string into a MySQL database and table:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb

mystring = "Bientôt l'été"

myinsert = [ { "name" : mystring.encode("utf-8").strip()[:65535], "id" : 1 } ]

con = None
con = MySQLdb.connect('localhost', 'abc', 'def', 'ghi');
cur = con.cursor()
sql = "INSERT INTO 'MyTable' ( 'my_id', 'my_name' ) VALUES ( %(id)s, %(name)s ) ; "
cur.executemany( sql, myinsert )
con.commit()
if con: con.close()

If I then try to read the name in the database it is stored as: Bientôt l'été

I want it to read: Bientôt l'été

How do I get the python script/MySQL database to do this? I think this is something to do with the character set and how it is set but I can't find a simple web page that explains this without any technical jargon. I've been struggling with this for hours!

I have looked at this and I see character_set_server is set as latin1 but I don't know if this is the problem or how to change it:

mysql> show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
user1464409
  • 1,032
  • 5
  • 18
  • 31
  • 1
    as far as i know `u"string"` is how you specify a string in utf8 format – Gung Foo Feb 11 '13 at 11:58
  • If this is Python 2, you do *not* need to call `encode()`. Only use that if `mystring` is a `unicode` object. Because you set the source code encoding to UTF8, your `mystring` is *already encoded*. – Martijn Pieters Feb 11 '13 at 12:38

4 Answers4

14

Did you try, this query set names utf8;

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb

mystring = "Bientôt l'été"

myinsert = [{ "name": mystring.encode("utf-8").strip()[:65535], "id": 1 }]

con = MySQLdb.connect('localhost', 'abc', 'def', 'ghi');
cur = con.cursor()

cur.execute("set names utf8;")     # <--- add this line,

sql = "INSERT INTO 'MyTable' ( 'my_id', 'my_name' ) VALUES ( %(id)s, %(name)s ) ; "
cur.executemany( sql, myinsert )
con.commit()
if con: con.close()
Adem Öztaş
  • 20,457
  • 4
  • 34
  • 42
  • 4
    Yes, that was my problem (along with a complete misunderstanding of character sets!). I ended up using this `con.set_character_set('utf8') cur.execute('SET NAMES utf8;') cur.execute('SET CHARACTER SET utf8;') cur.execute('SET character_set_connection=utf8;')` – user1464409 Feb 11 '13 at 14:28
6

Set the default client character set:

<?php
$con=mysqli_connect("localhost","my_user","my_password","my_db");
// Check connection
if (mysqli_connect_errno())
  {
  echo "Failed to connect to MySQL: " . mysqli_connect_error();
  }

// Change character set to utf8
mysqli_set_charset($con,"utf8");
mysqli_close($con);
?>
Iman Marashi
  • 5,593
  • 38
  • 51
  • working for me, the text row in the DB was set to something like latin1_swedish_ci and was returning null values, now it works. – Juan Boero Jul 18 '17 at 23:29
3

Your problem is with how you display the data when you read it from the database. You are looking at UTF-8 data mis-interpreted as Latin 1.

>>> "Bient\xf4t l'\xe9t\xe9"
"Bientôt l'été"
>>> "Bient\xf4t l'\xe9t\xe9".encode('utf8').decode('latin1')
"Bientôt l'été"

The above encoded a unicode string to UTF-8, then misinterprets it as Latin 1 (ISO 8859-1), and the ô and é codepoints, which were encoded to two UTF-8 bytes each, are re-interpreted as two latin-1 code points each.

Since you are running Python 2, you shouldn't need to .encode() already encoded data. It'd be better if you inserted unicode objects instead; so you want to decode instead:

myinsert = [ { "name" : mystring.decode("utf-8").strip()[:65535], "id" : 1 } ]

By calling .encode() on the encoded data, you are asking Python to first decode the data (using the default encoding) so that it then can encode for you. If the default on your python has been changed to latin1 you would see the same effect; UTF-8 data interpreted as Latin 1 before being re-encoded to Latin-1.

You may want to read up on Python and Unicode:

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Martijn, this is Python 2. How can I get the data to read back as Latin-1? If I change the line to `myinsert = [ { "name" : u"Bientôt l'été", "id" : 1 } ]` and then look at the MySQL database it reads as `Bientôt l'été` which is what I want (except I want to do this for a variable string eg `mystring`). Wouldn't this suggest its not a display problem? – user1464409 Feb 11 '13 at 12:35
  • @user1464409 Why don't you make `mystring` a `unicode` object in the first place: `mystring = u"Bientôt l'été"` – Janne Karila Feb 11 '13 at 12:38
  • @user1464409: So you want to *decode* instead. The `.encode()` does not make sense, and could very well be the *cause* of your problem. – Martijn Pieters Feb 11 '13 at 12:40
  • Janne, doing this does work but I don't know how to make `mystring` a unicode object. I know I can do it with `mystring = u"help!"` but I don't know how to do this with when "help!" is obtained from another variable or dictionary for example. I tried unicode() but this didn't seem to work and I don't know if this was the right thing to do or not. – user1464409 Feb 11 '13 at 12:53
  • @user1464409: see my answer, I added how to decode to `unicode` already. – Martijn Pieters Feb 11 '13 at 12:53
  • Martijn, I tried to `.decode()` instead and I came up with this error when using another string: `UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 8: ordinal not in range(128)` when I tried this string: `Bookworm™ Adventures`. It is clearly complaining about the "TM" character but I am really confused. – user1464409 Feb 11 '13 at 12:55
  • @user1464409: That's an *encode* error; are you printing the output of `.decode()`? Python then tries to encode the `unicode` object again to match your terminal, that doesn't always work. The *decode* worked. – Martijn Pieters Feb 11 '13 at 12:55
  • No, I'm not trying to print it out, this is the error I am getting when I try to insert into the database. I think I am destined never to understand this. – user1464409 Feb 11 '13 at 13:02
  • @user1464409: Ah, you need to read my answer more *closely*. I didn't say `.decode()`, I said `.decode('utf8')`.. – Martijn Pieters Feb 11 '13 at 13:04
  • I did read your answer closely, I was just abbreviating my response (sorry!). I have just had a "Eureka!" moment though. The answer is I also need to set the encoding being used in MySQLdb like this: `con.set_character_set('utf8') cur.execute('SET NAMES utf8;') cur.execute('SET CHARACTER SET utf8;') cur.execute('SET character_set_connection=utf8;')` – user1464409 Feb 11 '13 at 13:07
  • @user1464409: I was assuming you already had that set, since you claimed that `INSERT` for a `unicode` literal (`u'something'`) was working, but I now see that you only used ASCII codepoints in that value. – Martijn Pieters Feb 11 '13 at 13:11
1
<?php
//Set Beginning of php code:
header("Content-Type: text/html; charset=UTF-8");
mysql_query("SET NAMES 'utf8'"); 
mysql_query('SET CHARACTER SET utf8');

//then create the connection 
$CNN=mysql_connect("localhost","usr_urdu","123") or die('Unable to Connect');
$DB=mysql_select_db('db_urdu',$CNN)or die('Unable to select DB');
Naeem
  • 7
  • 1