
I have an Arabic string, say:

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

I want to write this text into a MySQL database with the Arabic preserved. I tried using

txt = smart_str(txt)

or

txt = txt.encode('utf-8')

Neither of these worked, as they converted the string to

u'Arabic (\xd8\xa7\xd9\x84\xd8\xb7\xd9\x8a\xd8\xb1\xd8\xa7\xd9\x86)' 

Also, my database character set is already set to utf8:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Because of these new byte sequences, my database displays the characters of the encoded text instead of the Arabic. Please help. I want my Arabic text to be retained.

Also, does a quick export of this Arabic text from the MySQL database write the same Arabic text into files, or will it convert it back to unicode?

I used the following code to insert:

cur.execute("INSERT INTO tab1(id, username, text, created_at) VALUES (%s, %s, %s, %s)", (smart_str(id), smart_str(user_name), smart_str(text), date))

Before this, when I didn't use smart_str, it threw an error saying only 'latin-1' is allowed.

kkoe

2 Answers


A few clarifications first, because they will help you in the future as well.

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

This is not an Arabic string. This is a unicode object, made of unicode code points. If you were to simply print it, and your terminal supports Arabic, you would get output like this:

>>> txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'
>>> print(txt)
Arabic (الطيران)

Now, to get the same output, Arabic (الطيران), in your database, you need to encode the string.

Encoding means taking these code points and converting them to bytes, so that computers know what to do with them.
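As a small illustration of that step (a sketch added for clarity, not part of the original answer; the names `data` and `restored` exist only for this example), encoding the question's string to UTF-8 and decoding it again gives back the same code points. The `\xd8\xa7...` sequence is simply how the Arabic letters look as UTF-8 bytes:

```python
# The string from the question: 16 code points, 7 of them Arabic letters.
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# Encode: code points -> bytes. Each Arabic letter becomes two bytes in UTF-8.
data = txt.encode('utf-8')

# Decode: bytes -> code points. Nothing is lost on the round trip.
restored = data.decode('utf-8')

print(repr(data))       # shows the \xd8\xa7... byte sequence
print(len(txt))         # 16 code points
print(len(data))        # 23 bytes
print(restored == txt)  # True
```

So seeing `\xd8\xa7` in a repr does not by itself mean the text was damaged; it is the byte-level view of the same letters.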

The most common encoding is utf-8, because it supports all the characters of English plus a lot of other languages (including Arabic). There are others too; for example, windows-1256 also supports Arabic. Some encodings have no entries for these numbers (called code points), and when you try to encode with one of them, you'll get an error like this:

>>> print(txt.encode('latin-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-14: ordinal not in range(256)

What that is telling you is that some number in the unicode object does not exist in the latin-1 table, so the program doesn't know how to convert it to bytes.
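To see which encodings have entries for these code points, here is a minimal sketch. The helper `can_encode` is hypothetical and written only for this illustration; it tries each codec in turn and reports whether the encode succeeds:

```python
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'


def can_encode(text, codec):
    """Return True if every code point in text has an entry in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# latin-1 has no Arabic letters; windows-1256 and utf-8 do.
for codec in ('latin-1', 'windows-1256', 'utf-8'):
    print(codec, can_encode(txt, codec))
```

Only latin-1 fails here, which matches the error the question reports when the text is sent without encoding it first.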

Computers store bytes, so when storing or transmitting information you always need to encode/decode it correctly.

This encode/decode step is sometimes called the unicode sandwich - everything outside is bytes, everything inside is unicode.
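The sandwich can be sketched with a plain file standing in for the outside world. This is an illustration under assumptions, not the asker's setup: the scratch file `arabic.txt` and the names `raw` and `loaded` are made up for the example. The text is unicode inside the program, bytes on disk, and unicode again after reading:

```python
import io
import os
import tempfile

txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# Hypothetical scratch file, used only for this example.
path = os.path.join(tempfile.mkdtemp(), 'arabic.txt')

# Outgoing edge: unicode is encoded to UTF-8 bytes as it is written.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(txt)

# Outside the program there are only bytes.
with io.open(path, 'rb') as f:
    raw = f.read()

# Incoming edge: bytes are decoded back to unicode as they are read.
with io.open(path, 'r', encoding='utf-8') as f:
    loaded = f.read()

print(raw == txt.encode('utf-8'))  # True
print(loaded == txt)               # True
```

The same reasoning applies to a file exported from the database: as long as it is written and read with the same encoding, the Arabic text comes back unchanged.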


With that out of the way, you need to encode the data correctly before you send it to your database:

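import MySQLdb

q = u"""
    INSERT INTO
       tab1(id, username, text, created_at)
    VALUES (%s, %s, %s, %s)"""

# charset and SET NAMES make the connection itself use UTF-8
conn = MySQLdb.connect(host="localhost",
                       user='root',
                       password='',
                       db='',
                       charset='utf8',
                       init_command='SET NAMES UTF8')
cur = conn.cursor()
# id, user_name and text are assumed to be unicode objects here
cur.execute(q, (id.encode('utf-8'),
                user_name.encode('utf-8'),
                text.encode('utf-8'), date))
conn.commit()  # MySQLdb does not autocommit by default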

To confirm that it is being inserted correctly, make sure you are using mysql from a terminal or application that supports Arabic; otherwise, even if it's inserted correctly, you will see garbage characters when it is displayed by your program.
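The garbage characters can be reproduced without a database. The sketch below is only an illustration: it assumes a viewer that wrongly treats the stored UTF-8 bytes as latin-1, and the names `stored`, `garbled`, and `recovered` are made up for the example:

```python
txt = u'Arabic (\u0627\u0644\u0637\u064a\u0631\u0627\u0646)'

# What a correct UTF-8 insert leaves in storage.
stored = txt.encode('utf-8')

# A viewer that assumes latin-1 shows one wrong character per byte.
garbled = stored.decode('latin-1')

# The data itself is intact: undo the wrong decode, then decode as UTF-8.
recovered = garbled.encode('latin-1').decode('utf-8')

print(garbled == txt)    # False: the display is wrong
print(recovered == txt)  # True: the stored bytes were fine
```

If the text looks garbled in one client but correct in another that is set to UTF-8, the stored data is probably fine and only the display encoding is off.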

Burhan Khalid

Just execute SET NAMES utf8 before executing your INSERT:

cur.execute("set names utf8;")

cur.execute("INSERT INTO tab1(id, username, text, created_at) VALUES (%s, %s, %s, %s)", (smart_str(id), smart_str(user_name), smart_str(text), date))

Your question is very similar to this SO post, which you should read.

Tim Biegeleisen
  • Hi sir, thanks for the reply. As I mentioned earlier, I can see the utf-8 text in my database, but that utf-8 text is not Arabic. – kkoe Dec 03 '15 at 04:50
  • When I use smart_str(), it converts \u0627, which is Arabic, to \xd8, which is something else. – kkoe Dec 03 '15 at 04:51
  • Just insert the raw Arabic. No need to convert it to unicode. – Tim Biegeleisen Dec 03 '15 at 04:52
  • When I input the raw text without using smart_str() then it throws >>UnicodeEncodeError: 'latin-1' codec can't encode character – kkoe Dec 03 '15 at 04:53
  • Sir, Can you please help – kkoe Dec 03 '15 at 05:01
  • Read [this Stack Exchange DBA article](http://dba.stackexchange.com/questions/87385/cant-insert-arabic-text-into-mysql-database-using-mysql-prompt). You can switch your dev environment to support inputting Arabic. After this, my answer should do the trick. – Tim Biegeleisen Dec 03 '15 at 05:16