
First of all, I am pretty new to Python, so forgive me for the n00b stuff. The application logic in Python goes like this:

  1. I send an SQL SELECT to the database and it returns an array of data.
  2. I need to take this data and use it in another SQL INSERT statement.

Now the problem is that the SQL query returns unicode strings. The output from the select is something like this:

(u'Abc', u'Lololo', u'Fjordk\xe6r')

So first I tried to convert each element to a string, but that fails because the third element contains the 'æ' (ae) letter:

for x in data[0]:
    str_data.append(str(x))

I am getting: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in position 6: ordinal not in range(128)

I can't pass the unicode values straight into the insert either, because then a TypeError occurs: TypeError: coercing to Unicode: need string or buffer, NoneType found

Any ideas?

Erki M.

2 Answers

7

In my experience, Unicode handling in Python is a frequent source of problems.

Generally speaking, if you have a Unicode string (`unicode` in Python 2), you can convert it to a normal byte string (`str`) like this:

normal_string = unicode_string.encode('utf-8')

And convert a normal string to a Unicode string like this:

unicode_string = normal_string.decode('utf-8')
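As a minimal sketch of that round trip, here is the tuple from the question encoded to UTF-8 and decoded back. It assumes UTF-8 is the right target encoding, and it is written so that it should run under both Python 2 and Python 3 (where the byte string type is called `bytes`). The variable names are only illustrative.

```python
# Sample row, as returned by the SELECT in the question.
row = (u'Abc', u'Lololo', u'Fjordk\xe6r')

# Encode every unicode value to a UTF-8 byte string.
encoded = [value.encode('utf-8') for value in row]

# Decode the byte strings back to unicode values.
decoded = [raw.decode('utf-8') for raw in encoded]

# The round trip is lossless: we get the original values back.
assert tuple(decoded) == row
```

Note that 'æ' becomes two bytes (`\xc3\xa6`) in UTF-8, so the encoded string is one byte longer than the unicode string has characters.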
Mezgrman
  • `'utf-8'` is usually the right choice, but not always. You should use the same character set that your database is configured for. – Mark Ransom May 22 '13 at 18:05
  • OK, I finally found how to force Python to use UTF-8 by default: `import sys; reload(sys); sys.setdefaultencoding("UTF-8")` (the `reload(sys)` is there to make `setdefaultencoding` available; IDK why). – Erki M. May 22 '13 at 19:26
  • This sounds useful. I'll try it out too! – Mezgrman May 22 '13 at 20:09
4

The issue here is that the str function tries to encode the unicode value using the ascii codec, and ascii has no mapping for u'\xe6' (æ).

Therefore you need to encode it with a codec that supports the character. Nowadays the most common choice is UTF-8:

>>> x = (u'Abc', u'Lololo', u'Fjordk\xe6r')
>>> print x[2].encode("utf8")
Fjordkær
>>> x[2].encode("utf-8")
'Fjordk\xc3\xa6r'

Alternatively, you can encode it to cp1252, the Western European (Latin) code page, which also supports this character:

>>> x[2].encode("cp1252")
'Fjordk\xe6r'

But the Eastern European code page cp1250 doesn't support it:

>>> x[2].encode("cp1250")
...
UnicodeEncodeError: 'charmap' codec can't encode character u'\xe6' in position 6: character maps to <undefined>
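The same comparison can be sketched as a small check that reports which codecs can represent a given string. `can_encode` is a hypothetical helper name, not something from the standard library; it only wraps `encode` and catches the `UnicodeEncodeError` shown above. It should run under both Python 2 and Python 3.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

word = u'Fjordk\xe6r'

# Check the codecs discussed above, plus plain ascii.
support = dict((codec, can_encode(word, codec))
               for codec in ('ascii', 'utf-8', 'cp1252', 'cp1250'))

for codec in sorted(support):
    print('%-7s %s' % (codec, support[codec]))
```

UTF-8 can represent every Unicode character, which is why it is the safe default when you control the target encoding.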

Problems with unicode in Python are very common, and I would suggest the following:

  • understand what Unicode is
  • understand what UTF-8 is (it is an encoding of Unicode, not Unicode itself)
  • understand ascii and other code pages
  • recommended conversion workflow: input (any code page) -> convert to unicode -> (process) -> output as UTF-8
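The workflow in the last point could look roughly like the sketch below. It assumes the input arrives as cp1252 bytes (in practice, use whatever encoding your source really uses), and `to_text` is a hypothetical helper name. It should run under both Python 2 and Python 3.

```python
def to_text(raw, source_encoding):
    """Decode a byte string to unicode; pass unicode values through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(source_encoding)
    return raw

# 1. Input: bytes in some code page (cp1252 is an assumption here).
incoming = b'Fjordk\xe6r'

# 2. Convert to unicode as early as possible.
text = to_text(incoming, 'cp1252')

# 3. Process: all string operations work on unicode, not on bytes.
processed = text.upper()

# 4. Output: encode to UTF-8 only at the boundary.
outgoing = processed.encode('utf-8')
```

Keeping everything unicode between steps 2 and 4 means that operations such as `upper()` see whole characters rather than the individual bytes of an encoding.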
Robert Lujo