0

I can't seem to store unicode characters correctly in Postgres. They are shown as code representations, e.g. <C3><A5> instead of å for example.

My database was created with UTF-8 as the encoding. I have tried storing the unicode strings with psycopg2 like so:

field = myUnicodeString.encoding('utf-8')
cursor.execute("INSERT INTO mytable (column1) VALUES (%s)", (field))

field = myUnicodeString
cursor.execute("INSERT INTO mytable (column1) VALUES (%s)", (field))

but both alternatives stores incorrect characters. Do I need to set the charset for the table as well, or what can the problem be here?

UPDATE 1:

I have discovered that I can't even type non-ascii characters – like å, ä and ö – in my terminal. I'm on an Ubuntu 12.04 server. Could this in any way be related to the language settings of the server itself?

UPDATE 2

I am now able to type non-ascii characters in my terminal during a SSH session. I changed the locale settings and rebooted the server. Moreover, I am able to manually store non-ascii characters in my UTF-8 database (in psql: INSERT INTO table (column) VALUES ('ö')). The char is displayed correctly in psql.

When I SELECT convert_to(column, 'utf-8') FROM table with the manually inserted row in the table, the char ö is displayed as \xc383c2b6 in psql.

When I do print repr('ö') in Python, I get '\xc3\xb6'. I'm really trying to understand how to debug this, but I'm not sure what to look for.

  • A couple easy things to check: if you type e.g. `psql -l`, does the database you're using really does use UTF8, and also if you print the prepared statement via `cursor.mogrify("INSERT ...", (field))`, are the unicode characters printed correctly? – CmdrMoozy Dec 23 '13 at 21:34
  • Yes, the database is indeed configured to use UTF-8 as its encoding. And the result of mogrify displays all characters with their proper encoding. –  Dec 23 '13 at 21:39
  • Funny thing is, I can't even manually insert characters like, å, ä and ö. Inside psql, those characters are not printed to the screen. –  Dec 23 '13 at 21:49
  • When you say "they are shown as ...", shown by whom? – Peter Eisentraut Dec 23 '13 at 22:01
  • Shown by the postgres user quyering the table inside psql. –  Dec 23 '13 at 22:03
  • What is your terminal? Does it support and is it configured to display characters encoded as UTF-8 and have you told `psql` that it can? – CB Bailey Dec 24 '13 at 00:28
  • I'm using bash. I can type the characters in local sessions, but not when I'm ssh-ing to my server. I don't know how to set up the kind of configuration you're mentioning. What would be the first thing to try? –  Dec 24 '13 at 00:32
  • 1
    This question might be helpful: http://stackoverflow.com/questions/2038733/how-do-i-change-a-shell-scripts-character-encoding – CmdrMoozy Dec 24 '13 at 01:32
  • You've left out some crucial info: In particular, PostgreSQL version, `SHOW server_encoding` or `\l+` output in psql, `locale` command output in your terminal, `psycopg2` version. Also, "they are shown" by what, where, how? – Craig Ringer Dec 24 '13 at 02:57
  • Hi, postgres version is 9.1.10, `SHOW server_encoding` returns UTF8, psycopg2 version is 2.4.5 and the locale settings, well, I'm not sure about those. I've been fiddling around with them and I know at least LANG and LANGUAGE are now set to sv_SE. –  Dec 24 '13 at 15:12

1 Answers1

2

It isn't clear that you've confirmed whether the characters are stored incorrectly, or retrieved and displayed incorrectly. Nor is it clear whether the problem is in working with them in PostgreSQL, or before that in Python.

In this case, "å" is unicode code-point U+00E5, encoded in utf-16 BE as 0x00E5 or in utf-8 as 0xc3 0xa5 . That matches what you're seeing - a utf-8 byte sequence representation of "å" - so I suspect your terminal is misconfigured and just can't display it, or is trying to interpret it as latin-1 and doesn't have the right characters in the font for the resulting mangled text:

>>> print u'å'.encode("utf-8").decode("latin-1")
Ã¥

so it's showing the codepoints instead.

It doesn't help that your Python code is nonsense:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'encoding'

I think you meant "encode". It's unnecessary to do this anyway; Python's psycopg2 happily works with unicode string objects directly:

>>> conn = psycopg2.connect("dbname=regress")
>>> curs = conn.cursor()
>>> curs.execute("SELECT %s", (u'áéíóú',));
>>> print curs.fetchone()[0]
áéíóú

With encoding problems you need to trace things through, step by step, to determine where the text encoding is being mishandled.

There's nowhere near enough information to answer a question like this. All I can really offer is general advice. At every step, confirm that you respect the encoding of the input, and that the output from one step is in the same encoding the next step expects as input.

First, you need to make sure that your unicode strings are correct in Python. print repr(mystring) will be useful for this, to see the string data. Then you should stop explicitly encoding them when passing them to psycopg2; just let psycopg2 deal with it.

Next step will be to examine them in the database using psql. Even if they don't display correctly on your terminal you can check if they're right in the database with the convert_to function, which takes a database field or string literal as input and outputs the byte sequence in the desired encoding. So, eg:

SELECT convert_to(column1, 'utf-8') FROM mytable;

and make sure that the hex byte sequence returned matches what it should be for the utf-8 encoding of the text you sent.

Continue with this process. At each step, examine the string bytes to make sure they match what they should be, until you find the stage that's mishandling the text.

I assure you that neither PostgreSQL nor psycopg2 do store Unicode characters incorrectly. In this case it could be as simple an issue as your terminal being set up wrong, or it could be that something in the text handling path is using the wrong input encoding, so you encode something as utf-8 then decode it as latin-1 (for example).

Craig Ringer
  • 307,061
  • 76
  • 688
  • 778
  • Hi, thanks a lot for your tips. After fiddling with the locale settings of the server, I'm able to type the characters å, ä, and ö in my SSH session. I verified that the unicode strings in Python are correct. However, using the `convert_to` function in psql, I found that the characters are not stored correctly. Char `ö` is encoded as `c3b6` which is something completely different. Note that querying the database now returns `?` instead of ``, for example. That is, that's what I see in the terminal in psql. Not sure what to do now. –  Dec 24 '13 at 15:07
  • I tried `sv_SE` and `sv_SE.UTF-8` as locale settings. But it should have been `sv_SE.utf8`. Everything works now. Thank you. –  Dec 24 '13 at 18:26