Postgres - character sets and encodings

Question

I was wondering if someone can help me understand what's going on/wrong with my Postgres data please...

I'll explain things below - but I guess ultimately the questions I have are :

What characterset/encoding should I be using (i.e. what is best practise)?
IF the answer is UTF8, then will certain characters (e.g. UK pound symbols) always look "funny" in the database?

I've got a database that has a table with data about flights (although obviously it could be anything really), defined as follows...

CREATE TABLE public.flight (
    flightid integer DEFAULT nextval('public.flight_seq'::regclass) NOT NULL,
    tripid integer NOT NULL,
    flightdatedeparted date NOT NULL,
    flightairportdeparted text NOT NULL,
    flightairportarrived text NOT NULL,
    flightairline text NOT NULL,
    flightdetails text,
    flightdayflightnumber integer DEFAULT 1 NOT NULL,
    flightdistance numeric
);

Now, when I enter data into it via a web front end connected to this database then I end up with data something like...

holidayinfo=# select distinct * from flight where flightid=97;
-[ RECORD 1 ]---------+---------------------------------------
flightid              | 97
tripid                | 36
flightdatedeparted    | 2004-05-14
flightairportdeparted | LHR
flightairportarrived  | WAW
flightairline         | British Airways
flightdetails         | Hotline, ┬ú82.40, BA850, 13:40 -> 17:05
flightdayflightnumber | 1
flightdistance        | 912.7

However, the data that I'd entered into the web form for the field "flightdetails" was actually...

Hotline, £82.40, BA850, 13:40 -> 17:05

Now, when I dump the data and look at it in Notepad++, depending on what encoding I use then sometimes I see it correctly as the pound symbol (when I choose ANSI) and other times it's incorrect as xA3 (when I choose UTF8).

At least when it's stored in Postgres as the "funny" value then it also displays correctly on my webpage when I retrieve the data - so that's good.

If I try to manually update the value via psql then I get the following...

holidayinfo=# update flight set flightdetails='Hotline, £82.40, BA850, 13:40 -> 17:05' where flightid=97;
ERROR:  invalid byte sequence for encoding "UTF8": 0x9c

In terms of how my database is created and what client encoding its using then I've got the following...

holidayinfo=# \l
                                                   List of databases
    Name     |  Owner   | Encoding |           Collate           |            Ctype            |   Access privileges
-------------+----------+----------+-----------------------------+-----------------------------+-----------------------
 holidayinfo | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 leagueinfo  | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 postgres    | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 |
 template0   | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres          +
             |          |          |                             |                             | postgres=CTc/postgres
 template1   | postgres | UTF8     | English_United Kingdom.1252 | English_United Kingdom.1252 | =c/postgres          +
             |          |          |                             |                             | postgres=CTc/postgres
(5 rows)


holidayinfo=# show client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

Maybe this is all working as designed, but I'm just confused as to how things should be?

Ultimately, I'd love to be able to have the data stored so that I can see it as the pound sign AND be entered/retrieved/displayed as the pound sign.

The former is desirable so that if ever I need to look at the data then I can see what the real data is - not have to make assumptions on what character "┬ú" really means.

Also, this problem scales up when there are other characters having the same "issue" such as a hyphen (-) showing as "ÔÇô" and or an apostrophe (') showing as "ÔÇÖ".

Thanks in advance!

A flagrant [mojibake](https://en.wikipedia.org/wiki/Mojibake) case e.g. `'£'.encode('utf-8').decode('cp437')` returns `'┬ú'` or `"ÔÇô".encode('cp850').decode('utf-8')` returns `'–'` (the latter character `–` is `U+2013` _En Dash_); examples in Python. Read and follow [UTF-8 Everywhere](https://utf8everywhere.org/) and/or [UTF-8 all the way through](https://stackoverflow.com/questions/279170/). And `'ÔÇÖ'.encode('cp852').decode('utf-8')` returns `'’'` (U+2019 _Right Single Quotation Mark_).. — JosefZ, Apr 11 '21 at 14:12

score 2 · Answer 1 · answered Apr 13 '21 at 06:42

2

You must be viewing the data with psql using cmd.exe with code page CP-850.

The data in your database are wrong, because the application that inserted them had client_encoding set to WIN1252 while feeding the database UTF-8 characters.

So £, which is 0xC2A3 in UTF-8, is interpreted as two characters, namely Â (0xC2) and £ (0xA3). They are converted to UTF-8 and stored in the database as 4 bytes (0xC382 and 0xC2A3). When you view them with psql, they are converted back to WINDOWS-1252, but cmd.exe interprets them as CP-850 and renders them as ┬ú.

The fix is to change client_encoding to UTF8 in the application that inserts the data into the database.

answered Apr 13 '21 at 06:42

Laurenz Albe

209,280
17
206
263

Thanks! That coincides with other stuff I'd found and managed to get things working... mostly. So, the main thing that got it working on Win10 with Postgres 13 was to edit the psql startup script (C:\Program Files\PostgreSQL\13\scripts\runpsql.bat) to add 2 lines "SET PGCLIENTENCODING=utf-8" and "chcp 65001", before the final line that starts psql.exe. Then when the cleaned up data (load the SQL_ASCII dump from PG11 into Notepad++, change to ANSI encoding, global replace the bad chars, convert to UTF8, save) was imported it viewed ok in psql. – darrensunley Apr 13 '21 at 16:41
I decided to try to then export the Win10 PG13 UTF8 data and import it into a new UTF8 database in the Win7 PG11 environment, then if that worked I planned to drop the SQL_ASCII database and be gone with this nightmare... that was the plan anyway! Unfortunately, while the data imports fine into the newly created PG11 UTF8 database, the Win7 psql has the same viewing problem - BUT the same solution doesn't work. Is there some other difference between Win7 and Win10 that would/could affect things? – darrensunley Apr 13 '21 at 16:45
That is version independent. In your comments you are suddenly referring to `SQL_ASCII`, and you move data to and fro - I cannot tell from your brief description where you went wrong. But you understand that it was not (only) a `cmd.exe` codepage problem, but the data in your database were wrong. What I do to figure out what bytes are in the database is cast the string to `bytea`. – Laurenz Albe Apr 14 '21 at 02:28
Sorry - I only added the bit about SQL_ASCII in the comment as when I posted the question I hadn't realised the significance of it. I've learned a lot whilst reading many (!) articles about this. I'm confident that the data problem has now been solved - but as your 1st reply mentioned, the issue was that psql wasn't displaying the (corrected) data appropriately. The updated script fixed this in the new Win10/PG13 env, but didn't work in the old Win7/PG11 env. Ultimately I'll be moving away from the old env - but it's more of a personal challenge now to figure out why it's behaving differently! – darrensunley Apr 14 '21 at 07:58
Ah, I see. Yes, there will be no encoding conversion to or from an `SQL_ASCII` database. – Laurenz Albe Apr 14 '21 at 08:00

Postgres - character sets and encodings

1 Answers1