4

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.

Is there a more elegant way to do that? Actually it looks like that (not completed yet):

SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;

On my way trying different commands I got another problem:

When I searched for Ü I got this:

ERROR: invalid byte sequence for encoding "UTF8": 0xdc27

Tried it with U&'\0220', it didn't replace anything. Only by using ü (for lowercase ü) it was replaced correctly. Has to do something with unicode, but how to solve this issue?

Kind regards from Germany. :)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Stefan
  • 41
  • 1
  • 2

2 Answers2

3

Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:

SHOW client_encoding;   -- in your actual session

And read this related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL

The rest of the tool chain has to be in sync, too. When using puTTY, for instance, one has to make sure, the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.

As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.

As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.

Related:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • I'm not so sure about the client/server encoding mismatch. In my experience, those kinds of mapping failures usually give "character has no equivalent" errors. "Invalid byte sequence" sounds more like the `client_encoding` is set to UTF8, but the client program is still sending ANSI data. – Nick Barnes Jul 10 '14 at 14:19
  • @NickBarnes: Well, the client sends data encoded according to the `clent_encoding`. `client_encoding` goes both ways. So we are probably talking about the same thing here. – Erwin Brandstetter Jul 10 '14 at 14:27
  • 1
    Not quite. The server interprets the client's data according to the `client_encoding`, but the client can send whatever it wants. For example, if I fire up `psql` in Windows, it defaults to WIN1252. If I run `SET client_encoding TO 'UTF8'` and `SELECT 'Ü'`, I get an "invalid byte sequence" error. `psql` has no idea that anything has changed; it's still sending its data as ANSI. – Nick Barnes Jul 10 '14 at 14:46
  • @NickBarnes: Good point. My description in the previous comment was not quite correct. When using puTTY, for instance, one has to make sure, the terminal agrees with the rest ... `Change settings... Window -> Translation -> Remote character set: UTF-8` – Erwin Brandstetter Jul 10 '14 at 15:04
  • 1
    @Erwin: That's the point. i am using SuperPutty and had to change character set to UTF-8. Though the problem should be solved - but: Select an address with an umlaut inside the street, the umlaut still isn't displayed correctly. Instead I am getting `Ã` instead of `ß`. However, my problem is solved. Thank y'all! – Stefan Jul 10 '14 at 15:29
  • @Stefan: Once you have entered incorrect data in your tables, then it's there for good until you fix it. – Erwin Brandstetter Jul 10 '14 at 15:40
0

Beside other reasons I decided to write the replacement in python. Like Erwin wrote before, it seems there is no better solution as combining replace- commands.

In general pretty simple, even no encoding had to benn used. My "final" solution now looks like this:

ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"

dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"

cur.execute("""Select addr, REPLACE (REPLACE (REPLACE( REPLACE (REPLACE (REPLACE (REPLACE(addr, '%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
  from search WHERE x = '1';"""%(ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))

I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.

lfurini
  • 3,729
  • 4
  • 30
  • 48
Stefan
  • 41
  • 1
  • 2