1

The database I'm working with has sometimes UTF-8 encoded surrogate pairs, for example:

SELECT id, SUBSTR(description, 72, 4) as str, RAWTOHEX(SUBSTR(description, 72, 4)) as hex
FROM table WHERE id = 13;

returns

ID str hex
13 -S EDA0B5EDBBBD2D53

When I try to retrieve this entry, I'm getting UnicodeDecodeError as this is not valid UTF-8 sequence. How can I convince cx_Oracle (I have to use 5.2.1 version for now) to unmangle this sequence (I know that Django fixes this error somehow, but don't see how it's done).

  • You are using Oracle DB, and the feild is NVARCHAR? – Nir Elbaz Dec 31 '20 at 10:59
  • Possibly related: https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python – tripleee Dec 31 '20 at 11:08
  • Can you give the original `description` string value, for reference? There are no surrogate pairs in UTF-8. Surrogate pairs are a feature of UTF-16, where it's sometimes necessary to use two byte pairs to encode one character. For example, the `PILE OF POO` character () becomes the sequence `0xD83D 0xDCA9` in UTF-16. This does not happen in UTF-8, which has a variable byte length, and this character becomes `0xF0 0x9F 0x92 0xA9`. This happens to be 4 bytes long, but it's not a surrogate pair. It's just a regular UTF-8 encoded character. – Tomalak Dec 31 '20 at 11:20
  • I've edited the example. I know that surrogate pairs are the feature of UTF-16, it's just that I encounter UTF-8 characters that after decoding are surrogate pairs, in this case this decodes to `'\ud835\udefd-S'`, that redecoded in UTF-16 gives value in `str` field. I find this issue in VARCHAR2 and CLOB fields. – Dominik Murzynowski Dec 31 '20 at 13:39

1 Answers1

0

Use cx_Oracle 8.x which uses UTF-8 as the default encoding for all character data. If you are not using cx_Oracle 8.x you need to connect as follows:

connection = cx_Oracle.connect("user/pw@host/service_name", encoding="UTF-8", nencoding="UTF-8")

If you are already using cx_Oracle 8.x, then more information is required. Take a look at this documentation and add the information you get into your question, if needed!

Anthony Tuininga
  • 6,388
  • 2
  • 14
  • 23
  • I have to use 5.2.1 version (updated question). I have tried to set encoding and nencoding, as suggested, but that doesn't seem to make any difference, still getting `UnicodeDecodeError`. – Dominik Murzynowski Jan 04 '21 at 10:28
  • Please include information on the database column type. Also include the code you are using to fetch the data and display it. It is also useful to print out the value of `connection.encoding` and `connection.nencoding`. – Anthony Tuininga Jan 04 '21 at 18:14
  • It would also be useful to explain why you still need to use cx_Oracle 5.2.1? You can also use a newer version (like 8.x) to determine if you still have the issue with that version. – Anthony Tuininga Jan 04 '21 at 18:15