1

Some text is being fetched from an nvarchar column on a SQL Server database using SQLAlchemy and then written to a Postgres database with column type character varying. Python throws the following exception when the source data contains the character :

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d'
in position 34: surrogates not allowed

The SQLAlchemy column type is set to String. I tried setting the column type to Unicode and separately set the column collation to SQL_Latin1_General_CP1_CI_AS, with no luck. The driver being used is FreeTDS.

Why can't Python encode this string? Does the problem lie with our use of SQLAlchemy, Python or Postgres? This error is making me .

Petrus Theron
  • 27,855
  • 36
  • 153
  • 287

1 Answers1

1

The codepoint \U1f615 () can be represented by the two surrogates \ud83d and \ude15. Somehow your SQL-Server, which uses internally UTF16 was decoded as UCS2, so that the surrogates are not properly decoded. So the problem is the SQL-Server. If you cannot correctly read the data, you have to manually correct the unicode strings, like so (python3):

def surrogate_to_unicode(sur):
    a, b = sur
    return chr(0x10000 + ((ord(a)-0xd800)<<10) + (ord(b)-0xdc00))

text = '\ud83d\ude15'
text = re.sub('[\ud800-\udbff][\udc00-\udfff]', lambda g:surrogate_to_unicode(g.group(0)), text)
Daniel
  • 42,087
  • 4
  • 55
  • 81