I've received a Unicode string from the wild that causes some of our psycopg2 statements to fail.
I have reduced the problem to an SSCCE:
import psycopg2
conn = psycopg2.connect(...)
cur = conn.cursor()
x = u'\ud837'
cur.execute("SELECT %s AS test", (x,))
print cur.fetchone()
Running this gives the following exception:
Traceback (most recent call last):
  File ".../run.py", line 65, in <module>
    cur.execute("SELECT %s AS test", (x,))
psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb7
Based on some of the comments, it has become clear that this particular character is one half of a surrogate pair, so it is not valid on its own.
Specifically, then, I am looking for a mechanism to detect when a string contains an incomplete surrogate pair in Python 2.
One method I have found that raises an exception for such a string is x.encode('utf16').decode('utf16'). However, since I don't fully understand the risks involved, I am hesitant to rely on it.
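For comparison, a regex-based check is another possibility. The following is only a minimal sketch, assuming a narrow Python 2 build in which characters outside the BMP are stored as surrogate pairs; the helper name has_lone_surrogate and the pattern name _LONE_SURROGATE are made up for illustration and do not come from psycopg2 or the standard library.

```python
import re

# Matches a high surrogate that is not followed by a low surrogate,
# or a low surrogate that is not preceded by a high surrogate.
# Assumption: a narrow build, where valid astral characters appear
# as a high/low surrogate pair and must therefore not be flagged.
_LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'
)


def has_lone_surrogate(text):
    """Hypothetical helper: True if text contains an unpaired surrogate."""
    return _LONE_SURROGATE.search(text) is not None
```

With this sketch, has_lone_surrogate(u'\ud837') returns True, while a plain string such as u'abc' returns False. On a wide build or on Python 3, where any surrogate code point in a string is already suspect, the simpler pattern u'[\ud800-\udfff]' would probably be the better choice.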
Edit: Reduced the SSCCE string to the single character causing the problem and added information based on the comments.