I am pulling down a database table and trying to read it into Python like so:

with query(full_query_string) as cur:
    arr = cur.fetchall()

This produces the following error from the fetchall():

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 4: invalid continuation byte

If I select * I get this error, whereas if I limit the query to a small number of rows, I don't. I tried playing around with a few encodings following the SO post "UnicodeDecodeError, invalid continuation byte", but none of them did the trick. In a large database table where I don't know how the encoding could have gone wrong, what's the most efficient way to deal with this? No specific row is a must-have, but I'd rather get all the rows other than the ones that have this encoding problem.

helloB

1 Answer

Hey, I know this is a super-late answer, but while debugging a similar issue I found this fix in the vertica-python README:

While Vertica expects varchars stored to be UTF-8 encoded, sometimes invalid strings get into the database. You can specify how to handle reading these characters using the unicode_error connection option. This uses the same values as the unicode type (https://docs.python.org/2/library/functions.html#unicode)

Try changing the 'unicode_error' key in your connection params from strict to either replace or ignore:

import vertica_python

# 'strict' raises on invalid UTF-8
cur = vertica_python.Connection({..., 'unicode_error': 'strict'}).cursor()
cur.execute(r"SELECT E'\xC2'")
cur.fetchone()
# caught 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data

# 'replace' substitutes U+FFFD for each invalid byte
cur = vertica_python.Connection({..., 'unicode_error': 'replace'}).cursor()
cur.execute(r"SELECT E'\xC2'")
cur.fetchone()
# �

# 'ignore' silently drops invalid bytes
cur = vertica_python.Connection({..., 'unicode_error': 'ignore'}).cursor()
cur.execute(r"SELECT E'\xC2'")
cur.fetchone()
#
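
If you want to see what the three handlers do without touching a database, here is a minimal sketch using plain `bytes.decode`. The sample bytes are made up to trigger the same "invalid continuation byte" error as in the question. I'm assuming the driver passes the option through to Python's standard error handlers, which is what the README quote suggests.

```python
# Hypothetical sample: 0xC3 opens a two-byte UTF-8 sequence, but "(" is not a
# valid continuation byte, so strict decoding fails at position 4.
raw = b"abcd\xc3("


def decode_with(handler: str) -> str:
    """Decode the sample bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=handler)


try:
    decode_with("strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc  # keep the exception so it can be inspected

replaced = decode_with("replace")  # invalid byte becomes U+FFFD
ignored = decode_with("ignore")    # invalid byte is dropped

print(strict_error)
print(repr(replaced))
print(repr(ignored))
```

Note that 'replace' and 'ignore' keep the row and only alter the bad bytes, so you get every row back, just with lossy text in the affected values.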
respondcreate