
I am working with an existing SQLite database and hitting errors because the data is encoded in CP-1252, while Python expects it to be UTF-8.

>>> import sqlite3
>>> conn = sqlite3.connect('dnd.sqlite')
>>> curs = conn.cursor()
>>> result = curs.execute("SELECT * FROM dnd_characterclass WHERE id=802")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OperationalError: Could not decode to UTF-8 column 'short_description_html'
with text ' <p>Over a dozen deities have worshipers who are paladins, 
promoting law and good across Faer�n, but it is the Weave itself that 

The offending character is `\xfb`, which decodes to û in CP-1252. Other offending texts include “?nd and slay illithids.”, which uses the "smart quotes" `\x93` and `\x94`.
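For reference, the byte-to-character mapping can be checked directly in Python (a quick sketch using the byte values quoted above):

```python
# CP-1252 assigns these bytes to exactly the characters the text clearly meant:
assert b"\xfb".decode("cp1252") == "û"
assert b"\x93".decode("cp1252") == "“"   # left smart quote
assert b"\x94".decode("cp1252") == "”"   # right smart quote

# The same bytes are not valid UTF-8, which is why sqlite3 complains:
try:
    b"Faer\xfbn".decode("utf-8")
except UnicodeDecodeError as exc:
    pass  # 'utf-8' codec can't decode byte 0xfb ...
```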

The question "SQLite, python, unicode, and non-utf data" details how this problem can be solved when using sqlite3 on its own.

However, I am using SQLAlchemy. How can I deal with CP-1252 encoded data in an SQLite database, when I am using SQLAlchemy?


Edit:

This would also apply for any other funny encodings in an SQLite TEXT field, like latin-1, cp437, and so on.

Li-aung Yip

2 Answers


SQLAlchemy and SQLite are behaving normally. The solution is to fix the non-UTF-8 data in the database.

I wrote the below, drawing inspiration from https://stackoverflow.com/a/2395414/1191425 . It:

  • loads up the target SQLite database
  • lists all columns in all tables
  • if the column is a text, char, or clob type - including variants like varchar and longtext - it re-encodes the data from the INPUT_ENCODING to UTF-8.

INPUT_ENCODING = 'cp1252'  # The encoding you want to convert from

import sqlite3

db = sqlite3.connect('dnd_fixed.sqlite')
# BLOB arguments arrive as bytes; decode them with the source encoding.
db.create_function('FIXENCODING', 1, lambda s: bytes(s).decode(INPUT_ENCODING))
cur = db.cursor()
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
tables = [t[0] for t in tables]
for table in tables:
    columns = cur.execute('PRAGMA table_info(%s)' % table).fetchall()  # Note: pragma arguments can't be parameterized.
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        column_type = (column_type or '').lower()
        if ('char' in column_type) or ('text' in column_type) or ('clob' in column_type):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))
db.commit()
db.close()

After this script runs, all `text`, `char`, and `clob` fields are in UTF-8 and no more Unicode decoding errors will occur. I can now Faerûn to my heart's content.
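The round trip can be sanity-checked on a throwaway in-memory database (a sketch; the table name and value are stand-ins for the real data):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.create_function("FIXENCODING", 1, lambda b: bytes(b).decode("cp1252"))
db.execute("CREATE TABLE t (name TEXT)")
# CAST keeps the raw cp1252 bytes but stores them as TEXT,
# reproducing the "bad UTF-8 in a TEXT column" situation.
db.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", ("Faerûn".encode("cp1252"),))
db.execute('UPDATE t SET name = FIXENCODING(CAST(name AS BLOB))')
# After the fix, the column reads back as proper UTF-8 text:
assert db.execute("SELECT name FROM t").fetchone()[0] == "Faerûn"
```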

Li-aung Yip

If you connect via a URI, you can add the following options to your DB connection URI:

import urllib.parse

DB_CONNECTION = "mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}"
DB_OPTIONS = {
    "charset": "cp1252",
    "use_unicode": 1,
}
connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urllib.parse.urlencode(DB_OPTIONS),
)

Assuming your SQLite driver can handle those options (pymysql can, but I don't know 100% about sqlite), your queries will then return unicode strings.

Tom Dalton
  • Having a play at a console, and looking at the SQLAlchemy docs for the `sqlite` dialect, I don't think that the `sqlite` dialect supports the `charset` option. – Li-aung Yip Mar 13 '15 at 16:23
  • You might be better off asking this question on the SQLAlchemy mailing list - the author is very active and helpful on it. – Tom Dalton Mar 13 '15 at 16:31
  • I ended up solving it by dropping down to the `sqlite3` module, and re-encoding the contents of all `TEXT` columns from `cp1252` to UTF-8. – Li-aung Yip Mar 14 '15 at 11:44