0

I'm trying to pull data from SQL Server using pyodbc and load it into a dataframe, then export it to an HTML file, except I keep receiving the following Unicode error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 15500: ordinal not in range(128)

Here is my current setup (encoding instructions per docs):

cnxn =  pyodbc.connect('DSN=Planning;UID=USER;PWD=PASSWORD;')
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252', to=unicode)
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252', to=unicode)
cnxn.setdecoding(pyodbc.SQL_WMETADATA, encoding='cp1252', to=unicode)
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')
cursor = cnxn.cursor()

with open('Initial Dataset.sql') as f:
    initial_query = f.read()

cursor.execute(initial_query)
columns = [column[0] for column in cursor.description]
initial_data = cursor.fetchall()
i_df = pd.DataFrame.from_records(initial_data, columns=columns)
i_df.to_html('initial.html')

An odd but useful point to note is that when I try to export a CSV:

i_df.to_csv('initial.csv')

I get the same error, however when I add:

i_df.to_csv('initial.csv', encoding='utf-8')

It works. Can someone help me understand this encoding issue?

Side note: I've also tried using a sqlalchemy connection and pandas.read_sql() and the same error persists.

Jon Behnken
  • 560
  • 1
  • 3
  • 14
  • The error means you are trying to encode an (Unicode) character not representable in ASCII to ASCII. I'm just guessing, but your data frame returned by pandas is encoded in utf-8. I suspect the to=unicode is wrong, but just a shot in the dark. – stephanmg Nov 01 '19 at 14:38
  • I understand what the error means, I just don't understand why it's occurring. The dataframe is `utf-8` encoded. The [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_html.html?highlight=to_html#pandas.DataFrame.to_html) for `pandas.to_html` are rather scant. Why would it try to convert to ASCII when generating the HTML? – Jon Behnken Nov 01 '19 at 14:44
  • I'm not sure, but I would check the pandas.to_html source code to see what's happening there (Maybe encoding defaults to ASCII, I dont know). – stephanmg Nov 01 '19 at 14:50
  • 1
    You shouldn't need any `setencoding`/`setdecoding` calls at all when working with SQL Server, especially not encoding to UTF-8, which SQL Server ODBC does not use (it uses UTF-16, and that is the default encoding for pyodbc). – Gord Thompson Nov 01 '19 at 14:53
  • @GordThompson According to the [docs](https://github.com/mkleehammer/pyodbc/wiki/Unicode) for `pyodbc`, SQL drivers can behave differently and return different encodings. For example, according to my SQL Server, the encoding is `latin1` however, after reading [this answer](https://stackoverflow.com/questions/39968891/python-3-and-b-x92-decodelatin1) I determined the proper encoding is actually `windows-1252`. So I do believe the decoding/encoding calls are necessary. – Jon Behnken Nov 01 '19 at 15:14
  • 1
    From [here](https://github.com/mkleehammer/pyodbc/wiki/Unicode#microsoft-sql-server): "SQL Server's recent drivers match the specification, so no configuration is necessary. Using the pyodbc defaults is recommended." – Gord Thompson Nov 01 '19 at 15:28
  • I see now that you are right that the decoding/encoding calls are unnecessary, however only because I want to stay in Unicode. Note that I'm using Python 2.7, so if I wanted `str` results, I would have to make those calls, as per the docs we both linked to. – Jon Behnken Nov 01 '19 at 15:35

1 Answers1

1

The second answer on this question seems to be an acceptable workaround, except for Python 2.x users, you must use io, so:

import io

html = df.to_html()
with io.open("mypage.html", "w", encoding="utf-8") as file:
    file.write(html)

It was not included in the latest release, but it looks like the next version of pandas will have an encoding option for to_html(), see docs (line 2228).

Jon Behnken
  • 560
  • 1
  • 3
  • 14
  • Yes, that's correct. The encoding should be applied to the output file, not the communications between pyodbc and the SQL Server. – Gord Thompson Nov 01 '19 at 15:29
  • The problem ultimately lies in pandas, as `to_html()` seems to enforce ASCII encoding. It appears they will be fixing that issue in an upcoming release. – Jon Behnken Nov 01 '19 at 15:36
  • "`to_html()` seems to enforce ASCII encoding" - No, more likely that `to_html` uses the *default* encoding for the file when you only pass it a string (filepath) for `buf=`, and the default string encoding for Python_2 is ASCII. – Gord Thompson Nov 01 '19 at 15:47
  • @GordThompson okay, but with Python 2 and no way to to tell the function to use a different encoding, that is practically the same thing, no? – Jon Behnken Nov 01 '19 at 15:52
  • The way to "tell the function" is to pass it a `buf` argument that is a StringIO-like object instead of just a (string) path. – Gord Thompson Nov 01 '19 at 15:54