I'm aware there are a zillion posts about encoding/decoding problems on the forum, but after going through half of them I wasn't able to find one that did the trick for me. So please be nice if the answer is somewhere in the other half...
My issue:
I have a database (MS SQL) containing multilingual data (collation Latin1_General_CI_AS), and I am using pymssql and pandas to load it into a DataFrame for use outside of Python. Everything works fine except for the non-Latin characters, and I'm completely stuck at the moment.
This is my (simplified) Python 3 code:
import pandas as pd
import pymssql

def rm_main():
    conn = pymssql.connect(server='***', port=4133, user='***', charset='UTF-8', password='***', database='**')
    q = """
    SELECT goodmorning FROM myTable
    """
    df = pd.read_sql(q, conn)
    df['encoded_goodmorning'] = df.goodmorning.str.encode('utf-8')
    return df
My database has a field called goodmorning, and it contains the following string: Dzień dobry
When I fetch the data using just pymssql, it is retrieved correctly.
When I use the read_sql method from pandas, I get the dreaded question mark, as follows: Dzie? dobry
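For illustration, here is a minimal sketch of one way such a question mark can appear. It assumes (this is not confirmed by anything above) that somewhere along the way the text is pushed through a single-byte code page such as cp1252, which is the code page SQL Server uses for varchar data under Latin1_General collations and which has no "ń". The variable names are hypothetical.

```python
# Illustration only: no database involved, the string is typed in directly.
text = "Dzień dobry"

# cp1252 has no "ń", so with errors="replace" it is swapped for "?".
lossy = text.encode("cp1252", errors="replace")
print(lossy)     # b'Dzie? dobry'

# UTF-8 can represent every character, so nothing is lost.
lossless = text.encode("utf-8")
print(lossless)  # b'Dzie\xc5\x84 dobry'
```

If the question mark only shows up after the DataFrame leaves Python, the lossy step may be in whatever consumes the output rather than in read_sql itself; that too is a guess.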
Using the encoding step I get a bit further in the right direction, as I get the following: b'Dzie\xc5\x84 dobry', where c5 84 is the UTF-8 byte sequence for my small Latin n with acute. So my content is complete, but it is not very reader-friendly.
Now where I fail miserably is getting this back into the 'friendly format' (so that it just says 'Dzień dobry' again).
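As a point of reference, turning UTF-8 bytes back into readable text in Python 3 is a single decode call. The sketch below uses a hard-coded bytes literal as a stand-in for one cell of the encoded column; the variable names are hypothetical.

```python
# Stand-in for one value of the encoded column.
raw = b"Dzie\xc5\x84 dobry"

# bytes -> str: interpret the bytes as UTF-8.
friendly = raw.decode("utf-8")
print(friendly)  # Dzień dobry

# Encoding and then decoding with the same codec returns the original text.
original = "Dzień dobry"
round_trip = original.encode("utf-8").decode("utf-8")
print(round_trip == original)  # True
```

For a whole pandas column of bytes, Series.str.decode('utf-8') applies the same step to each element. Note that a Python 3 str is already Unicode, so if the value is correct when it arrives in the DataFrame, the encode step may not be needed at all.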
What am I overlooking here? Are there better approaches to do this? It seems like something very obvious, but whatever I tried (encoding/decoding) either doesn't make a difference or simply breaks the code.