
I'm aware there are a zillion posts about encoding/decoding problems on the forum, but after going through half of them I wasn't able to find one that did the trick for me. So apologies if it's somewhere in the other half...

My issue:

I have a database (MS SQL Server) containing multilingual data (collation Latin1_General_CI_AS), and I am using pymssql and pandas to convert it to a dataframe for use outside of Python. Everything works fine except for the non-ASCII characters, and I'm completely stuck at the moment.

This is my (simplified) python 3 code:

import pandas as pd
import pymssql

def rm_main():

    # connect with an explicit UTF-8 charset
    conn = pymssql.connect(server='***', port=4133, user='***', password='***',
                           database='**', charset='UTF-8')

    q = """
    SELECT goodmorning FROM myTable
    """

    # read the query result into a dataframe
    df = pd.read_sql(q, conn)

    # my attempt at forcing UTF-8: this gives bytes, not readable text
    df['encoded_goodmorning'] = df.goodmorning.str.encode('utf-8')

    return df

What is in my database is a field called goodmorning, and it contains the following string: Dzień dobry

When I fetch the data as above using just pymssql, the data is retrieved correctly.
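
To be concrete, this is roughly what I mean by "using just pymssql" (a sketch with the same placeholder credentials and my simplified table/column names):

    import pymssql

    # same connection details as in the function above
    conn = pymssql.connect(server='***', port=4133, user='***', password='***',
                           database='**', charset='UTF-8')

    # a plain pymssql cursor returns the string intact
    cur = conn.cursor()
    cur.execute("SELECT goodmorning FROM myTable")
    for row in cur.fetchall():
        print(row[0])   # prints: Dzień dobry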

When I want to use the read_sql method from pandas, I get the dreadful question mark as follows: Dzie? dobry

Using the encoding step I get a bit further in the right direction, as I get the following: b'Dzie\xc5\x84 dobry', where C5 84 is the UTF-8 hex encoding of my small Latin n with acute. So my content is complete, but it is not very reader-friendly.

Now where I fail miserably is getting this back into the 'friendly format' (so that it just says 'Dzień dobry' again).
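
In other words, the round trip I'm after looks something like this (a standalone sketch with a hand-made dataframe, not the real query result, assuming the bytes really are valid UTF-8):

    import pandas as pd

    # a tiny stand-in for the dataframe returned by read_sql
    df = pd.DataFrame({'goodmorning': ['Dzień dobry']})

    # encoding gives bytes; decoding those bytes should give the readable text back
    df['encoded_goodmorning'] = df.goodmorning.str.encode('utf-8')
    df['decoded_goodmorning'] = df.encoded_goodmorning.str.decode('utf-8')

    print(df.decoded_goodmorning[0])   # hoped-for output: Dzień dobry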

What am I overlooking here? Are there better approaches? It seems like something very obvious, but whatever I tried (encoding/decoding) either doesn't make a difference or simply breaks the code.

Wokoman