1

I'm trying to get a .csv file, filter some columns and create a new MySQL table with the new filtered df.

The issue is that I have one particular value within that .csv with a different char set, as you can see below ('\ufeffY'):

df['Supported'].unique() 
array(['Y', '\ufeffY'], dtype=object)

After that, I created a new df filtered with the columns renamed. I'm trying normalize those values. First, I'm removing unwanted possible spaces and then converting it to lower case (this column's value from the .CSV file should have in theory only "y" or "n")

df2['supported'] = df2['supported'].str.strip()
df2['supported'] = df2['supported'].str.lower()

Then, I'm trying to set to "?" anything that is different from "y" or "n", which I believe would fix that '\ufeffY' value from the original df.

df2.loc[~df2['supported'].isin(['y', 'n']), 'supported'] = '?'

If I check df2['supported'].unique() after the operation above, I got this:

array(['y', '?'], dtype=object)

I assumed that the '?' would be, maybe among other values, the '\ufeffY' that I'm trying to get rid of, and then I'm trying to use that df2 with sqlalchemy to create a new table by this:

df2.to_sql('ServersList', engine, index=False, if_exists='replace',
                  dtype={'supported': sqlalchemy.types.Text,
                         'server_name':  sqlalchemy.types.Text,
                         'owner': sqlalchemy.types.Text,
                         'critical': sqlalchemy.types.Text,
                         'business_function': sqlalchemy.types.Text,
                         'IP': sqlalchemy.types.Text,
                         'activated': sqlalchemy.types.DateTime(),
                         'service_type': sqlalchemy.types.Text,
                         'platform': sqlalchemy.types.Text,
                         'OS': sqlalchemy.types.Text,
                         'classification': sqlalchemy.types.Text,
                         'support_team': sqlalchemy.types.Text,
                         'virtual_server': sqlalchemy.types.Text,
                         'host_name': sqlalchemy.types.Text,
                         'location': sqlalchemy.types.Text,
                         'log': sqlalchemy.types.DateTime()}
        )

And that is when I got the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 10: character maps to <undefined>

The message is pretty clear, but I thought that df2 would not have that value anymore and therefore it should work.

Obviously I'm missing something here.

Yaakov Bressler
  • 9,056
  • 2
  • 45
  • 69
  • 1
    Looks like an encoding issue between your database and your machine's default encoding. Perhaps the following resource is helpful? https://stackoverflow.com/q/17912307/10521959 – Yaakov Bressler Dec 17 '20 at 16:40
  • 1
    Thanks Yaakov, indeed it was. I had to use encoding='utf-8' in the .read_csv() call, and I also changed the DB to utf8. I thought that by creating a new df and replacing that value by "?" would avoid this, but for sure I was doing something wrong there. I appreciate your response. – Renato Ramos Dec 17 '20 at 17:17

1 Answers1

0

Well, at the end what fixed that was to add the enconding='utf-8' during the call of the .read_csv() function, and just in case I also changed the DB to the utf8 format, by default it was created as utf8mb4