
I'm trying to clean a SQL retail database, but I'm confused by the contents of the first name column. Ideally, I would like to end up with a clean set of names.

What I attempted was:

    # Change the dtype of first_name and last_name to pandas' string dtype
    user_dataframe['first_name'] = user_dataframe['first_name'].astype('string')
    user_dataframe['last_name'] = user_dataframe['last_name'].astype('string')

This only changed the data type from object to string, and now I am not sure how to search for the strings that I do not want.

The dirty strings come in this format:

Hans Jürgen
Bärbel
Süleyman
Sören
Klaus-Jürgen
2GU3G97VI1
I7IJDAPMIM
Gülten
DD0K0FUDRY

I am thinking of using a regular expression to drop any rows that match the pattern "character followed by number", but I'm not sure what some of the symbols in the dirty data mean.
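A minimal sketch of the regex idea, assuming that a genuine first name never contains a digit (that rule is an assumption, not something the data guarantees). The sample values are copied from the list above, and the column name follows the code in the question:

```python
import pandas as pd

# Sample values copied from the question.
user_dataframe = pd.DataFrame({
    "first_name": [
        "Hans Jürgen", "Bärbel", "Klaus-Jürgen",
        "2GU3G97VI1", "I7IJDAPMIM", "Gülten", "DD0K0FUDRY",
    ]
}).astype("string")

# Assumption: any value containing a digit is a junk code, not a name.
has_digit = user_dataframe["first_name"].str.contains(r"\d", regex=True, na=False)

# Keep only the rows whose first_name has no digit.
clean = user_dataframe[~has_digit].reset_index(drop=True)
print(clean["first_name"].tolist())
```

Names with umlauts, spaces, or hyphens pass through untouched because the mask only looks for digits.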

Darman

2 Answers


The problem is the encoding used when the data is read into Python/pandas. Try specifying a different encoding when reading the data. See https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html and, potentially, https://docs.python.org/3/library/codecs.html#standard-encodings.
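As a rough illustration of why the codec matters, the sketch below builds a small in-memory CSV encoded as Latin-1 (a hypothetical stand-in for the real export, whose actual encoding is not known here) and reads it with the `encoding` parameter of `pandas.read_csv`. It also shows how decoding bytes with the wrong codec produces the kind of garbled characters described in the question:

```python
import io

import pandas as pd

# Hypothetical export: a few names written out as Latin-1 (ISO-8859-1) bytes.
raw = "first_name\nBärbel\nSüleyman\n".encode("latin-1")

# Reading with the matching codec recovers the names.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df["first_name"].tolist())

# Decoding with a mismatched codec garbles the non-ASCII characters.
garbled = "Bärbel".encode("utf-8").decode("latin-1")
print(garbled)
```

If the codec is unknown, trying a few candidates from the standard-encodings list and checking whether the umlauts come out intact is a reasonable first step.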

Also see the following answer: Converting special charactes such as ü and à back to their original, latin alphbet counterparts in C#

guscht
  • I know that pandas uses utf-8 as the default encoding when reading into a DataFrame. Does that mean the weird symbols come from reading the data as utf-8 when it should have been read from SQL using a different encoding? – Darman Jan 10 '23 at 19:17

As someone mentioned, the encoding was what caused the issue. Using the utf-8-sig encoding fixed the weird characters.
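For context, here is a small sketch of what `utf-8-sig` does differently from plain `utf-8`: it strips a leading byte order mark (BOM) if one is present. The sample bytes are made up for illustration, and this assumes the source data actually started with a BOM, which this answer does not confirm:

```python
import codecs

# Made-up sample: UTF-8 text preceded by a byte order mark.
raw = codecs.BOM_UTF8 + "first_name\nGülten\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # the BOM survives as "\ufeff"
as_utf8_sig = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(as_utf8.splitlines()[0]))
print(repr(as_utf8_sig.splitlines()[0]))
```

The same codec name can be passed to pandas, for example `pd.read_csv(path, encoding="utf-8-sig")`, where `path` is a placeholder for the actual file.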

Darman