1

I have raw text input whose strings contain special characters. I want to replace these special characters so that, after running the code, no special characters remain.

I tried the code below, but I am not sure whether it is correct.

def avoid(x):
    #print(x)
    #value=[]
    for ele in range(0, len(x)):
        p=invalidcharch(ele)
        #value.append(p)
        #value=''.join(p)
        print(p)
    return p

def invalidcharch(e):
    items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
    for i, j in items.items():
        e = e.replace(i, j)
    return e

for col in df.columns:
    df[col]=df[col].apply(lambda x:avoid(x))

However, with the code above I am unable to store the whole string in the variable p. I need p to hold the whole string so that it can replace the cell value. The data contains mixed data types, such as strings and integers.

col A
Junto à Estação de Carcavelos;
Bragança
Situado en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet.
Cartão MOBI.E R. Conselheiro Emídio Navarro (frente ao ISEL)

After change
Junto a Estacao de Carcavelos;
Braganca
Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)

manuja

5 Answers

2

Adding to Achille Huet's comment that links this question, you can use this on a pandas dataframe column like this:

import unidecode
df['col A'] = df['col A'].apply(lambda x: unidecode.unidecode(x))

OR

import unidecode
for col in df.columns:
    df[col]=df[col].apply(lambda x: unidecode.unidecode(x))

However, since you have already created the special characters dictionary, you can use it:

Create a dictionary special_chars and replace the values across the entire dataframe by passing regex=True. This should also be faster than applying a function to every cell. I don't know whether there is a faster solution based on Unicode handling; it also depends on what you do with the result. If you are writing to a .csv file, for example, I believe to_csv() has a relevant parameter as well, but I am not sure it applies here:

special_chars = {"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"",
"ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N",
"Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O",
"ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}

df = df.replace(special_chars, regex=True)  # replace() returns a new dataframe, so assign it back
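
With regex=True, the dictionary keys are interpreted as regular-expression patterns. None of the keys above is a regex metacharacter, but if the mapping ever gained one (for example "." or "?"), it would need escaping first. A small sketch of that precaution using only the standard library; the extra_chars dictionary is hypothetical and not part of the question:

```python
import re

# Hypothetical mapping that includes a regex metacharacter as a key.
extra_chars = {"ç": "c", ".": "-"}

# Unescaped, "." is a pattern that matches every character.
unescaped = re.sub(".", "-", "a.ç")

# Escaping each key makes it match only itself.
escaped_chars = {re.escape(key): value for key, value in extra_chars.items()}

result = "a.ç"
for pattern, replacement in escaped_chars.items():
    result = re.sub(pattern, replacement, result)

print(unescaped)
print(result)
```

A dictionary escaped this way could then be passed to df.replace(..., regex=True) in place of the raw one.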
David Erickson
1

We can use Series.str.translate, which is equivalent to str.maketrans + str.translate in Python.

converter = str.maketrans(items) # `items` is special chars dict.
df['col A'].str.translate(converter)

0                                              Junto a Estacao de Carcavelos;
1                                                                    Braganca
2    Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3                Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
Name: col A, dtype: object
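
The equivalence mentioned above can be checked on plain strings without pandas. This is a minimal sketch: the shortened items dictionary and the helper name translate_value are illustrative assumptions, and non-string values are passed through unchanged because the question mentions mixed data types.

```python
# Shortened version of the question's mapping (illustrative subset).
items = {"à": "a", "ã": "a", "ç": "c", "ú": "u", "ó": "o", "í": "i", "º": ""}

# str.maketrans accepts a dict whose keys are single characters;
# values may be strings of any length, including the empty string.
converter = str.maketrans(items)


def translate_value(value):
    """Translate strings with the table; return other types unchanged."""
    if isinstance(value, str):
        return value.translate(converter)
    return value


print(translate_value("Junto à Estação de Carcavelos;"))
print(translate_value(42))
```

A function like this could be passed to Series.apply for columns that mix strings and numbers.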
Ch3steR
1

Using standard unicodedata module:

import unicodedata

df["col A"] = df["col A"].apply(
    lambda x: unicodedata.normalize("NFD", x)
    .encode("ascii", "ignore")
    .decode("utf-8")
)
print(df)

Prints:

                                                                      col A
0                                            Junto a Estacao de Carcavelos;
1                                                                  Braganca
2  Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3              Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
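
This approach works because NFD normalization splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with "ignore" then discards the mark. Characters that have no such decomposition, such as "·" or "°", are dropped entirely rather than mapped to a replacement. A minimal sketch on plain strings; the helper name strip_accents and the pass-through for non-string values are assumptions, not part of the answer above:

```python
import unicodedata


def strip_accents(value):
    """Remove combining marks from strings; return other types unchanged."""
    if not isinstance(value, str):
        return value
    decomposed = unicodedata.normalize("NFD", value)
    # Non-ASCII code points (including the combining marks) are discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Bragança"))
print(strip_accents("20°"))  # the degree sign has no decomposition and is dropped
print(strip_accents(2021))
```

If a specific replacement is needed for such characters (the question maps "·" to "-"), a dictionary-based method would be the safer choice.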
Andrej Kesely
0

I don't fully understand what you are trying to achieve, but you can try something like this:

items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"} 

import pandas as pd

df = pd.DataFrame([
    'abcä',
    'Ãbcd12345'
], columns=['colA'])

# Unmapped characters become '_'; characters mapped to '' are removed as the dict specifies.
df['colA'] = df['colA'].str.replace(r'[^\x00-\x7F]', lambda x: items.get(x.group(0), '_'), regex=True)

df
    colA
0   abca
1   Abcd12345

For the pattern r'[^\x00-\x7F]', see the question "Regular expression that finds and replaces non-ascii characters with Python".
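
The character class [^\x00-\x7F] matches any single character outside the 7-bit ASCII range, so the callback runs once per non-ASCII character. A minimal sketch of the same idea with the standard re module; the shortened items dictionary, the '_' placeholder for unmapped characters, and the name replace_non_ascii are illustrative assumptions:

```python
import re

items = {"ä": "a", "Ã": "A", "º": ""}  # illustrative subset of the mapping
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def replace_non_ascii(text, placeholder="_"):
    """Map known characters via `items`; use the placeholder for the rest."""
    return NON_ASCII.sub(lambda m: items.get(m.group(0), placeholder), text)


print(replace_non_ascii("abcä"))
print(replace_non_ascii("1º Ãbcd ß"))
```

Passing the placeholder as the default of dict.get keeps characters that are deliberately mapped to an empty string from being turned into the placeholder.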

Epsi95
0

You can do that simply with the following code:

for i in df.columns:
    df[i] = df[i].replace(items, regex=True)
choka