Replacing special characters in a string

Question

I have raw text input whose strings contain special characters. I want to replace these characters so that, after running my code, no special characters remain.

I tried the code below, but I am not sure whether it is right or wrong.

def avoid(x):
    #print(x)
    #value=[]
    for ele in range(0, len(x)):
        p=invalidcharch(ele)
        #value.append(p)
        #value=''.join(p)
        print(p)
    return p

def invalidcharch(e):
    items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
    for i, j in items.items():
        e = e.replace(i, j)
    return e

for col in df.columns:
    df[col]=df[col].apply(lambda x:avoid(x))

However, with the above code I am unable to store the whole string in the variable p. I need p to hold the whole string so that it can replace the cell value. The data contains mixed data types, such as strings and integers.

col A
Junto à Estação de Carcavelos;
Bragança
Situado en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet.
Cartão MOBI.E R. Conselheiro Emídio Navarro (frente ao ISEL)

After change
Junto a Estacao de Carcavelos;
Braganca
Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)

Can you please edit your question and put there the input and output dataframe in text form (so we can copy-paste it)? — Andrej Kesely, Jun 29 '21 at 15:10
check out this answer here : https://stackoverflow.com/q/517923/13357061 — Avandale, Jun 29 '21 at 15:13

David Erickson · Answer 1 · 2021-06-29T15:25:33.717

Adding to Achille Huet's comment, which links to a related question, you can apply unidecode to a pandas DataFrame column like this:

import unidecode
df['col A'] = df['col A'].apply(lambda x: unidecode.unidecode(x))

OR

import unidecode
for col in df.columns:
    df[col]=df[col].apply(lambda x: unidecode.unidecode(x))
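Because the question mentions columns that mix strings and integers, a guard may be needed, since unidecode.unidecode expects a string. The sketch below wraps any converter so that only str values are passed to it. The name only_strings is a hypothetical helper, and the small lambda merely stands in for unidecode.unidecode so that the example runs without third-party packages.

```python
def only_strings(func):
    """Wrap func so it is applied to str values; anything else passes through."""
    def wrapper(value):
        return func(value) if isinstance(value, str) else value
    return wrapper

# Stand-in converter for this demo (an assumption, not part of the answer);
# with unidecode installed you would pass unidecode.unidecode instead.
demo_converter = only_strings(lambda s: s.replace("ç", "c"))

mixed = ["Bragança", 42, None, 3.5]
converted = [demo_converter(v) for v in mixed]
```

With a DataFrame, the same wrapper could presumably be used as df[col].apply(only_strings(unidecode.unidecode)).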

However, since you have already created the special-characters dictionary, you can use it directly:

Create a dictionary special_chars and replace the values across the entire DataFrame by passing regex=True. This should also be faster than applying a function cell by cell. I don't know whether there is a faster solution based on Unicode handling; it also depends on what you are doing with the data. If you are writing to a .csv file, for example, I believe to_csv() has a relevant parameter as well, but I am not sure whether it applies here:

special_chars = {"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"",
"ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N",
"Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O",
"ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}

# replace() returns a new DataFrame, so assign the result to keep it
df = df.replace(special_chars, regex=True)
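One caveat with regex=True: the dictionary keys are interpreted as regular-expression patterns rather than literal text. The dictionary above contains no regex metacharacters, so it is unaffected, but a key such as "." would match every character. The sketch below uses plain re.sub to illustrate the difference; replace_all is a hypothetical helper that only mimics pattern-by-pattern replacement, and the "." entry is added purely to show the problem.

```python
import re

# Shortened mapping; the "." key is an illustrative assumption, not from the answer.
special_chars = {"ç": "c", "·": "-", ".": ""}

# Escape every key so regex metacharacters are matched literally.
escaped = {re.escape(key): value for key, value in special_chars.items()}

def replace_all(text, mapping):
    """Apply each pattern -> replacement pair in turn using re.sub."""
    for pattern, replacement in mapping.items():
        text = re.sub(pattern, replacement, text)
    return text

raw = replace_all("Bragança·1.5", special_chars)  # "." is treated as "any character"
safe = replace_all("Bragança·1.5", escaped)       # "." matches only a literal dot
```

If keys with metacharacters are ever added, passing the escaped dictionary instead, as in df.replace(escaped, regex=True), should keep them literal.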

This is really awesome, did not know that replace takes a dict. — Epsi95, Jun 29 '21 at 15:25

score 1 · Answer 2 · answered Jun 29 '21 at 16:40

We can use Series.str.translate, which is equivalent to str.maketrans + str.translate in Python.

converter = str.maketrans(items) # `items` is the special-characters dict.
df['col A'].str.translate(converter)

0                                              Junto a Estacao de Carcavelos;
1                                                                    Braganca
2    Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3                Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
Name: col A, dtype: object
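For reference, here is a minimal plain-Python sketch of the str.maketrans + str.translate pair that this approach builds on. The shortened items dictionary is for illustration only. It also shows a constraint worth knowing: when str.maketrans is given a dict, each string key must be exactly one character long.

```python
# Shortened copy of the question's dictionary, for illustration only.
items = {"à": "a", "ç": "c", "ã": "a", "º": "", "·": "-"}

# Build the translation table once, then reuse it for every string.
converter = str.maketrans(items)
plain = "Junto à Estação de Carcavelos;".translate(converter)

# A multi-character key is rejected when the table is built.
try:
    str.maketrans({"ss": "s"})
    multi_char_ok = True
except ValueError:
    multi_char_ok = False
```

Because the table is built once and each string is processed in a single pass, this tends to scale better than calling replace once per dictionary entry.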

score 1 · Answer 3 · answered Jun 29 '21 at 17:09

Using the standard unicodedata module:

import unicodedata

df["col A"] = df["col A"].apply(
    lambda x: unicodedata.normalize("NFD", x)
    .encode("ascii", "ignore")
    .decode("utf-8")
)
print(df)

Prints:

                                                                      col A
0                                            Junto a Estacao de Carcavelos;
1                                                                  Braganca
2  Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3              Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
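The chain above works in two steps: NFD normalization splits each accented letter into a base letter plus a combining mark, and encoding to ASCII with "ignore" then drops every non-ASCII character. That second step also removes characters with no decomposition, such as "·" or "º", which the question's dictionary maps explicitly. A possible variant, sketched below, removes only the combining marks; strip_accents is a hypothetical helper name, and it passes non-string values through unchanged on the assumption that columns hold mixed types.

```python
import unicodedata

def strip_accents(value):
    """Remove combining marks from str values; return other values unchanged."""
    if not isinstance(value, str):
        return value
    decomposed = unicodedata.normalize("NFD", value)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "Estação·25º"

# Approach from the answer: everything non-ASCII is dropped.
ascii_only = unicodedata.normalize("NFD", sample).encode("ascii", "ignore").decode("ascii")

# Variant: only the accents are removed; "·" and "º" survive for later mapping.
marks_only = strip_accents(sample)
```

Which behavior is preferable depends on whether the remaining symbols should vanish or be mapped to something specific, for example "·" to "-".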

score 0 · Answer 4 · answered Jun 29 '21 at 15:22

I don't fully understand what you are trying to achieve, but you can try something like this:

items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"} 

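import pandas as pd

df = pd.DataFrame([
    'abcä',
    'Ãbcd12345'
], columns=['colA'])

# Look up each non-ASCII character in items; characters missing from the dict become '_'
df['colA'] = df['colA'].str.replace(r'[^\x00-\x7F]', lambda x: items.get(x.group(0), '_'), regex=True)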

df

    colA
0   abca
1   Abcd12345

For the pattern r'[^\x00-\x7F]', see "Regular expression that finds and replaces non-ascii characters with Python".
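One subtlety in this kind of lookup: writing items.get(ch) or '_' would also replace characters that the dictionary deliberately maps to an empty string, because '' is falsy in Python. Below is a minimal plain-re sketch of the same idea that passes the fallback as the default argument of dict.get instead. The to_ascii function name and the shortened items dictionary are assumptions for illustration.

```python
import re

# Shortened copy of the dictionary, for illustration only.
items = {"ä": "a", "Ã": "A", "º": ""}

NON_ASCII = re.compile(r"[^\x00-\x7F]")

def to_ascii(text, fallback="_"):
    """Replace each non-ASCII character via items; unknown ones become fallback."""
    return NON_ASCII.sub(lambda m: items.get(m.group(0), fallback), text)

mapped = to_ascii("abcä")   # character present in the dict
removed = to_ascii("25º")   # character mapped to the empty string
unknown = to_ascii("€5")    # character missing from the dict
```

Keeping a visible fallback such as '_' makes it easy to spot characters that the dictionary does not cover yet.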

choka · Answer 5 · 2021-06-29T15:43:12.430

0

You can do this simply with the following code, where items is the dictionary from the question:

for i in df.columns:
    df[i] = df[i].replace(items, regex=True)

edited Jun 29 '21 at 15:43

answered Jun 29 '21 at 15:37
