Combining several replace-statements into one in pandas

Question

There is a DataFrame in pandas, see image below

Basically it is a table scraped from Wikipedia's article: https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle

For further processing, I am trying to clean up the data. So, these statements work well

df['Name'] = df['Name'].str.replace('\d+', '')
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.replace(',', '')
df['Name'] = df['Name'].str.replace('-', '')

But how can I bring all these four statements into one? Probably using regular expressions.

I tried with df['Name'] = df['Name'].str.replace(r'[\d\-,]+', '') but it did not work. Maybe because of the word wrap character that was used.

My desired output is " Ber,li-n2 "-> "Berlin".

The unknown circumstances are going around 'Mönchengladbach1, 5'.

`df['Name'] = df['Name'].str.replace(r'[\d\-,]+', '')` should work. Or `df['Name'] = df['Name'].str.replace(r'[\d,-]+', '').str.strip()` — Wiktor Stribiżew, May 21 '20 at 09:42

Wiktor Stribiżew · Accepted Answer · 2020-05-21T10:05:11.583

1

You are removing the data, so you may join the patterns you remove into a single pattern like the one you have. r'[\d,-]+' is a bit better stylistically.

You may remove any dash punctuation + soft hyphen (\u00AD) using [\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D], so you may want to add these codes to the regex.

Remember to assign the cleaned data back to the column and add .str.stip().

You may use

df['Name'] = df['Name'].str.replace(r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+', '').str.strip()

If you do not want to add str.strip(), add ^\s+ and \s+$ alternatives to the regex:

df['Name'] = df['Name'].str.replace(r'^\s+|[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+|\s+$', '')

Details

^\s+ - 1+ whitespaces at the start of the string
| - or
[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+ - 1 or more Unicode dashes, digits, commas or - chars
| - or
\s+$ - 1+ whitespaces at the end of the string.

edited May 21 '20 at 10:05

answered May 21 '20 at 09:45

Wiktor Stribiżew

607,720
39
448
563

I do not know why but it still does not work with bloody `'Mönchengladbach1, 5'`, see [here](https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle). – Taras May 21 '20 at 10:02
@Taras There is an invisible char, here, `\u00AD`, soft hyphen, I added it to the answer. – Wiktor Stribiżew May 21 '20 at 10:04
Dziękuję bardzo! Works like a charm! Can you be so kind and tell me a little bit more about this "invisible" char? or refer me to a source. – Taras May 21 '20 at 10:21
@Taras I copied your string and pasted into https://r12a.github.io/app-conversion/ convert field, then I instantly saw the char and its code. [\u00AD](https://r12a.github.io/uniview/?charlist=%C2%AD) belongs to `\p{Cf}` category, no idea why. – Wiktor Stribiżew May 21 '20 at 10:25

score 1 · Answer 2 · answered May 21 '20 at 09:48

1

You can go with

df['Name'] = df['Name'].str.replace('(\d+|,|<|>|-)', '')

Put the items you want to sort out into a group, and seperate different options using the pipe |

answered May 21 '20 at 09:48

yoko

516
1
3
18

I do not know why but it still does not work with bloody `'Mönchengladbach1, 5'`, see [here](https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle). – Taras May 21 '20 at 10:02
In case you want a simple conversion from `Mönchengladbach1, 5` to `Mönchengladbach`, this statement works. But its hard to say how it behaves for you, without knowing the exact output of your `df`. Taking a look at the link, i can see that there is a linbreak between `Mönchen` and `gladbach`. Either add it as `\n` to the group of items to be replaces, or put your df into the question, so we can have a look at it. – yoko May 21 '20 at 10:25

Combining several replace-statements into one in pandas­

2 Answers2

Combining several replace-statements into one in pandas