1

There is a DataFrame in pandas, see image below

input

Basically it is a table scraped from Wikipedia's article: https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle

For further processing, I am trying to clean up the data. So, these statements work well

df['Name'] = df['Name'].str.replace('\d+', '')
df['Name'] = df['Name'].str.strip()
df['Name'] = df['Name'].str.replace(',', '')
df['Name'] = df['Name'].str.replace('­-', '')

But how can I bring all these four statements into one? Probably using regular expressions.

I tried with df['Name'] = df['Name'].str.replace(r'[\d\-,]+', '') but it did not work. Maybe because of the word wrap character that was used.

My desired output is " Ber,li-n2 "-> "Berlin".

The unknown circumstances are going around 'Mönchen­gladbach1, 5'.

Taras
  • 266
  • 6
  • 23

2 Answers2

1

You are removing the data, so you may join the patterns you remove into a single pattern like the one you have. r'[\d,-]+' is a bit better stylistically.

You may remove any dash punctuation + soft hyphen (\u00AD) using [\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D], so you may want to add these codes to the regex.

Remember to assign the cleaned data back to the column and add .str.stip().

You may use

df['Name'] = df['Name'].str.replace(r'[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+', '').str.strip()

If you do not want to add str.strip(), add ^\s+ and \s+$ alternatives to the regex:

df['Name'] = df['Name'].str.replace(r'^\s+|[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+|\s+$', '')

Details

  • ^\s+ - 1+ whitespaces at the start of the string
  • | - or
  • [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\d,-]+ - 1 or more Unicode dashes, digits, commas or - chars
  • | - or
  • \s+$ - 1+ whitespaces at the end of the string.
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • I do not know why but it still does not work with bloody `'Mönchen­gladbach1, 5'`, see [here](https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle). – Taras May 21 '20 at 10:02
  • @Taras There is an invisible char, here, `\u00AD`, soft hyphen, I added it to the answer. – Wiktor Stribiżew May 21 '20 at 10:04
  • Dziękuję bardzo! Works like a charm! Can you be so kind and tell me a little bit more about this "invisible" char? or refer me to a source. – Taras May 21 '20 at 10:21
  • @Taras I copied your string and pasted into https://r12a.github.io/app-conversion/ convert field, then I instantly saw the char and its code. [\u00AD](https://r12a.github.io/uniview/?charlist=%C2%AD) belongs to `\p{Cf}` category, no idea why. – Wiktor Stribiżew May 21 '20 at 10:25
1

You can go with

df['Name'] = df['Name'].str.replace('(\d+|,|­<|>|-)', '') 

Put the items you want to sort out into a group, and seperate different options using the pipe |

yoko
  • 516
  • 1
  • 3
  • 18
  • I do not know why but it still does not work with bloody `'Mönchen­gladbach1, 5'`, see [here](https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9Fst%C3%A4dte_in_Deutschland#Tabelle). – Taras May 21 '20 at 10:02
  • In case you want a simple conversion from `Mönchen­gladbach1, 5` to `Mönchen­gladbach`, this statement works. But its hard to say how it behaves for you, without knowing the exact output of your `df`. Taking a look at the link, i can see that there is a linbreak between `Mönchen` and `gladbach`. Either add it as `\n` to the group of items to be replaces, or put your df into the question, so we can have a look at it. – yoko May 21 '20 at 10:25