
I am a beginner in regex and would like to know how to solve this problem with it. I am currently preprocessing German text. German has a few special characters in its alphabet (ä, ö, ü), but those letters can also be written as digraphs (ae, oe, ue). So I simply used the replace method, which worked fine.

import pandas as pd
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
df["text"] = df["text"].str.replace("ae", "ä")
df["text"] = df["text"].str.replace("Ae", "Ä")
df["text"] = df["text"].str.replace("oe", "ö")
df["text"] = df["text"].str.replace("Oe", "Ö")
df["text"] = df["text"].str.replace("ue", "ü")
df["text"] = df["text"].str.replace("Ue", "Ü")
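To make the effect of these six literal replacements concrete, here is a minimal plain-Python sketch that applies them to the two sample words. The names `PAIRS` and `naive_umlauts` are made up for this illustration and are not part of the code above; the sketch assumes that replacing literal substrings per string behaves the same as the column-wise calls.

```python
# Illustrative only: the same six literal replacements, applied to single strings.
PAIRS = [("ae", "ä"), ("Ae", "Ä"), ("oe", "ö"), ("Oe", "Ö"), ("ue", "ü"), ("Ue", "Ü")]


def naive_umlauts(word: str) -> str:
    """Apply every digraph replacement unconditionally, in the order listed."""
    for digraph, umlaut in PAIRS:
        word = word.replace(digraph, umlaut)
    return word


results = [naive_umlauts(w) for w in ["Uebergang", "euer"]]
print(results)  # ['Übergang', 'eür']
```

"Uebergang" comes out as intended, while "euer" is turned into "eür".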

But there are also specific patterns where the replacement shouldn't take place, as in the word "euer". With some help from this post, I tried to write a working regex: Regex Pattern to Match, Excluding when... / Except between

df["text"] = df["text"].str.replace("[AaÄäEe]ue|(ue)", "ü", regex=True)

So if any of the characters in the brackets [AaÄäEe] comes directly before "ue", I would like to exclude that case. Otherwise "ue" should be replaced by "ü". But this doesn't work, so how do you do it? Thanks in advance.

Sento

2 Answers


Should do the trick:

df["text"] = df["text"].str.replace("[^AaÄäEe](ue)", "ü", regex=True)

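Inside a character class, a leading '^' negates the class, so [^AaÄäEe] matches any single character that is not one of those letters.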
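A small sketch with Python's built-in `re` module (which pandas uses for regex replacement) may help show what the negated class does in practice. The test words are chosen only for illustration. Note that the class matches one real character in front of "ue", so that character is part of the match and gets replaced as well.

```python
import re

# [^AaÄäEe] matches (and consumes) exactly one character before "ue",
# so that character is replaced together with "ue".
pattern = re.compile(r"[^AaÄäEe](ue)")

stueck = pattern.sub("ü", "Stueck")  # "tue" is matched and replaced as a whole
euer = pattern.sub("ü", "euer")      # "e" precedes "ue", so nothing matches
ueber = pattern.sub("ü", "ueber")    # no character precedes "ue", so nothing matches

print(stueck)  # Sück
print(euer)    # euer
print(ueber)   # ueber
```

The excluded cases are left alone, but the preceding letter is lost in "Stueck", and a word that starts with "ue" is not converted at all.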

Ryanless
  • Thank you for the answer. If I use the word "Stueck", it would replace "tue" with "ü", but I only want to replace the "ue" part. When specific characters (those in the brackets) come before the "ue", I want to exclude those matches. So the combinations "Aue", "aue", "Äue", "äue", "Eue", "eue" should be excluded; otherwise "ue" should be replaced with "ü". – Sento Aug 23 '18 at 09:46
  • Use a negative lookbehind instead: [`(?<![AaÄäEe])ue`](https://regex101.com/r/liazk6/1). – 41686d6564 stands w. Palestine Aug 23 '18 at 09:51
  • @AhmedAbdelhameed Thank you very much for the answer. I like both solutions (yours and the one from @WiktorStribiżew). They give me more insight into regex. – Sento Aug 23 '18 at 10:15
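The negative lookbehind suggested in the comments can be tried out with Python's built-in `re` module. This is a minimal sketch with illustrative test words, not a complete treatment of German spelling. A lookbehind only checks the preceding character without consuming it, so only "ue" itself is replaced.

```python
import re

# (?<![AaÄäEe]) asserts that the character before "ue" is not one of these
# letters; the assertion is zero-width, so the preceding character is kept.
lookbehind = re.compile(r"(?<![AaÄäEe])ue")

converted = {w: lookbehind.sub("ü", w) for w in ["Stueck", "euer", "Mauer", "ueber"]}
print(converted)
# {'Stueck': 'Stück', 'euer': 'euer', 'Mauer': 'Mauer', 'ueber': 'über'}
```

In pandas the same pattern should work as `df["text"].str.replace(r"(?<![AaÄäEe])ue", "ü", regex=True)`; capitalised "Ue" and the other digraphs would still need their own replacements.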

You may use

import pandas as pd

# Map each digraph to its umlaut
dct = {'ae' : 'ä', 'Ae' : 'Ä', 'oe' : 'ö', 'Oe' : 'Ö', 'ue' : 'ü', 'Ue' : 'Ü'}
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
# regex=True is needed for a callable replacement in pandas 2.0+, where regex defaults to False
df['text'].str.replace(r'[AaÄäEe]ue|([aouAOU]e)', lambda x: dct[x.group(1)] if x.group(1) else x.group(), regex=True)
# => 0    Übergang
#    1        euer
#    Name: text, dtype: object

The [AaÄäEe]ue|([aouAOU]e) pattern matches:

  • [AaÄäEe]ue - A, a, Ä, ä, E or e followed by the ue substring
  • | - or
  • ([aouAOU]e) - Group 1: a, o, u, A, O or U and then e

The replacement lambda, lambda x: dct[x.group(1)] if x.group(1) else x.group(), works as follows: if Group 1 matched, dct[x.group(1)] returns the replacement string; otherwise the first alternative matched, and the whole match is put back unchanged.
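The same match-and-skip idea can be checked on single strings with `re.sub`, which also accepts a callable replacement. The sketch below reuses the pattern and dictionary from above; the names `UMLAUTS`, `PATTERN`, and `restore_umlauts`, as well as the extra test words, are made up for this illustration.

```python
import re

UMLAUTS = {"ae": "ä", "Ae": "Ä", "oe": "ö", "Oe": "Ö", "ue": "ü", "Ue": "Ü"}
# First alternative: contexts to skip. Second alternative: digraphs to convert (Group 1).
PATTERN = re.compile(r"[AaÄäEe]ue|([aouAOU]e)")


def restore_umlauts(text: str) -> str:
    """Replace digraphs captured in Group 1; put skipped contexts back unchanged."""
    def repl(match: re.Match) -> str:
        digraph = match.group(1)
        return UMLAUTS[digraph] if digraph else match.group()
    return PATTERN.sub(repl, text)


words = ["Uebergang", "euer", "Stueck", "Mauer", "Oel"]
restored = [restore_umlauts(w) for w in words]
print(restored)  # ['Übergang', 'euer', 'Stück', 'Mauer', 'Öl']
```

Because the alternatives are tried left to right at each position, "eue" and "aue" are consumed by the first alternative before the second one can see the "ue" inside them.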

Wiktor Stribiżew
  • Thank you very much for the answer. This seems to solve my problem and shortens the code I have written. – Sento Aug 23 '18 at 10:12