Regex: Match a specific pattern, exclude if match is in a specific context

Question

I am a beginner in regex and wanted to ask how you can solve this problem with regex. At the moment I am trying to preprocess German text. German has a few specific characters in its alphabet (ä, ö, ü). However, those letters can also be written as digraphs (ae, oe, ue). So I simply used the replace method, which worked fine.

import pandas as pd
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
df["text"] = df["text"].str.replace("ae", "ä")
df["text"] = df["text"].str.replace("Ae", "Ä")
df["text"] = df["text"].str.replace("oe", "ö")
df["text"] = df["text"].str.replace("Oe", "Ö")
df["text"] = df["text"].str.replace("ue", "ü")
df["text"] = df["text"].str.replace("Ue", "Ü")

But there are also specific patterns where the replacement shouldn't take place, like in the word "euer". With the help of this post, I tried to write a working regex: Regex Pattern to Match, Excluding when... / Except between

df["text"] = df["text"].str.replace("[AaÄäEe]ue|(ue)", "ü")

So if any of the characters in the brackets [AaÄäEe] is directly followed by "ue", I would like to exclude those cases. Otherwise "ue" should be replaced by "ü". But this doesn't work, so how do you do it? Thanks in advance.

score 1 · Answer 1 · answered Aug 23 '18 at 09:27


Should do the trick:

df["text"] = df["text"].str.replace("[^AaÄäEe](ue)", "ü", regex=True)

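A '^' at the start of a character class negates it, so [^AaÄäEe] matches any single character that is not one of those letters.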


Ryanless


Thank you for the answer. If I use the word "Stueck" it would replace "tue" with "ü". But I only want to replace the "ue" part. When there are specific characters before the "ue" (those in the brackets), then I want to exclude those matches. So the combinations "Aue", "aue", "Äue", "äue", "Eue", "eue" should be excluded; otherwise replace "ue" with "ü". – Sento Aug 23 '18 at 09:46

Use a negative lookbehind: [`(?<![AaÄäEe])ue`](https://regex101.com/r/liazk6/1) instead. – 41686d6564 stands w. Palestine Aug 23 '18 at 09:51
@AhmedAbdelhameed Thank you very much for the answer. I like both solutions (yours and the one from @WiktorStribiżew). They give me more insight into regex. – Sento Aug 23 '18 at 10:15
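
For readers following the comments above, here is a minimal sketch of the negative-lookbehind idea next to the negated character class from the answer. It uses the standard re module rather than pandas so that it runs on its own; the helper name replace_ue and the sample words are assumptions made for illustration, not part of the original thread.

```python
import re

# Negative lookbehind: match "ue" only when it is NOT preceded by one of AaÄäEe.
# The lookbehind is zero-width, so the preceding character is never consumed.
UE_LOOKBEHIND = re.compile(r"(?<![AaÄäEe])ue")

# Negated character class: this CONSUMES the preceding character,
# so that character is replaced together with "ue".
UE_NEGATED_CLASS = re.compile(r"[^AaÄäEe]ue")


def replace_ue(text: str) -> str:
    """Replace 'ue' with 'ü' unless it directly follows A, a, Ä, ä, E or e."""
    return UE_LOOKBEHIND.sub("ü", text)


words = ["Stueck", "euer", "Mauer", "fuer"]
lookbehind_results = [replace_ue(w) for w in words]
negated_result = UE_NEGATED_CLASS.sub("ü", "Stueck")

print(lookbehind_results)  # ['Stück', 'euer', 'Mauer', 'für']
print(negated_result)      # 'Sück' (the "t" was swallowed by the match)
```

The same pattern should carry over to the pandas column as df["text"].str.replace(r"(?<![AaÄäEe])ue", "ü", regex=True). Note that it only covers lowercase "ue"; the other digraphs would need their own rules.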

score 1 · Accepted Answer · answered Aug 23 '18 at 09:50

You may use

import re
import pandas as pd
dct = {'ae' : 'ä', 'Ae' : 'Ä', 'oe' : 'ö', 'Oe' : 'Ö', 'ue' : 'ü', 'Ue' : 'Ü'}
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
df['text'].str.replace(r'[AaÄäEe]ue|([aouAOU]e)', lambda x: dct[x.group(1)] if x.group(1) else x.group(), regex=True)
# => 0    Übergang
#    1        euer
#    Name: text, dtype: object

The [AaÄäEe]ue|([aouAOU]e) pattern matches:

[AaÄäEe]ue - A, a, Ä, ä, E or e followed with ue substring
| - or
([aouAOU]e) - Group 1: a, o, u, A, O or U and then e

The lambda x: dct[x.group(1)] if x.group(1) else x.group() lambda expression does the following: if Group 1 matched, dct[x.group(1)] returns the replacement string. Otherwise the first alternative matched (one of the excluded contexts such as "eue"), and the whole match is put back unchanged.
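
The same "match the exclusion first, capture only what you want to replace" technique can be sketched without pandas. The names UMLAUTS, DIGRAPH_PATTERN and restore_umlauts below are hypothetical, chosen for this illustration only; the pattern and the dictionary are the ones from the answer.

```python
import re

# Digraph -> umlaut mapping, same content as dct in the answer.
UMLAUTS = {"ae": "ä", "Ae": "Ä", "oe": "ö", "Oe": "Ö", "ue": "ü", "Ue": "Ü"}

# First alternative: excluded contexts (Aue, aue, Äue, äue, Eue, eue), no group.
# Second alternative: the digraph to replace, captured in group 1.
DIGRAPH_PATTERN = re.compile(r"[AaÄäEe]ue|([aouAOU]e)")


def restore_umlauts(text: str) -> str:
    """Replace ae/oe/ue digraphs with umlauts, skipping the excluded contexts."""

    def repl(match: re.Match) -> str:
        digraph = match.group(1)
        # Group 1 is None when the exclusion alternative matched:
        # return the matched text unchanged in that case.
        return UMLAUTS[digraph] if digraph else match.group()

    return DIGRAPH_PATTERN.sub(repl, text)


samples = ["Uebergang", "euer", "Stueck", "Mauer", "Koenig"]
restored = [restore_umlauts(s) for s in samples]
print(restored)  # ['Übergang', 'euer', 'Stück', 'Mauer', 'König']
```

Because the regex engine tries the alternatives left to right at each position, "eue" and "aue" are consumed by the first alternative before the second one can see the "ue" inside them. A purely rule-based approach like this will still produce some false positives (for example, "aktuell" becomes "aktüll"), so a word list may be needed for real-world text.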

Thank you very much for the answer. This seems to solve my problem and shortens the code I have written. — Sento, Aug 23 '18 at 10:12
