I have already asked a regex question regarding replacing specific patterns (Regex: Match a specific pattern, exclude if match is in a specific context). This is all done for preprocessing text data for training.
Now I would like to use a regex to replace everything except Unicode letters in a pandas DataFrame. I have used
to get the regular expression \p{^L}+, which seems to solve my problem. I realised later that this expression works in Perl (Perl Compatible Regular Expressions [PCRE]) but not necessarily in Python. I then found the regex package, which supports this expression too. However, pandas doesn't seem to accept patterns compiled with the regex package yet, or I have used it the wrong way:
import regex
import pandas as pd
df = pd.DataFrame({"text": ["Room: 25m²", "I have eaten ¼ of the cake."]})
df["text"] = df["text"].str.replace(regex.compile(r"\p{^L}+"), " ")
# Raises TypeError: object of type '_regex.Pattern' has no len()
Therefore I have tried to find ways to use the re package. I have found an answer here. So I have used it this way:
import re
import pandas as pd
df = pd.DataFrame({"text": ["Room: 25m²", "I have eaten ¼ of the cake."]})
df["text"] = df["text"].str.replace(r"[\W\d_]", " ", regex=True)
This replaces many special characters, but not the superscript two in "m²" or the fraction character "¼". I wouldn't consider either of them a letter; to me they are numeric or special characters in Unicode. How can I deal with such characters? Is it possible with the re package? I would rather not list specific code points to match these cases. A general solution would be appreciated.
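To make the problem concrete, here is a minimal sketch using only the standard library. It first prints how Python classifies "²" and "¼": as far as I can tell, both are in Unicode category No (other number), which re treats as word characters (\w) but not as decimal digits (\d), so [\W\d_] skips them. The helper replace_non_letters is a hypothetical name of my own, not part of re or pandas; it swaps the regex for str.isalpha(), which is documented to be true only for characters in the Unicode letter categories.

```python
import re
import unicodedata


def replace_non_letters(text: str, filler: str = " ") -> str:
    """Replace every run of non-letter characters with `filler`.

    Hypothetical helper: str.isalpha() is True only for characters in the
    Unicode letter categories (Lu, Ll, Lt, Lm, Lo), so it stands in for
    the Unicode property "L" that re does not support.
    """
    pieces = []
    in_gap = False  # True while we are inside a run of non-letters
    for ch in text:
        if ch.isalpha():
            pieces.append(ch)
            in_gap = False
        elif not in_gap:
            # First non-letter of a run: emit one filler for the whole run
            pieces.append(filler)
            in_gap = True
    return "".join(pieces)


# Show why [\W\d_] misses these characters: \w matches them, \d does not.
for ch in "²¼":
    print(
        repr(ch),
        unicodedata.category(ch),
        "matches \\w:", re.match(r"\w", ch) is not None,
        "matches \\d:", re.match(r"\d", ch) is not None,
    )

print(repr(replace_non_letters("Room: 25m²")))
print(repr(replace_non_letters("I have eaten ¼ of the cake.")))
```

Assuming the column holds plain strings, the helper could be applied with df["text"].map(replace_non_letters). This sidesteps re rather than answering whether re alone can do it.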