I have already asked a regex question regarding replacing specific patterns (Regex: Match a specific pattern, exclude if match is in a specific context). This is all done for preprocessing text data for training.

Now I would like to use a regex to replace everything except Unicode letters in a pandas DataFrame. I found an expression that seems to solve my problem: \p{^L}+. I realised later that this expression works in Perl-style engines (Perl Compatible Regular Expressions [PCRE]) but not necessarily in Python. I then found the regex package, which supports this expression too. However, pandas doesn't seem to support regex patterns yet, or I have used it the wrong way:

import regex
import pandas as pd
df = pd.DataFrame({"text": ["Room: 25m²", "I have eaten ¼ of the cake."]})
df["text"] = df["text"].str.replace(regex.compile("\p{^L}+"), " ")

# Raises TypeError: object of type '_regex.Pattern' has no len()

Therefore I have tried to find ways to use the re package. I have found an answer here. So I have used it this way:

import re
import pandas as pd
df = pd.DataFrame({"text": ["Room: 25m²", "I have eaten ¼ of the cake."]})
df["text"] = df["text"].str.replace("[\W\d_]", " ")

It does replace a lot of special characters, but it leaves the superscript in "m²" and the fraction "¼" untouched. I wouldn't consider either character a letter; to me they are numerics or special characters in Unicode. So how can I deal with those characters? Is it possible with the re package? I would rather not list specific code points to match those cases. A general solution would be appreciated.

Sento
  • `.str.replace(regex.compile("\p{L}+")` would remove all the Unicode letters. I think you wanted to use `\P{L}` in a PCRE regex. Please clarify what you want to obtain from `"Room: 25m²", "I have eaten ¼ of the cake."` Maybe `Room m I have eaten of the cake`? The question is, what kind of chars do you want to remove that are also matched with `\w` but do not belong to `\d`? The `²` and `¼` belong to `\p{No}` that is not matched with `\d`. – Wiktor Stribiżew Sep 11 '18 at 07:41
  • You are right. I need to negate the expression. `\p{^L}+` is what I want. Sorry for the confusion. I have edited the question. I want to obtain all letters. `"Room: 25m²", "I have eaten ¼ of the cake."` would be `"Room m", "I have eaten of the cake"`. – Sento Sep 11 '18 at 07:41
  • Why are you importing `re` if you're not using it? – melpomene Sep 11 '18 at 07:45
  • Try https://regex101.com/r/UVCVAV/1. It is a bit long, but it handles all `\p{No}` chars. If you care only about BMP plane, remove all ranges with `\UXXXXXXXX` chars. – Wiktor Stribiżew Sep 11 '18 at 07:45
  • This seems to work. Thank you very much Wiktor! – Sento Sep 11 '18 at 08:02
  • I posted [the answer](https://stackoverflow.com/a/52271447/3832970), please consider accepting/upvoting. – Wiktor Stribiżew Sep 11 '18 at 08:13
  • Is there also an option to make it work with the regex package in pandas? – Sento Sep 11 '18 at 08:33

2 Answers

The [\W\d_] pattern matches any non-word char (any char not matched by \w), any digit matched by \d, and the underscore. Note that \d in a Unicode-aware Python 3 regex only matches \p{Nd} (Number, decimal digit):

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]).

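The chars this pattern does not remove in your string, ² and ¼, belong to the \p{No} Unicode category (Number, other). Python's \w matches them, so \W does not, and \d skips them because they are not in \p{Nd}.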
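To check this for any character, the standard library's unicodedata module reports its general category. A minimal sketch; the helper name describe is made up for illustration:

```python
import re
import unicodedata


def describe(ch):
    """Return (Unicode category, matched by the digit class, matched by the word class)."""
    return (
        unicodedata.category(ch),
        re.fullmatch(r"\d", ch) is not None,
        re.fullmatch(r"\w", ch) is not None,
    )


# Inspect the characters from the question next to an ordinary digit and letter.
for ch in "5²¼m":
    print(repr(ch), describe(ch))
```

Characters reported as No that are still matched by \w are exactly the ones [\W\d_] leaves behind.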

So, if you plan to also remove all those chars from \p{No}, you need to add them to the pattern:

r'[\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A47\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00016B5B-\U00016B61\U0001D360-\U0001D371\U0001E8C7-\U0001E8CF\U0001F100-\U0001F10C\W\d_]+'

See the regex demo.
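If you would rather not maintain that list by hand, a similar character class can probably be generated from unicodedata. This is only a sketch: chars_in_category and non_letter_pat are hypothetical names, and the result depends on the Unicode version bundled with your Python build, so it may not be identical to the ranges above.

```python
import re
import sys
import unicodedata


def chars_in_category(cat):
    """Collect every code point whose Unicode general category equals cat."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == cat
    ]


# Build the class: all "Number, other" chars plus non-word chars, digits and "_".
no_chars = "".join(re.escape(c) for c in chars_in_category("No"))
non_letter_pat = re.compile("[" + no_chars + r"\W\d_]+")

for text in ["Room: 25m²", "I have eaten ¼ of the cake."]:
    print(non_letter_pat.sub(" ", text).strip())
```

Since non_letter_pat is an ordinary re pattern, it should be usable wherever the hand-written pattern is.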

You can see the chars listed on this page.

Also, be aware of the Number, letter category; see the \p{Nl} char list here.
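A different route that avoids listing code points altogether is to filter on the general category directly, keeping only chars whose category starts with L. This is a sketch rather than a drop-in answer; keep_letters is a hypothetical helper, and note that it also drops combining marks (category M), which may matter for decomposed accented text.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    """True for any char in a Unicode letter category (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def keep_letters(text, fill=" "):
    """Replace each run of non-letter chars with fill and trim the ends."""
    parts = []
    for letters, run in groupby(text, key=is_letter):
        parts.append("".join(run) if letters else fill)
    return "".join(parts).strip()


for text in ["Room: 25m²", "I have eaten ¼ of the cake.", "Ⅷ abc"]:
    print(keep_letters(text))
```

Because only L categories are kept, \p{No} chars such as ² and \p{Nl} chars such as Ⅷ are both removed. In pandas this could be applied per cell with df["text"].map(keep_letters). If the regex package is available, df["text"].map(lambda s: regex.sub(r"\P{L}+", " ", s)) should do the same replacement without relying on .str.replace accepting its pattern objects.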

Wiktor Stribiżew

This should work for you:

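import re
import pandas as pd
df = pd.DataFrame({"text": ["Room: 25m²", "I have eaten ¼ of the cake."]})

# Remove everything except ASCII letters and whitespace.
# Note: this also drops non-ASCII letters such as "é" or "ż".
regex_pat = re.compile(r"[^a-zA-Z\s]")
# regex=True is required for a compiled pattern in recent pandas versions.
df["text"] = df["text"].str.replace(regex_pat, "", regex=True)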

Output:

0                       Room m
1    I have eaten  of the cake
Name: text, dtype: object
imjoseangel