I am trying to remove all punctuation, special characters, and numbers from a string, but I get this error: error: bad escape \p at position 2

Does this mean that Python's regex does not recognize \p{S} and \p{P}?

The code is:

import re
name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
re.findall(r'[^\p{P}\p{S}\s\d]+', name.lower())

I expect as output the same as highlighted by regex101: https://regex101.com/r/HJZAUU/1

Any help?

c1377554
  • Use the PyPI `regex` module to get support for Unicode category classes. Or, since you only need to match letters, just use `r'[^\W\d_]+'`; see the [**regex demo**](https://regex101.com/r/D2ELAm/1) – Wiktor Stribiżew Nov 08 '19 at 08:47
  • This is the same question as the following: https://stackoverflow.com/questions/54330673/how-to-fix-error-bad-escape-u-at-position-0 – Zen Zac Nov 08 '19 at 08:50
  • Not the same, @ZenZac. Look at what I asked and the solution Wiktor proposed: both are totally different from the link you shared. – c1377554 Nov 08 '19 at 08:51
  • Closed with the correct [Python regex matching Unicode properties](https://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties) thread. – Wiktor Stribiżew Nov 08 '19 at 08:57

2 Answers

I followed @WiktorStribiżew's comment and used the PyPI regex module, since it supports Unicode category classes. So I simply did:

# Install the module first, from a shell: pip install regex
import regex as re
name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
re.findall(r'[^\p{P}\p{S}\s\d]+', name.lower())

I get output:

['url', 'dsds', 'diasa', 'dksdjsk', 'dskdjs', 'dskjdks', 'dsds', 'dskdjskds', 'dsjdsjdhs', 'fddjfd', 'djshdhjs', 'kdjs', 'dskjds', 'öfdfdjfkdj']
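
For comparison, the letters-only pattern suggested in the comments can be tried with the built-in `re` module, so nothing extra needs to be installed. This is only a minimal sketch, assuming Python 3 (where `\w` and `\d` are Unicode-aware for `str` patterns); the variable name `words` is just illustrative.

```python
import re

name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"

# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# which for this input leaves runs of letters only.
words = re.findall(r'[^\W\d_]+', name.lower())
print(words)
```

For this particular string it should produce the same list as above. The two patterns are not equivalent in general, though: the `\p{...}` version keeps every character that is not punctuation, a symbol, whitespace, or a digit, which is a broader set than letters.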

c1377554

Yes, unfortunately: Python's built-in re module does not support \p{...} property classes.
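
A quick way to confirm this locally is to compile the pattern and catch the error. The sketch below assumes Python 3.6 or later (earlier versions may treat the unknown escape differently); `error_message` is just an illustrative name.

```python
import re

pattern = r'[^\p{P}\p{S}\s\d]+'

try:
    re.compile(pattern)
    error_message = None  # would mean this Python version accepted the pattern
except re.error as exc:
    # The built-in module rejects \p as an unknown escape.
    error_message = str(exc)

print(error_message)
```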

Check out regex101.com: change the flavor to Python and paste your regex into the field at the top.

It gives you this explanation on the right:

[^\p{P}\p{S}\s\d]+

gm <Python>
Match a single character not present in the list below [^\p{P}\p{S}\s\d]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\p matches the character p literally (case sensitive) <<<<<<<<<<<<<<<<<<<<<<<<<<<<
{P} matches a single character in the list {P} (case sensitive)<<<<<<<<<<<<<<<<<<
\p matches the character p literally (case sensitive)
{S} matches a single character in the list {S} (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d matches a digit (equal to [0-9])
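
If installing a third-party module is not an option, the same filtering can be approximated with the standard library alone. This is a rough sketch, not an exact equivalent of the `\p{P}\p{S}\s\d` class: it assumes `unicodedata` general categories starting with `P` or `S` stand in for `\p{P}` and `\p{S}`, `str.isspace()` stands in for `\s`, and category `Nd` stands in for `\d`. The helper names `is_kept` and `split_tokens` are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_kept(ch):
    # Drop punctuation (P*), symbols (S*), whitespace, and decimal digits (Nd).
    category = unicodedata.category(ch)
    return not (category[0] in "PS" or category == "Nd" or ch.isspace())


def split_tokens(text):
    # Group consecutive characters by whether they are kept, and join the kept runs.
    return ["".join(run) for kept, run in groupby(text, key=is_kept) if kept]


name = "URL-dsds diasa:dksdjsk dskdjs_dskjdks 23232 dsds32 dskdjskds&dsjdsjdhs fddjfd%djshdhjs kdjs¤dskjds öfdfdjfkdj"
tokens = split_tokens(name.lower())
print(tokens)
```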
tst