3

I need to collapse runs of two or more repeated punctuation characters in a string, replacing the repeats with a space.

"asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww qqqqqq aaaaa"

to

"asdasdasd - adasdasd asda -  asda wadsda +- + wwww qqqqqq aaaaa"

Using regex101 app I've created this one:

https://regex101.com/r/vdR5T1/1/

But when I tried it in Python:

import re
texto = "asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww qqqqqq aaaaa"
rx = re.compile(r'([[:punct:]])\1{2,}')
texto = rx.sub(' ', texto)
print(texto)

I got this warning:

FutureWarning: Possible nested set at position 2
  rx = re.compile(r'([[:punct:]])\1{2,}')

How can I run this (or a similar) regex in Python?

Nick
celsowm
  • `re` does not recognize POSIX bracket expressions. You can use the [regex](https://pypi.org/project/regex/) module that does. – dawg Dec 04 '20 at 02:04

2 Answers

4

Python's re module doesn't recognise POSIX bracket expressions, so [[:punct:]] looks like a nested character class (hence the warning message). You can replace it with a character class that contains all ASCII punctuation characters, e.g. [!-/:-@[-`{-~]. Note also that your regex requires 3 or more of the same character (the initial capture group plus 2 or more repetitions). To match 2 or more, use + instead of {2,}, and replace with \1 followed by a space so that the repeated character appears once in the output:

import re
texto = "asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww -- qqqqqq aaaaa"
rx = re.compile(r'([!-/:-@[-`{-~])\1+')
texto = rx.sub(r'\1 ', texto)
print(texto)

Output:

asdasdasd - adasdasd asda -  asda wadsda +- + wwww -  qqqqqq aaaaa
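
As a quick sanity check on the four ranges, the sketch below compares what the character class matches against string.punctuation from the standard library. It also shows one possible alternative: building the class from string.punctuation with re.escape. The names ascii_punct, matched, escaped_class and rx are only illustrative.

```python
import re
import string

# The four ASCII ranges used in the character class above.
ascii_punct = re.compile(r'[!-/:-@[-`{-~]')

# Collect every ASCII character the class matches.
matched = {chr(i) for i in range(128) if ascii_punct.fullmatch(chr(i))}
print(matched == set(string.punctuation))  # True

# Alternative: let re.escape build an equivalent class from string.punctuation.
escaped_class = '[' + re.escape(string.punctuation) + ']'
rx = re.compile('(' + escaped_class + r')\1+')
print(rx.sub(r'\1 ', 'wwww -- qqqqqq'))  # wwww -  qqqqqq
```

Either spelling covers ASCII punctuation only; non-ASCII punctuation is not matched.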
Nick
  • @celsowm I just realised I based my answer on your code, not your expected output, so it produced the wrong output. I've updated it to match your expected output (and also to demonstrate it working for a `--` string). – Nick Dec 04 '20 at 01:57
  • can you update with all chars of :punct: ? https://www.petefreitag.com/cheatsheets/regex/character-classes/ – celsowm Dec 04 '20 at 02:22
  • @celsowm the four ranges in my character class should cover all of them see http://www.asciitable.com/ – Nick Dec 04 '20 at 02:25
  • there is only one strange behavior: when the pattern is at the end, like: texto = "asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww qqqqqq aaaaa +++" – celsowm Dec 04 '20 at 02:40
  • @celsowm how is that behaviour strange? – Nick Dec 04 '20 at 02:49
  • any repeated punctuation at the end of the string is erased – celsowm Dec 04 '20 at 12:45
  • @celsowm the `+++` on the end is replaced with a single `+` and a space. Is that not what you want? You don't say so in your question. – Nick Dec 04 '20 at 19:48
1

Python's re library does not support POSIX character classes. re parses [[:punct:]] as the character class [[:punct:] (matching any one of [, :, p, u, n, c, t) followed by a literal ] char, so it matches strings such as [], :], p], etc.
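
A small sketch of that parsing behaviour, assuming a CPython version in which re still treats the pattern this way and only emits a FutureWarning (the warning is silenced here so the output stays readable; posix_like is just an illustrative name):

```python
import re
import warnings

# Compile the POSIX-style pattern with the "Possible nested set" warning silenced.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    posix_like = re.compile(r'[[:punct:]]')

# One of "[", ":", "p", "u", "n", "c", "t" followed by a literal "]" matches...
print(bool(posix_like.fullmatch('p]')))  # True
print(bool(posix_like.fullmatch(':]')))  # True
print(bool(posix_like.fullmatch('[]')))  # True

# ...while actual punctuation such as hyphens does not.
print(posix_like.search('-------'))      # None
```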

The [:punct:] POSIX character class matches a set of chars that belong to the Punctuation and Symbol Unicode categories, and their current list can be matched with a very long character class that includes the corresponding ranges. Note that [!-/:-@[-`{-~] is just a small :punct: subset that only matches ASCII punctuation/symbols.
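
For illustration only, the same idea can be sketched without any regex by asking the standard unicodedata module for each character's general category. This assumes, following the description above, that "punct" means any category starting with P (Punctuation) or S (Symbol); whether that matches a given regex engine's [:punct:] exactly is not verified here. is_punct and collapse_punct_runs are hypothetical helper names.

```python
import unicodedata
from itertools import groupby

def is_punct(ch):
    # Assumption: "punct" = Unicode general categories P* and S*.
    return unicodedata.category(ch)[0] in ('P', 'S')

def collapse_punct_runs(text):
    # Replace each run of 2+ identical punctuation chars with the char plus a space.
    out = []
    for ch, run in groupby(text):
        n = len(list(run))
        if n > 1 and is_punct(ch):
            out.append(ch + ' ')
        else:
            out.append(ch * n)
    return ''.join(out)

print(collapse_punct_runs("asda ------- asda +-----+ qqqqqq"))
# asda -  asda +- + qqqqqq
print(collapse_punct_runs("wow ¡¡¡ ok"))
# wow ¡  ok
```

This trades the long character class for a per-character lookup, which tracks whatever Unicode version the running Python ships with.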

Here is a full Unicode aware regex that can match repeated :punct: chars and replace them with the first one matched:

import re
re_punct = r'[!-/\:-@\[-`\{-~\u00A1-\u00A9\u00AB\u00AC\u00AE-\u00B1\u00B4\u00B6-\u00B8\u00BB\u00BF\u00D7\u00F7\u02C2-\u02C5\u02D2-\u02DF\u02E5-\u02EB\u02ED\u02EF-\u02FF\u0375\u037E\u0384\u0385\u0387\u03F6\u0482\u055A-\u055F\u0589\u058A\u058D-\u058F\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0606-\u060F\u061B\u061E\u061F\u066A-\u066D\u06D4\u06DE\u06E9\u06FD\u06FE\u0700-\u070D\u07F6-\u07F9\u07FE\u07FF\u0830-\u083E\u085E\u0964\u0965\u0970\u09F2\u09F3\u09FA\u09FB\u09FD\u0A76\u0AF0\u0AF1\u0B70\u0BF3-\u0BFA\u0C77\u0C7F\u0C84\u0D4F\u0D79\u0DF4\u0E3F\u0E4F\u0E5A\u0E5B\u0F01-\u0F17\u0F1A-\u0F1F\u0F34\u0F36\u0F38\u0F3A-\u0F3D\u0F85\u0FBE-\u0FC5\u0FC7-\u0FCC\u0FCE-\u0FDA\u104A-\u104F\u109E\u109F\u10FB\u1360-\u1368\u1390-\u1399\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DB\u1800-\u180A\u1940\u1944\u1945\u19DE-\u19FF\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B6A\u1B74-\u1B7C\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2010-\u2027\u2030-\u205E\u207A-\u207E\u208A-\u208E\u20A0-\u20BF\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u2140-\u2144\u214A-\u214D\u214F\u218A\u218B\u2190-\u2426\u2440-\u244A\u249C-\u24E9\u2500-\u2775\u2794-\u2B73\u2B76-\u2B95\u2B98-\u2BFF\u2CE5-\u2CEA\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E4F\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3001-\u3004\u3008-\u3020\u3030\u3036\u3037\u303D-\u303F\u309B\u309C\u30A0\u30FB\u3190\u3191\u3196-\u319F\u31C0-\u31E3\u3200-\u321E\u322A-\u3247\u3250\u3260-\u327F\u328A-\u32B0\u32C0-\u32FE\u3300-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA700-\uA716\uA720\uA721\uA789\uA78A\uA828-\uA82B\uA836-\uA839\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAA77-\uAA79\uAADE\uAADF\uAAF0\uAAF1\uAB5B\uABEB\uFB29\uFBB2-\uFBC1\uFD3E\uFD3F\uFDFC\uFDFD\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE66\uFE68-\uFE6B\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65\uFFE0-\uFFE6\uFFE8-\uFFEE\uFFFC\uFFFD\U00010100-\U00010102\U00010137-\U0001013F\U00010179-\U00010189\U0001018C-\U0001018E\U00010190-\U0001019B\U000101A0\U000101D0-\U000101FC\U0001039F\U000103D0\U0001056F\U00010857\U00010877\U00010878\U0001091F\U0001093F\U00010A50-\U00010A58\U00010A7F\U00010AC8\U00010AF0-\U00010AF6\U00010B39-\U00010B3F\U00010B99-\U00010B9C\U00010F55-\U00010F59\U00011047-\U0001104D\U000110BB\U000110BC\U000110BE-\U000110C1\U00011140-\U00011143\U00011174\U00011175\U000111C5-\U000111C8\U000111CD\U000111DB\U000111DD-\U000111DF\U00011238-\U0001123D\U000112A9\U0001144B-\U0001144F\U0001145B\U0001145D\U000114C6\U000115C1-\U000115D7\U00011641-\U00011643\U00011660-\U0001166C\U0001173C-\U0001173F\U0001183B\U000119E2\U00011A3F-\U00011A46\U00011A9A-\U00011A9C\U00011A9E-\U00011AA2\U00011C41-\U00011C45\U00011C70\U00011C71\U00011EF7\U00011EF8\U00011FD5-\U00011FF1\U00011FFF\U00012470-\U00012474\U00016A6E\U00016A6F\U00016AF5\U00016B37-\U00016B3F\U00016B44\U00016B45\U00016E97-\U00016E9A\U00016FE2\U0001BC9C\U0001BC9F\U0001D000-\U0001D0F5\U0001D100-\U0001D126\U0001D129-\U0001D164\U0001D16A-\U0001D16C\U0001D183\U0001D184\U0001D18C-\U0001D1A9\U0001D1AE-\U0001D1E8\U0001D200-\U0001D241\U0001D245\U0001D300-\U0001D356\U0001D6C1\U0001D6DB\U0001D6FB\U0001D715\U0001D735\U0001D74F\U0001D76F\U0001D789\U0001D7A9\U0001D7C3\U0001D800-\U0001D9FF\U0001DA37-\U0001DA3A\U0001DA6D-\U0001DA74\U0001DA76-\U0001DA83\U0001DA85-\U0001DA8B\U0001E14F\U0001E2FF\U0001E95E\U0001E95F\U0001ECAC\U0001ECB0\U0001ED2E\U0001EEF0\U0001EEF1\U0001F000-\U0001F02B\U0001F030-\U0001F093\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F110-\U0001F16C\U0001F170-\U0001F1AC\U0001F1E6-\U0001F202\U0001F210-\U0001F23B\U0001F240-\U0001F248\U0001F250\U0001F251\U0001F260-\U0001F265\U0001F300-\U0001F6D5\U0001F6E0-\U0001F6EC\U0001F6F0-\U0001F6FA\U0001F700-\U0001F773\U0001F780-\U0001F7D8\U0001F7E0-\U0001F7EB\U0001F800-\U0001F80B\U0001F810-\U0001F847\U0001F850-\U0001F859\U0001F860-\U0001F887\U0001F890-\U0001F8AD\U0001F900-\U0001F90B\U0001F90D-\U0001F971\U0001F973-\U0001F976\U0001F97A-\U0001F9A2\U0001F9A5-\U0001F9AA\U0001F9AE-\U0001F9CA\U0001F9CD-\U0001FA53\U0001FA60-\U0001FA6D\U0001FA70-\U0001FA73\U0001FA78-\U0001FA7A\U0001FA80-\U0001FA82\U0001FA90-\U0001FA95]'
text = "asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww qqqqqq aaaaa +++"
print( re.sub(fr'({re_punct})\1+', r'\1 ', text).rstrip() )
# => asdasdasd - adasdasd asda -  asda wadsda +- + wwww qqqqqq aaaaa +

Note that the .rstrip() at the end strips the space added when a match is replaced at the end of the string (along with any other trailing whitespace).
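
If trailing whitespace that was already in the input has to be preserved, one possible variation is to pass a function as the replacement and skip the added space when the match ends the string. This is only a sketch: PUNCT is restricted to ASCII punctuation for brevity, and collapse is a hypothetical helper name.

```python
import re

# ASCII-only punctuation class, kept short for the example.
PUNCT = r'[!-/:-@\[-`{-~]'

def collapse(text):
    def repl(m):
        # Append the space only when the run is not at the very end of the string.
        if m.end() == len(m.string):
            return m.group(1)
        return m.group(1) + ' '
    return re.sub(rf'({PUNCT})\1+', repl, text)

print(repr(collapse("wwww qqqqqq aaaaa +++")))  # 'wwww qqqqqq aaaaa +'
print(repr(collapse("asda --- asda ")))         # 'asda -  asda '
```

Unlike .rstrip(), this leaves whitespace that was present in the original string untouched.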

If you install the PyPI regex module (pip install regex / pip3 install regex), you can use [[:punct:]] directly and get full Unicode support. I highly recommend the PyPI regex module, especially if you need Unicode support, have complex patterns to use, and have large texts to parse.

import regex
text = "asdasdasd - adasdasd asda ------- asda wadsda +-----+ wwww qqqqqq aaaaa +++"
print( regex.sub(r'([[:punct:]])\1+', r'\1 ', text).rstrip() )
# => asdasdasd - adasdasd asda -  asda wadsda +- + wwww qqqqqq aaaaa +

See the Python demo.

Wiktor Stribiżew