1

I need a regex that matches repeated (more than one) punctuation characters and symbols: basically any repeated non-alphanumeric, non-whitespace character, such as ..., ???, !!!, ###, @@@, +++, etc. It must be the same character that is repeated, not a mixed sequence like "!?@".

I tried [^\s\w]+, and while that covers all of the !!!, ???, $$$ cases, it gives me more than I want, since it also matches "!?@".

Can someone enlighten me please? Thanks.

user2017502

4 Answers

2

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

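\1 is a backreference to the first capture group, that is, to whatever single character the parenthesized character class matched. \1+ therefore requires that same character to appear one or more additional times.

Extend the list of punctuation characters and symbols as needed.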
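As a minimal sketch of how this pattern might be used from Python (the helper name `find_repeats` and the sample text are assumptions for illustration, not part of the answer above):

```python
import re

# The explicit character class from the pattern above; extend it as needed.
REPEATED = re.compile(r'([.\?#@+,<>%~`!$^&\(\):;])\1+')

def find_repeats(text):
    """Return every run of two or more identical listed symbols (hypothetical helper)."""
    # m.group(0) is the whole run; m.group(1) is only the single captured character.
    return [m.group(0) for m in REPEATED.finditer(text)]

print(find_repeats("Wait... what??? no!?@ yes!!"))
```

This should print `['...', '???', '!!']`; the mixed sequence `!?@` is skipped because no character in it is immediately followed by itself.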

Sina Iravanian
2

I think you're looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)]
Out: ['&&&&', '((((', '****']

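This gives you a list of the runs, excluding the 4 spaces in that string as well as mixed sequences like '*(@^'. The outer group captures the whole run and the inner group (\2) captures the single repeated character; re.findall returns both for each match, so the comprehension keeps only the run.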

If that's not exactly what you want, you might edit your question with an example string and precisely what output you wanted to see.
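One caveat worth knowing: `\w` matches the underscore, so `[^\w\s]` never matches `_`, and a run such as `___` is skipped. Below is a hedged sketch of one way to include it, assuming you do want underscores treated as symbols (the name `symbol_runs` is hypothetical, and it uses `re.finditer` with a single group rather than the nested groups above):

```python
import re

# [^\w\s] excludes "_" because \w covers letters, digits and the underscore.
# Adding "|_" inside the captured group is one way to treat "_" as a symbol too.
RUN = re.compile(r'([^\w\s]|_)\1+')

def symbol_runs(text):
    """Return runs of two or more identical symbols, underscores included."""
    return [m.group(0) for m in RUN.finditer(text)]

print(symbol_runs("a___b !!! x?!y --"))
```

This should print `['___', '!!!', '--']`. If underscores should not count, drop the `|_` alternative.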

Firoze Lafeer
1

EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.

Answer written before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here:

How to check that a string is a palindrome using regular expressions?

For a Python solution I would recommend combining a regular expression with a little bit of Python code. The regular expression throws out everything that isn't a run of some sort of punctuation, and then the Python code checks to throw out false matches (matches that are runs of punctuation but not all the same character).

import re
import string

# Character class to match punctuation.  re.escape() backslash-escapes
# the characters that are special inside a character class (such as
# '-', ']', '^' and '\'), so each one is matched literally.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


# alternate version of find_all_punct_runs() using re.finditer()
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

I wrote all_same() the way I did so that it will work just as well on an iterator as on a string. The Python built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I made an argument for the basis case desired and made it default to True to match the behavior of all().
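To make the basis-case behavior concrete, here is a small illustrative check of `all_same()` on a few inputs (the function is repeated so the snippet runs on its own; the sample inputs are assumptions, not from the original answer):

```python
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        # Empty input: return whatever the caller chose as the basis case.
        return basis_case
    return all(x == first for x in itr)

print(all_same("!!!"))         # True: every character equals the first
print(all_same("!?@"))         # False: mixed characters
print(all_same(""))            # True: default basis case, like all([])
print(all_same("", False))     # False: the caller asked for False on empty input
print(all_same(iter([7, 7])))  # True: works on a plain iterator as well
```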

This does as much of the work as possible using the internals of Python (the regular expression engine or all()) so it should be pretty fast. For large input texts you might want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example. The example also returns a generator expression rather than a list. You can always force it to make a list:

lst = list(find_all_punct_runs(text))
steveha
  • `-` and `[` (not sure for Python) and `]` are special in character class, so is `^` if at beginning. – nhahtdh Feb 01 '13 at 03:33
  • Try `re.escape(string.punctuation)` instead. That works. (Confirmation that it is correct: `all(re.match('[%s]' % re.escape(string.punctuation), letter) for letter in string.punctuation) == True`.) – Chris Morgan Feb 01 '13 at 03:38
  • @ChrisMorgan: Wow, that's so much better. It's obvious what it's doing and I don't need to worry about whether I got it right. – steveha Feb 01 '13 at 03:41
  • @nhahtdh: thanks for pointing that out. I thought about `^`, and it's not magic unless it's at the beginning. I thought about `[`, and inside a character class it's just another character in the class. But I failed to consider `]` and that meant my pattern wasn't correct. As suggested by Chris Morgan, the proper thing to do is just call `re.escape()` which will handle everything! – steveha Feb 01 '13 at 03:45
  • Regexes in most modern languages are not regular languages in the strictest sense. A backreference can be used to check for repetition of a captured token. – nhahtdh Feb 01 '13 at 03:46
  • I played around with backreferences in Python regular expression matching, just now, in response to your comment. I found that it fails to check that all the characters are identical; I think the backreference is to the character class, and the repeated backreference just checks to see if the character class keeps matching. So I can't think of any better answer than what I gave: use regular expressions to find runs of punctuation, then use a function to filter only the runs of the same character. – steveha Feb 01 '13 at 04:05
  • @steveha, not sure what you mean? See my answer as example. – Firoze Lafeer Feb 01 '13 at 04:09
  • @FirozeLafeer -- What I tried didn't work. Your answer does work. I just learned something about regular expressions in Python, and I thank you for that. I'll edit my answer to mention yours. – steveha Feb 01 '13 at 04:14
0

This is how I would do it:

>>> import re
>>> st='non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg=r'(([.?#@+])\2+)'
>>> print([m.group(0) for m in re.finditer(reg,st)])

or

>>> print([g for g,l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
dawg