EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.
My original answer, written before I saw Firoze Lafeer's, is below, unchanged.
A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here:
How to check that a string is a palindrome using regular expressions?
For a Python solution I would recommend combining a regular expression with a little bit of Python code. The regular expression throws out everything that isn't a run of some sort of punctuation, and then the Python code checks to throw out false matches (matches that are runs of punctuation but not all the same character).
import re
import string
# Character class to match punctuation. Characters such as '-', ']',
# '^', and '\' are special inside a character class, so re.escape()
# puts a backslash in front of them to make them literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"
# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]

# alternate version of find_all_punct_runs() using re.finditer()
def find_all_punct_runs(text):
    matches = (m.group(0) for m in _pat_punct_run.finditer(text))
    return (s for s in matches if all_same(s, False))
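To show the two stages separately, here is a small self-contained sketch. The sample string and the names `candidates` and `runs` are made up for illustration; the pattern and `all_same()` are the same as above.

```python
import re
import string

# Same pattern as above: a run of two or more punctuation characters.
punct = "[" + re.escape(string.punctuation) + "]"
pat_punct_run = re.compile(punct + punct + "+")

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

# Hypothetical sample text, chosen only for illustration.
sample = "Wait... what?! Really!!! a--b"

# Stage 1: the regular expression finds every run of punctuation,
# including mixed runs such as "?!".
candidates = pat_punct_run.findall(sample)

# Stage 2: the Python filter keeps only runs of one repeated character.
runs = [s for s in candidates if all_same(s, False)]

print(candidates)  # ['...', '?!', '!!!', '--']
print(runs)        # ['...', '!!!', '--']
```

One consequence of this design: a mixed run such as "!!??" is matched as a single candidate and then discarded whole, so the "!!" and "??" inside it are not reported.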
I wrote all_same() the way I did so that it will work just as well on an iterator as on a string. The Python built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I made an argument for the basis case desired and made it default to True to match the behavior of all().
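A quick sketch of how the basis case behaves, using the same all_same() as above. The variable names holding the results are only for illustration.

```python
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        # Empty input: the caller decides what the answer should be.
        return basis_case
    return all(x == first for x in itr)

# The built-in all() is True for an empty iterable.
builtin_on_empty = all([])

# By default all_same() matches that behavior on empty input...
default_on_empty = all_same("")
# ...but the caller can ask for False instead.
strict_on_empty = all_same("", False)

# Strings and arbitrary iterators are handled the same way.
same_string = all_same("!!!")
mixed_string = all_same("?!")
same_iterator = all_same(iter("---"))
mixed_generator = all_same(c for c in "a-")

print(builtin_on_empty, default_on_empty, strict_on_empty)
print(same_string, mixed_string, same_iterator, mixed_generator)
```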
This does as much of the work as possible using the internals of Python (the regular expression engine or all()) so it should be pretty fast. For large input texts you might want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(), as the alternate version above does. That version also returns a generator expression rather than a list. You can always force it to make a list:
lst = list(find_all_punct_runs(text))
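For comparison, the single-regular-expression approach mentioned in the edit at the top can be sketched with a backreference. This is my guess at the general technique, not necessarily the exact pattern from that answer; the name `pat_same_run` and the sample text are hypothetical.

```python
import re
import string

# One punctuation character, captured as group 1, followed by one or
# more repetitions of that same character (the \1 backreference).
pat_same_run = re.compile("([" + re.escape(string.punctuation) + r"])\1+")

# Hypothetical sample text, chosen only for illustration.
sample = "Wait... what?! Really!!! a--b"

# finditer() is used because findall() would return only the captured
# group (a single character) rather than the whole run.
runs = [m.group(0) for m in pat_same_run.finditer(sample)]
mixed = [m.group(0) for m in pat_same_run.finditer("!!??")]

print(runs)   # ['...', '!!!', '--']
print(mixed)  # ['!!', '??']
```

Unlike the filtering approach, this splits a mixed run such as "!!??" into "!!" and "??" instead of discarding it.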