For example, I have the string:

 aacbbbqq

As a result, I want the following matches:

 (aa, c, bbb, qq)  

I know that I can write something like this:

 ([a]+)|([b]+)|([c]+)|...  

But I think it's ugly, and I'm looking for a better solution. I want a regular expression solution, not a self-written finite-state machine.

Andrew

7 Answers

You can match that with: (\w)\1*

Qtax

itertools.groupby is not a regexp, but it's not self-written either. :-) A quote from the Python docs:

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
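Since groupby yields each run as an iterator of characters, the runs can be joined back into strings. A minimal sketch, assuming the sample input from the question; the helper name `runs` is hypothetical and not part of itertools:

```python
from itertools import groupby

def runs(text):
    """Split text into maximal runs of consecutive identical characters."""
    # groupby yields (key, group_iterator) for each run of equal items;
    # joining the group turns the characters back into one string.
    return [''.join(group) for _key, group in groupby(text)]

print(runs('aacbbbqq'))  # ['aa', 'c', 'bbb', 'qq']
```

Note that a character which reappears later starts a new run, so 'aaaabbbaaa' gives three groups rather than two.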
DrTyrsa
@Kobi `aaaa bbb aaa`, as expected. Btw, it returns a list of lists, but that shouldn't be a problem. :-) – DrTyrsa Jun 10 '11 at 12:58

Generally

The trick is to match a single char of the range you want, and then make sure you match all repetitions of the same character:

>>> import re
>>> matcher = re.compile(r'(.)\1*')

This matches any single character (.) and then its repetitions (\1*), if any. Note that . does not match a newline unless re.DOTALL is set.

For your input string, you can get the desired output as:

>>> [match.group() for match in matcher.finditer('aacbbbqq')]
['aa', 'c', 'bbb', 'qq']

NB: because of the match group, re.findall won't work correctly.
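To illustrate that caveat, the sketch below compares the two calls on the question's sample string. With exactly one capturing group in the pattern, re.findall returns the text of that group rather than the whole match. The variable names are only for illustration:

```python
import re

matcher = re.compile(r'(.)\1*')
text = 'aacbbbqq'

# With exactly one capturing group, findall returns that group's text,
# i.e. only the first character of each run.
first_chars = matcher.findall(text)

# finditer yields match objects, so group() returns the whole run.
whole_runs = [m.group() for m in matcher.finditer(text)]

print(first_chars)  # ['a', 'c', 'b', 'q']
print(whole_runs)   # ['aa', 'c', 'bbb', 'qq']
```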

Other ranges

If you don't want to match just any character, replace the . in the regular expression accordingly:

>>> matcher= re.compile(r'([a-z])\1*') # only lower case ASCII letters
>>> matcher= re.compile(r'(?i)([a-z])\1*') # only ASCII letters
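>>> matcher= re.compile(r'(\w)\1*') # letters, digits or underscores (ASCII only in Python 2; Unicode-aware for str patterns in Python 3)
>>> matcher= re.compile(r'(?u)(\w)\1*') # against unicode values, any letter or digit known to Unicode, or underscore (already the default for str in Python 3)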

Check the latter against u'hello=²²' (Python 2.x) or 'hello=²²' (Python 3.x); the = is not a word character, so it is skipped:

>>> text= u'hello=\xb2\xb2'
>>> print('\n'.join(match.group() for match in matcher.finditer(text)))
h
e
ll
o
²²

What \w matches against non-Unicode strings / byte strings can also depend on the current locale (as set by locale.setlocale) when the re.LOCALE flag is in effect.
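On Python 3 the difference between the two \w modes can be seen directly by passing re.ASCII. This is a minimal sketch, assuming Python 3 and the same sample text as above; the variable names are hypothetical:

```python
import re

text = 'hello=\xb2\xb2'  # 'hello=²²'

# For str patterns, Python 3 treats \w as Unicode-aware by default.
unicode_matcher = re.compile(r'(\w)\1*')
# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_matcher = re.compile(r'(\w)\1*', re.ASCII)

unicode_runs = [m.group() for m in unicode_matcher.finditer(text)]
ascii_runs = [m.group() for m in ascii_matcher.finditer(text)]

print(unicode_runs)  # ['h', 'e', 'll', 'o', '²²']
print(ascii_runs)    # ['h', 'e', 'll', 'o']
```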

tzot

This will work, see a working example here: http://www.rubular.com/r/ptdPuz0qDV

(\w)\1*
Rakesh Sankar

The findall method will work if you capture the back-reference like so:

result = [match[0] + match[1] for match in re.findall(r"(.)(\1*)", string)]  # first character + its repetitions
SwiftsNamesake

If you want to collapse each run of repeated characters into a single character, you can use:

re.sub(r"(\w)\1*", r'\1', 'tessst')

The output would be:

'test'
Wesam Na

You can try something like this:

import re

string = 'aacbbbqq'
result = re.findall(r'((\w)\2*)', string)  # greedy \2* so the outer group captures the whole run
output = [x[0] for x in result]

print(output)

Output will be:

['aa', 'c', 'bbb', 'qq']