I want to split a string by boundaries between non-repeating characters with Python. I wrote this regex:

(?<=(.))(?!\1)

So I expect "aaab447777BBBBbbb" to be split into ['aaa', 'b', '44', '7777', 'BBBB', 'bbb'].

I used the same regex in Java and got the desired result. Unfortunately, this does not work in Python. When I try

re.split('(?<=(.))(?!\\1)', string)

the result is

['aaa', 'a', 'b', 'b', '44', '4', '7777', '7', 'BBBB', 'B', 'bbb', 'b', '']

When I do

re.findall('(?<=(.))(?!\\1)', string)

it returns

['a', 'b', '4', '7', 'B', 'b']

Why doesn't Python understand the regular expression that Java understands, and how can I solve the problem in Python?

Jonathan Hall
Igor Gindin
  • Why not match them instead? `(.)\1*` I think you can do it like this using split and the regex module: `(?=(.)(?<!(?:\1|^).))` See https://regex101.com/r/FGRDYs/1 – The fourth bird Oct 16 '20 at 12:53
  • Is the question just "why" or do you need a solution/fix? – Wiktor Stribiżew Oct 16 '20 at 13:00
  • Both. I need a solution and am trying to understand why this does not work. BTW, both regexes, mine and The fourth bird's, work on regex101.com but do not work in a Python script. – Igor Gindin Oct 16 '20 at 13:20

2 Answers


If you're open to non-regex solutions, this is a perfect application for itertools.groupby:

>>> from itertools import groupby
>>> [''.join(g) for k, g in groupby('aaab447777BBBBbbb')]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
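
To see why this works, it may help to look at what groupby actually yields: one (key, group) pair per run of consecutive equal items. The following is only a small sketch; the variable names `text`, `runs` and `pieces` are made up for illustration.

```python
from itertools import groupby

text = 'aaab447777BBBBbbb'

# groupby yields one (key, group-iterator) pair per run of equal,
# adjacent characters; here each run is reduced to (char, length).
runs = [(key, len(list(group))) for key, group in groupby(text)]
print(runs)

# Joining each group back together gives the pieces the question asks for.
pieces = [''.join(group) for _, group in groupby(text)]
print(pieces)
```

Note that groupby only merges adjacent equal items, which is why the lowercase 'b' appears twice in the result.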
Cory Kramer

The "why" is simple: in Python, as also in Perl and Ruby and JavaScript, using a capture group in the pattern passed to split means that you want whatever is captured there to be included in the returned array. This is useful when you want to allow multiple delimiters but be able to tell which was used in each position. It does, however, mean that you get extra results if you're trying to do something fancy like your example. Your regex has to capture each repeated character in order to detect its repetition, but split can't tell that those captures aren't for its benefit, so it includes those single-character strings in the returned array.
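
A minimal illustration of that behaviour, using a made-up input string and a simple delimiter pattern rather than the regex from the question:

```python
import re

text = 'a,b;c'

# No capture group: the delimiters are dropped from the result.
plain = re.split(r'[,;]', text)
print(plain)

# Capture group: whatever the group matched is inserted between the pieces.
captured = re.split(r'([,;])', text)
print(captured)

# A non-capturing group (?:...) groups without adding anything to the result.
non_capturing = re.split(r'(?:,|;)', text)
print(non_capturing)
```

In the question's regex the group cannot simply be made non-capturing, because the backreference `\1` needs it.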

The result of your split is completely predictable, though: the returned list alternates between the sections you want and the single character that each section repeats. So you can always take just the even-indexed elements to get your desired result (plus a trailing empty string, because the pattern also matches at the very end of the string):

>>> re.split(r'(?<=(.))(?!\1)', string)[::2]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb', '']

(It's a good idea to use "raw" strings (r'...') for regexes so you don't have to double all your backslashes and quadruple all your backslashed backslashes...)
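
As a quick check of the raw-string point, here is a small sketch comparing the three ways of writing the pattern (the variable names are just for illustration):

```python
# The doubled backslash in a normal string and the single backslash in a
# raw string produce exactly the same pattern text.
escaped = '(?<=(.))(?!\\1)'
raw = r'(?<=(.))(?!\1)'
print(escaped == raw)

# Without the r prefix or the doubling, Python reads '\1' as an octal
# escape, so the string contains the control character chr(1) rather
# than a backslash followed by the digit 1.
unescaped = '(?<=(.))(?!\1)'
print(unescaped == raw)
print(repr(unescaped[-2]))
```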

But that combination of positive lookbehind and negative lookahead with split seems overly complex for what you're doing here; that's the sort of thing you normally only resort to in Java when trying to emulate the "capture delimiters" behavior from these other languages. I think something like this would be easier to understand:

>>> [m[0] for m in re.finditer(r'(.)\1*', string)]
['aaa', 'b', '44', '7777', 'BBBB', 'bbb']
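
If you need this in more than one place, the same idea could be wrapped in a small helper. This is only a sketch: the name `split_runs` is made up, and the re.DOTALL flag is an addition that assumes you also want runs of newlines to be grouped (without it, `.` skips newline characters).

```python
import re

# (.) captures one character, \1* extends the match over its repetitions.
# re.DOTALL lets '.' match newlines too (an assumption, see above).
_RUN = re.compile(r'(.)\1*', re.DOTALL)

def split_runs(text):
    """Return the maximal runs of identical consecutive characters."""
    return [m.group(0) for m in _RUN.finditer(text)]

print(split_runs('aaab447777BBBBbbb'))
print(split_runs(''))
```

Unlike the split-based version, this never produces a trailing empty string, and an empty input gives an empty list.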
Mark Reed