Separate a string between each two neighbouring different digits via `re.split` DIRECTLY (in Python)?

Question

For instance, I'd like to convert "91234 5g556７\t7₇89^" into ["9","1","2","3","4 5g55","6７\t7₇8","9^"]. Of course this can be done in a for loop without any regular expressions, but I want to know whether it can be done with a single regular expression. So far I have found two ways to do it:

>>> import re
>>> def way0(char: str):
...     delimiter = ""
...     while True:
...         delimiter += " "
...         if delimiter not in char:
...             substitution = re.compile("([0-9])(?!\\1)([0-9])")
...             replacement = "\\1"+delimiter+"\\2"
...             cin = [char]
...             while True:
...                 cout = []
...                 for term in cin: cout.extend(substitution.sub(replacement,term).split(delimiter))
...                 if cout == cin:
...                     return cin
...                 else:
...                     cin = cout
...
>>> way0("91234 5g556７\t7₇89^")
['9', '1', '2', '3', '4 5g55', '6７\t7₇8', '9^']
>>> import itertools
>>> way1 = lambda w: ["".join(list(y)) for x, y in itertools.groupby(re.split("(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)", w), lambda z: z != "") if x]
>>> way1("91234 5g556７\t7₇89^")
['9', '1', '2', '3', '4 5g55', '6７\t7₇8', '9^']

However, neither way0 nor way1 is concise or ideal. I have read the help page of re.split; unfortunately, the following code does not return the desired output:

>>> re.split(r"(\d)(?!\1)(\d)","91234 5g556７\t7₇89^")
['', '9', '1', '', '2', '3', '4 5g5', '5', '6', '７\t7₇', '8', '9', '^']

Can re.split solve this problem directly (that is, without extra conversions)? (Efficiency is not my concern here.)

There are earlier questions on this topic (for example, Regular expression of two digit number where two digits are not same, Regex to match 2 digit but different numbers, and Regular expression to match sets of numbers that are not equal nor reversed), but they are about "RegMatch". My question is about "RegSplit" (rather than "RegMatch" or "RegReplace").

A two step idea: [`re.sub(r"(?<=([0-9]))(?=[0-9])(?!\1)","-",s).split("-")`](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEolYurWMFWQcnS0MjYRME03dTU7P2e6TEl5o@a2i0s45S4uIpSQQqKUvWKS5M0ipQ07G1sNaINdC1jNTU17G0hLA17xRhDTSUdJV0lnWJNveKCnMwSDSBHk4uroCgzr0QDaIjm//8A) — bobble bubble, Aug 15 '23 at 10:47
If you want to use only split and don't use a capturing group, this would probably work but I doubt it's more efficient: [`re.split(r"(?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])", s)`](https://tio.run/##Jcy9DQIhGADQnimQChI13PGfSBjEn44oid4RoDH5KhtHcBmncQFHQI39y8vXdpon0Xu65Lk0XCJCFXtM3DAKidVRKf1@PnbNvG536w4EoRJ/oMR1zefUaCE0@C1fuT2jYeNpWHAOwwDjCEKAlKAUaA3GgLXgHPtTssSVIZRLmr5FrKz3Dw) — bobble bubble, Aug 15 '23 at 11:03
Another possibility: `[m.group(0) for m in re.finditer(r'.*?(?:([0-9])(?=[0-9])(?!\1)|.$)', s)]` — Nick, Aug 15 '23 at 13:08
@bobblebubble Many thanks. The latter way is just what I need. (Surprisingly, when I execute the analogues in another programming language, the "two step idea" is less efficient instead.) — user688486, Aug 20 '23 at 06:30
@Nick Also thanks. It appears that the indirect ``re.finditer`` is more useful than the direct ``re.findall`` here. Right? — user688486, Aug 20 '23 at 06:35
@user688486 it's a little more difficult to use `re.findall` because of the capturing group needed for the negative lookahead, although still doable if you add a second capture group into the regex e.g. `[t[0] for t in re.findall(r'(.*?(?:([0-9])(?=[0-9])(?!\2)|.$))', s)]` — Nick, Aug 20 '23 at 07:19
@user688486 You're welcome! I posted this as an answer, glad it helped. :) Nick's idea also looks very nice, maybe you'll get more answers. — bobble bubble, Aug 20 '23 at 13:04
@Nick This way is really amazing. I never knew these before. Thanks again! — user688486, Aug 20 '23 at 13:30

bobble bubble · Accepted Answer · 2023-08-20T13:06:27.950

1

If you want to solve this with re.split in one step, without capture groups or any further processing, one idea is to use only lookarounds and, inside the lookbehind, a negative lookahead that rejects two identical digits.

(?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])

See this demo at regex101 or the Python demo at tio.run

How it works: the lookarounds match the zero-width position between two digits. Inside the lookbehind, the negative lookahead is evaluated one character before that position, so it sees both neighbouring digits and rejects the position if they are the same.
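A minimal sketch of the pattern in use, assuming Python 3.7 or later (where `re.split` also splits on empty matches). The name `PATTERN` is only an illustrative choice:

```python
import re

# Zero-width pattern from this answer; `PATTERN` is an illustrative name.
# (?=[0-9])        -> the next character is an ASCII digit
# (?<=(?!dd)[0-9]) -> the previous character is an ASCII digit, and the
#                     pair starting at it is not a doubled digit
PATTERN = re.compile(r"(?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])")

s = "91234 5g556７\t7₇89^"
parts = PATTERN.split(s)
print(parts)
# ['9', '1', '2', '3', '4 5g55', '6７\t7₇8', '9^']
```

Because the pattern consumes no characters and has no capture groups, the pieces join back to the original string unchanged.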

I used [0-9] rather than \d because I was unsure whether \d matches Unicode digits in your Python version.
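For reference, in Python 3 `\d` in a `str` pattern matches any Unicode decimal digit unless `re.ASCII` is set, so the choice matters for the fullwidth `７` in the sample string. A small check, not part of the original answer, with illustrative variable names:

```python
import re

s = "91234 5g556７\t7₇89^"
fullwidth_seven = "７"  # U+FF17, a Unicode decimal digit
subscript_seven = "₇"  # U+2087, not a decimal digit

print(bool(re.fullmatch(r"\d", fullwidth_seven)))            # True
print(bool(re.fullmatch(r"[0-9]", fullwidth_seven)))         # False
print(bool(re.fullmatch(r"\d", fullwidth_seven, flags=re.ASCII)))  # False
print(bool(re.fullmatch(r"\d", subscript_seven)))            # False

# Same pattern as in the answer, but written with \d.
d_pattern = r"(?=\d)(?<=(?!00|11|22|33|44|55|66|77|88|99)\d)"
unicode_parts = re.split(d_pattern, s)            # also splits between 6 and ７
ascii_parts = re.split(d_pattern, s, flags=re.A)  # behaves like [0-9]
print(unicode_parts)
print(ascii_parts)
```

With `re.A` the result is the same as with `[0-9]`; without it, the fullwidth digit counts as a digit and an extra split appears after the `6`.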

edited Aug 20 '23 at 13:06

answered Aug 20 '23 at 12:52



Thank you. Using `\d` with `flags=re.A` also works, but this only leads to more keystrokes in fact. :) – user688486 Aug 20 '23 at 13:45

score 1 · Answer 2 · answered Aug 20 '23 at 22:35

You can solve this with re.finditer and re.findall, although it is a little more complicated with findall due to the capture group required for a negative lookahead (since findall returns the contents of capture groups in its result).

s = "91234 5g556７\t7₇89^"

# re.finditer
[m.group(0) for m in re.finditer(r'.*?(?:([0-9])(?=[0-9])(?!\1)|.$)', s)]

# re.findall
[t[0] for t in re.findall(r'(.*?(?:([0-9])(?=[0-9])(?!\2)|.$))', s)]

In both cases the answer is

['9', '1', '2', '3', '4 5g55', '6７\t7₇8', '9^']
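To see why the `re.findall` version needs the extra outer group and the `t[0]` indexing, it can help to look at the raw result. This is a sketch with illustrative variable names; when a pattern has several groups, `re.findall` returns one tuple per match, with an empty string for a group that did not participate:

```python
import re

s = "91234 5g556７\t7₇89^"
# Group 1 wraps the whole match; group 2 is the digit used by (?!\2).
pattern = r"(.*?(?:([0-9])(?=[0-9])(?!\2)|.$))"

raw = re.findall(pattern, s)
print(raw[0])   # ('9', '9')
print(raw[-1])  # ('9^', '')  group 2 is unused in the final `.$` branch

pieces = [t[0] for t in raw]
print(pieces)
```

`re.finditer` avoids this because `m.group(0)` is always the whole match, regardless of how many groups the pattern contains.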

Python demo at tio.run
