For instance, I'd like to convert "91234 5g5567\t7₇89^" into ["9","1","2","3","4 5g55","67\t7₇8","9^"]. Of course this can be done in a for loop without any regular expressions, but I want to know whether it can be done with a single regular expression. So far I have found two ways to do it:

>>> import re
>>> def way0(char: str):
...     delimiter = ""
...     while True:
...         delimiter += " "
...         if delimiter not in char:
...             substitution = re.compile("([0-9])(?!\\1)([0-9])")
...             replacement = "\\1"+delimiter+"\\2"
...             cin = [char]
...             while True:
...                 cout = []
...                 for term in cin: cout.extend(substitution.sub(replacement,term).split(delimiter))
...                 if cout == cin:
...                     return cin
...                 else:
...                     cin = cout
...
>>> way0("91234 5g5567\t7₇89^")
['9', '1', '2', '3', '4 5g55', '67\t7₇8', '9^']
>>> import itertools
>>> way1 = lambda w: ["".join(list(y)) for x, y in itertools.groupby(re.split("(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)", w), lambda z: z != "") if x]
>>> way1("91234 5g5567\t7₇89^")
['9', '1', '2', '3', '4 5g55', '67\t7₇8', '9^']

However, neither way0 nor way1 is concise or ideal. I have read the documentation for re.split; unfortunately, the following code does not return the desired output:

>>> re.split(r"(\d)(?!\1)(\d)","91234 5g5567\t7₇89^")
['', '9', '1', '', '2', '3', '4 5g5', '5', '6', '7\t7₇', '8', '9', '^']

Can re.split solve this problem directly (that is, without extra conversions)? (Efficiency is not my concern here.)

There are earlier questions on this topic (for example, Regular expression of two digit number where two digits are not same, Regex to match 2 digit but different numbers, and Regular expression to match sets of numbers that are not equal nor reversed), but they are about matching ("RegMatch"). My question is about splitting ("RegSplit") rather than "RegMatch" or "RegReplace".
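
For comparison, the plain-loop version mentioned at the top might look like the minimal sketch below. The function name and the sample string are made up for illustration, and only the ASCII digits 0-9 are treated as digits.

```python
def split_between_different_digits(text):
    """Split text at every position that lies between two different ASCII digits."""
    digits = "0123456789"
    parts = []
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        # Cut only when both neighbours are digits and they differ.
        if prev in digits and cur in digits and prev != cur:
            parts.append(text[start:i])
            start = i
    parts.append(text[start:])  # the remaining tail (the whole string if no cut was made)
    return parts


sample = "112 3a445"  # hypothetical sample input
result = split_between_different_digits(sample)
print(result)  # ['11', '2 3a44', '5']
```

Any regex-based solution should give the same list for the same input.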

user688486
  • A two step idea: [`re.sub(r"(?<=([0-9]))(?=[0-9])(?!\1)","-",s).split("-")`](https://tio.run/##K6gsycjPM/7/PzO3IL@oRKEolYurWMFWQcnS0MjYRME03dTU7P2e6TEl5o@a2i0s45S4uIpSQQqKUvWKS5M0ipQ07G1sNaINdC1jNTU17G0hLA17xRhDTSUdJV0lnWJNveKCnMwSDSBHk4uroCgzr0QDaIjm//8A) – bobble bubble Aug 15 '23 at 10:47
  • If you want to use only split and no capturing group, this would probably work, but I doubt it's more efficient: [`re.split(r"(?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])", s)`](https://tio.run/##Jcy9DQIhGADQnimQChI13PGfSBjEn44oid4RoDH5KhtHcBmncQFHQI39y8vXdpon0Xu65Lk0XCJCFXtM3DAKidVRKf1@PnbNvG536w4EoRJ/oMR1zefUaCE0@C1fuT2jYeNpWHAOwwDjCEKAlKAUaA3GgLXgHPtTssSVIZRLmr5FrKz3Dw) – bobble bubble Aug 15 '23 at 11:03
  • Another possibility: `[m.group(0) for m in re.finditer(r'.*?(?:([0-9])(?=[0-9])(?!\1)|.$)', s)]` – Nick Aug 15 '23 at 13:08
  • @bobblebubble Many thanks. The latter way is just what I need. (Surprisingly, when I execute the analogues in another programming language, the "two step idea" is less efficient instead.) – user688486 Aug 20 '23 at 06:30
  • @Nick Also thanks. It appears that the indirect ``re.finditer`` is more useful than the direct ``re.findall`` here. Right? – user688486 Aug 20 '23 at 06:35
  • @user688486 it's a little more difficult to use `re.findall` because of the capturing group needed for the negative lookahead, although it is still doable if you add a second capture group to the regex, e.g. `[t[0] for t in re.findall(r'(.*?(?:([0-9])(?=[0-9])(?!\2)|.$))', s)]` – Nick Aug 20 '23 at 07:19
  • @user688486 You're welcome! I put this as an answer, glad it helped. :) Also, Nick's idea looks very nice; maybe you will get more answers. – bobble bubble Aug 20 '23 at 13:04
  • @Nick This way is really amazing. I never knew these before. Thanks again! – user688486 Aug 20 '23 at 13:30

2 Answers

If you want to solve this with re.split in one step, without capturing groups or any further processing, one idea is to use only lookarounds and, inside the lookbehind, a negative lookahead that disallows two identical digits.

(?=[0-9])(?<=(?!00|11|22|33|44|55|66|77|88|99)[0-9])

See this demo at regex101 or the Python demo at tio.run

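It works as follows. Together, the lookarounds match the empty position between two digits: (?=[0-9]) requires a digit right after the position, and the lookbehind requires a digit right before it. Inside the lookbehind, the negative lookahead is evaluated at the start of that preceding digit and rejects the position if the two digits are the same.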
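
As a minimal sketch of the idea, the alternation can also be generated instead of typed out. The variable names and the sample string below are made up for illustration. Note that re.split only splits on empty matches in Python 3.7 and later.

```python
import re

# Build the alternation 00|11|...|99 rather than writing it by hand.
same_pair = "|".join(d * 2 for d in "0123456789")

# Zero-width match: a digit ahead, a digit behind, and the two are not identical.
boundary = re.compile(rf"(?=[0-9])(?<=(?!{same_pair})[0-9])")

sample = "112 3a445"  # hypothetical sample input
parts = boundary.split(sample)
print(parts)  # ['11', '2 3a44', '5']
```

The lookbehind stays fixed-width (one character) because the negative lookahead inside it consumes nothing, which is why Python's re module accepts it.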

I used [0-9] rather than \d because I was unsure whether \d matches Unicode digits in your Python version.
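
To illustrate the difference, here is a small sketch with a made-up sample string. In Python 3, \d in a str pattern matches any Unicode decimal digit unless re.ASCII is set, whereas [0-9] only ever matches the ASCII digits.

```python
import re

# Hypothetical sample: ASCII 1, ARABIC-INDIC DIGIT TWO, ASCII 3, SUBSCRIPT SEVEN
mixed = "1" + "\u0662" + "3" + "\u2087"

unicode_digits = re.findall(r"\d", mixed)            # Unicode decimal digits
ascii_class = re.findall(r"[0-9]", mixed)            # ASCII digits only
ascii_flag = re.findall(r"\d", mixed, re.ASCII)      # \d restricted to ASCII

print(unicode_digits)  # ['1', '\u0662', '3'], printed with the actual Arabic-Indic character
print(ascii_class)     # ['1', '3']
print(ascii_flag)      # ['1', '3']
```

The subscript seven is not matched by either form, because it is not a decimal digit in the Unicode sense.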

bobble bubble

You can solve this with re.finditer or re.findall, although it is a little more complicated with findall because of the capture group required for the negative lookahead (findall returns the contents of capture groups in its result).

s = "91234 5g5567\t7₇89^"

# re.finditer
[m.group(0) for m in re.finditer(r'.*?(?:([0-9])(?=[0-9])(?!\1)|.$)', s)]

# re.findall
[t[0] for t in re.findall(r'(.*?(?:([0-9])(?=[0-9])(?!\2)|.$))', s)]

In both cases the answer is

['9', '1', '2', '3', '4 5g55', '67\t7₇8', '9^']

Python demo at tio.run
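
For readers who want to see how the finditer pattern is put together, here is the same regex written with re.VERBOSE and comments. This is only an annotated sketch; the variable names and the sample string are made up for illustration.

```python
import re

chunk = re.compile(
    r"""
    .*?              # as few characters as possible, up to ...
    (?:
        ([0-9])      # ... a digit (group 1)
        (?=[0-9])    # that is followed by another digit
        (?!\1)       # which is not the same as group 1
      |
        .$           # or, failing that, the last character of the string
    )
    """,
    re.VERBOSE,
)

sample = "112 3a445"  # hypothetical sample input
chunks = [m.group(0) for m in chunk.finditer(sample)]
print(chunks)  # ['11', '2 3a44', '5']
```

Since `.` does not match a newline by default, input that contains newlines would presumably need the re.DOTALL flag as well.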

Nick