How to Split a String from substrings regex patterns with placeholders and keeping the delimiter

Question

Similar questions have been answered, yet I did not find the right one for the following case.

Closely related questions:

How do I use regular expressions in Python with placeholder text?

How to replace placeholders in python strings

In Python, how do I split a string and keep the separators?

I need to split this string to get the following output:

txt = "MCMXLIII"

Needed output:

['M', 'CM', 'XL', 'III']

I tested this:

import re

txt = "MCMXLIII"

print(re.sub('(CM)|(XL)', '❤️{0}❤️|❤️{1}❤️', txt).format("CM", "XL").split('❤️'))

Output:

['M', 'CM', '|', 'XL', '', 'CM', '|', 'XL', 'III']

I also tested this:

import re

txt = "MCMXLIII"


print(re.sub('(CM)|(XL)', '(❤️{0}❤️|❤️{1}❤️)', txt).format("CM", "XL").split('❤️'))

Output:

['M(', 'CM', '|', 'XL', ')(', 'CM', '|', 'XL', ')III']

Finally I tested this:


import re

txt = "MCMXLIII"

print(re.sub('(XL)', '❤️{0}❤️', (re.sub('(CM)', '❤️{0}❤️', txt).format("CM"))).format("XL").split('❤️'))

Output:
['M', 'CM', '', 'XL', 'III']

If it helps, here is a Google Sheets formula that produces the correct output:

=SPLIT(
IF(
REGEXMATCH(A1,"(CM)|(XL)"),REGEXREPLACE(A1, "(CM)|(XL)","❤️$0❤️")),"❤️")

Google Sheets output:

Can't you just take the results of your last Python regex & remove the blank entries? — Scott Hunter, Aug 15 '23 at 14:38
It seems you want to split Roman numbers into thousands, hundreds, tens and ones. Right? Then, it makes sense to write a matching/extracting regex for that. — Wiktor Stribiżew, Aug 15 '23 at 14:40
@ScottHunter thanks for the suggestion, it seems a good one. But I'm curious whether there is a better or simpler way to do it, as my last test seems convoluted. — Lod, Aug 15 '23 at 14:41
@WiktorStribiżew thanks for the suggestion. I'll look into the match approach asap. — Lod, Aug 15 '23 at 14:44
After many attempts, here is the correct output: Online Compiler: https://onecompiler.com/python/3zhn8h532 Code: https://pastecode.io/s/mx1s7ov0 I would still be very grateful for a simpler way without all those embedded clauses. — Lod, Aug 15 '23 at 21:33
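Following the suggestion in the comments to drop the blank entries, here is a minimal sketch of one simpler approach. It assumes the only delimiters of interest are CM and XL, as in the question; the helper name split_keep is hypothetical. Wrapping the pattern in a capturing group makes re.split return the delimiters too, so no placeholder character is needed.

```python
import re

def split_keep(txt, pattern="CM|XL"):
    """Split txt on pattern, keeping the matched delimiters (hypothetical helper)."""
    # A capturing group around the pattern makes re.split keep the delimiters.
    parts = re.split(f"({pattern})", txt)
    # Two adjacent delimiters leave an empty string between them; drop it.
    return [part for part in parts if part]

print(split_keep("MCMXLIII"))  # ['M', 'CM', 'XL', 'III']
```

Without the filter, re.split returns ['M', 'CM', '', 'XL', 'III'], the same list as the last attempt in the question.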

score 2 · Answer 1 · answered Aug 15 '23 at 16:01

Taking @Wiktor's hint, here is a regex that parses a well-formed Roman numeral into four capture groups holding the thousands, hundreds, tens and ones places, respectively, and rejects any malformed Roman numeral.

^(M*)(C?D|D?C{1,3}|CM)?(X?L|L?X{1,3}|XC)?(I?V|V?I{1,3}|IX)?$

All the groups except the thousands follow the same pattern: an alternation whose first option handles the values 4 and 5, the second handles 1-3 and 6-8, and the last handles 9.
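As a rough illustration, this is how the pattern above might be applied in Python. The variable name ROMAN is my own choice; the regex itself is unchanged.

```python
import re

# The pattern from this answer, compiled once for reuse.
ROMAN = re.compile(r"^(M*)(C?D|D?C{1,3}|CM)?(X?L|L?X{1,3}|XC)?(I?V|V?I{1,3}|IX)?$")

m = ROMAN.match("MCMXLIII")
groups = m.groups()
print(groups)  # ('M', 'CM', 'XL', 'III')

# An optional group that does not take part in the match is reported as None.
print(ROMAN.match("MX").groups())  # ('M', None, 'X', None)
```

Use list(groups) if a list is required, as in the question.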

It has one minor flaw: it also matches the empty string as a Roman numeral, but you can check for that in your Python code before or after applying this regex.
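One possible way to add that check is sketched below. The function name split_roman and the choice to raise ValueError are assumptions for illustration, not part of the regex itself.

```python
import re

ROMAN = re.compile(r"^(M*)(C?D|D?C{1,3}|CM)?(X?L|L?X{1,3}|XC)?(I?V|V?I{1,3}|IX)?$")

def split_roman(txt):
    """Return the non-empty place groups of a Roman numeral (hypothetical helper)."""
    # Reject the empty string up front, since the regex alone would accept it.
    match = ROMAN.match(txt) if txt else None
    if match is None:
        raise ValueError(f"not a well-formed Roman numeral: {txt!r}")
    # Skipped groups are None and the thousands group may be '', so filter both.
    return [group for group in match.groups() if group]

print(split_roman("MCMXLIII"))  # ['M', 'CM', 'XL', 'III']
print(split_roman("XL"))        # ['XL']
```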

Here are a few examples from regex101:

Many thanks for that idea and test. Very interesting and worth studying. — Lod, Aug 15 '23 at 20:35
Rather than posting a picture of regex101 output, just provide a link to that page. — Nick, Aug 16 '23 at 00:53
