
Similar questions have been answered, yet I did not find the right one for the following case.

Closely related questions:

How do I use regular expressions in Python with placeholder text?

How to replace placeholders in python strings

In Python, how do I split a string and keep the separators?

I need to split this string to get the following output:

txt = "MCMXLIII"

Needed output:

['M', 'CM', 'XL', 'III']

I tested this:

import re

txt = "MCMXLIII"

print(re.sub('(CM)|(XL)', '❤️{0}❤️|❤️{1}❤️', txt).format("CM", "XL").split('❤️'))

Output:

['M', 'CM', '|', 'XL', '', 'CM', '|', 'XL', 'III']

I also tested this:

import re

txt = "MCMXLIII"


print(re.sub('(CM)|(XL)', '(❤️{0}❤️|❤️{1}❤️)', txt).format("CM", "XL").split('❤️'))

Output:

['M(', 'CM', '|', 'XL', ')(', 'CM', '|', 'XL', ')III']

Finally I tested this:


import re

txt = "MCMXLIII"

print(re.sub('(XL)', '❤️{0}❤️', (re.sub('(CM)', '❤️{0}❤️', txt).format("CM"))).format("XL").split('❤️'))

Output:
['M', 'CM', '', 'XL', 'III']
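
For comparison, here is a minimal sketch of a shorter route, following the linked question on splitting while keeping the separators. It assumes only the two subtractive pairs used above (CM and XL) need to be kept together; the helper name `split_keeping_tokens` is made up for illustration. `re.split` keeps the separators when the pattern is wrapped in a capturing group, and the list comprehension drops the empty string that appears between two adjacent separators.

```python
import re

def split_keeping_tokens(text, pattern=r"(CM|XL)"):
    """Split text on the pattern, keeping the matched separators.

    The capturing group makes re.split return the separators too;
    empty strings (produced between adjacent separators) are filtered out.
    """
    return [part for part in re.split(pattern, text) if part]

txt = "MCMXLIII"
print(split_keeping_tokens(txt))  # ['M', 'CM', 'XL', 'III']
```

This needs no placeholder character and no `.format()` call, but it is still tied to the hard-coded CM and XL pairs.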

In case it helps, here is a Google Sheets formula that produces the correct output:

=SPLIT(
IF(
REGEXMATCH(A1,"(CM)|(XL)"),REGEXREPLACE(A1, "(CM)|(XL)","❤️$0❤️")),"❤️")

Google Sheets output:

Lod
  • Can't you just take the results of your last Python regex and remove the blank entries? – Scott Hunter Aug 15 '23 at 14:38
  • It seems you want to split Roman numerals into thousands, hundreds, tens and ones. Right? Then it makes sense to write a matching/extracting regex for that. – Wiktor Stribiżew Aug 15 '23 at 14:40
  • @ScottHunter Thanks for the suggestion; it seems a good one. But I'm curious about a better or simpler way to do it, as my last test seems a convoluted approach. – Lod Aug 15 '23 at 14:41
  • @WiktorStribiżew Thanks for the suggestion. I'll look into the match approach asap. – Lod Aug 15 '23 at 14:44
  • After many attempts, here is the correct output. Online compiler: https://onecompiler.com/python/3zhn8h532 Code: https://pastecode.io/s/mx1s7ov0 I would still be very grateful for a simpler way without all those embedded clauses. – Lod Aug 15 '23 at 21:33

1 Answer

Taking @Wiktor's hint, here is a regex that parses a well-formed Roman numeral into 4 capture groups containing the thousands, hundreds, tens and ones places, respectively, and rejects any malformed Roman numeral.

^(M*)(C?D|D?C{1,3}|CM)?(X?L|L?X{1,3}|XC)?(I?V|V?I{1,3}|IX)?$

All the groups except the thousands follow the same pattern: an alternation whose first option handles the values 4 and 5, the second handles 1-3 and 6-8, and the last handles 9.
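
To make the three alternatives concrete, here is a small sketch that tests each branch of the tens group separately. The names `TENS_BRANCHES` and `matching_branch` are invented for this illustration; the sub-patterns are copied from the regex above.

```python
import re

# The three alternatives of the tens group, labelled by the digit values they cover.
TENS_BRANCHES = [
    ("4 and 5", r"X?L"),
    ("1-3 and 6-8", r"L?X{1,3}"),
    ("9", r"XC"),
]

def matching_branch(numeral):
    """Return the label of the branch that matches the whole numeral, or None."""
    for label, pattern in TENS_BRANCHES:
        if re.fullmatch(pattern, numeral):
            return label
    return None

# The nine valid tens spellings, plus one malformed case (XXXX).
for numeral in ["X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC", "XXXX"]:
    print(numeral, "->", matching_branch(numeral))
```

The hundreds and ones groups behave the same way with C/D/M and I/V/X in place of X/L/C.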

It has one minor flaw: it also accepts an empty string as a Roman numeral, but you can check for that in your Python code before or after applying the regex.
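
As a rough sketch of how this could be used from Python, assuming the goal is the list from the question: the function name `split_roman` is hypothetical, and returning `None` for malformed or empty input is just one possible convention.

```python
import re

# The regex from above: thousands, hundreds, tens, ones.
ROMAN = re.compile(
    r"^(M*)(C?D|D?C{1,3}|CM)?(X?L|L?X{1,3}|XC)?(I?V|V?I{1,3}|IX)?$"
)

def split_roman(numeral):
    """Return the non-empty place-value parts of a Roman numeral.

    Returns None for an empty string or a malformed numeral.
    """
    if not numeral:
        return None  # explicit guard: the regex alone would accept ""
    match = ROMAN.match(numeral)
    if match is None:
        return None
    # Unmatched optional groups are None and an empty thousands group is "",
    # so a truthiness filter drops both.
    return [part for part in match.groups() if part]

print(split_roman("MCMXLIII"))  # ['M', 'CM', 'XL', 'III']
print(split_roman("XIV"))       # ['X', 'IV']
print(split_roman("IIII"))      # None
```

Because the groups are positional, `match.groups()` can also be used unfiltered when you need to know which place each part belongs to.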

Here are a few examples from regex101: [image not included]

Chris Maurer