Replace greedy elements with the appropriate

Question

I have the following list:

import re
l = ['Part I,   Where I’M Coming From',
 'Part Ii,  Life Principles',
 'Part Iii, Work Principles']

I want this result:

l = ['Part I,   Where I’M Coming From',
     'Part II,  Life Principles',
     'Part III, Work Principles']

I tried:

In [19]: [re.sub(r'(?<=I)i+', 'I+', s) for s in l]
Out[19]:
['Part I,   Where I’M Coming From',
 'Part II+,  Life Principles',
 'Part II+, Work Principles']

The output is 'Part II+, Work Principles' rather than 'Part III, Work Principles'.

How can I accomplish this?

@MenglongLi Good question, I've tried to address that in my answer. — cs95, Jan 12 '18 at 02:53

cs95 · Accepted Answer · 2018-01-12T02:54:53.897

One easy way to do this is to use re.sub with a callback function. A callback can handle logic that a plain replacement string cannot. In your case, you need to match every run of lowercase i's that follows a capital I, count how many i's were matched, and replace them with the same number of capital I's.

>>> re.sub('(?<=I)(i+)', lambda x: 'I' * len(x.group()), 'Part Iii,  Work Principles')
'Part III,  Work Principles'

The callback is not invoked (i.e., no replacement occurs) if there was no match.
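To tie this back to the list in the question, here is a minimal sketch that runs both the original attempt and the callback version over all three titles. The names titles, literal, fixed and upper_run are made up for this illustration. It shows that a plain replacement string is inserted as-is (so 'I+' appears literally), whereas a callback can size the replacement to the match.

```python
import re

titles = [
    'Part I,   Where I’M Coming From',
    'Part Ii,  Life Principles',
    'Part Iii, Work Principles',
]

# A plain replacement string is inserted literally: 'I+' is not read
# as "one or more I" on the replacement side.
literal = [re.sub(r'(?<=I)i+', 'I+', t) for t in titles]

# A callback receives the match object, so the replacement can be as
# long as the matched run of lowercase i's.
def upper_run(m):
    return 'I' * len(m.group())

fixed = [re.sub(r'(?<=I)i+', upper_run, t) for t in titles]
print(fixed)
```

The first title has no lowercase i directly after a capital I, so it passes through unchanged.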

If you're interested in a deeper understanding of what happens, here's the same callback as a function, with a couple of print statements.

>>> def replace(m):
...     print(*[m, m.group(), len(m.group())], sep='\n')
...     return 'I' * len(m.group())
... 
>>> re.sub('(?<=I)(i+)', replace, 'Part Iii,  Work Principles')
<_sre.SRE_Match object; span=(6, 8), match='ii'>
ii
2
'Part III,  Work Principles'

You'll notice this prints out...

<_sre.SRE_Match object; span=(6, 8), match='ii'>
ii
2

...in addition to performing the replacement. The important point is that re.sub passes a match object to the callback function. You can then inspect what was matched and decide what to replace it with.

Generalising to Arbitrary Roman Numerals

If you need to match arbitrary Roman numerals, you can pass a pattern that finds them to re.sub, and the callback becomes much simpler:

>>> p = r'\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b'
>>> string = 'Part viiI,  Work Principles'
>>> re.sub(p, lambda x: x.group().upper(), string, flags=re.IGNORECASE)
'Part VIII,  Work Principles'

Now, all you need to do is uppercase the matched string.
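One caveat worth checking on your own data: with re.IGNORECASE, the general pattern also matches ordinary words that happen to spell a valid numeral, such as "mix". The sketch below shows that effect and one possible way around it, assuming the numeral always directly follows the word "Part" and a single space. The name NUMERAL_AFTER_PART and the helper fix_part_numeral are hypothetical, not part of the answer above.

```python
import re

p = r'\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b'

# The unrestricted pattern also uppercases the word "mix" (M + IX).
broad = re.sub(p, lambda m: m.group().upper(),
               'Part ii, a mix of ideas', flags=re.IGNORECASE)

# Assumption: the numeral is the token right after "Part ".
# The lookbehind limits the match to that one position.
NUMERAL_AFTER_PART = re.compile(r'(?<=\bPart )[ivxlcdm]+\b', flags=re.IGNORECASE)

def fix_part_numeral(title):
    return NUMERAL_AFTER_PART.sub(lambda m: m.group().upper(), title)

narrow = fix_part_numeral('Part ii, a mix of ideas')
print(broad)
print(narrow)
```

The narrower pattern does not validate that the token is a well-formed numeral; it only uppercases whatever numeral letters follow "Part ", which may be enough for titles like these.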

Thank you for taking the time to break everything down. — Moondra, Jan 12 '18 at 04:15

score 0 · Answer 2 · answered Jan 12 '18 at 03:23

One option is to simply use re.split, apply str.upper, and then use str.format:

import re
l = ['Part I,   Where I’M Coming From',
'Part Ii,  Life Principles',
'Part Iii, Work Principles']
new_l = [re.split(r'(?<=Part)\s|,\s+', i) for i in l]
final_l = ['{} {},  {}'.format(a, b.upper(), c) for a, b, c in new_l]

Output:

final_l = ['Part I,  Where I’M Coming From',
 'Part II,  Life Principles',
 'Part III,  Work Principles']
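Note that the fixed format string always writes two spaces after the comma, so the original alignment is not kept. If the spacing has to survive, one possible variation uses str.partition instead of re.split. It assumes every title has the shape "Part <numeral>, <rest>", and the function name upper_part_numeral is invented for this sketch.

```python
def upper_part_numeral(title):
    # Split once at the first comma; everything after it is kept verbatim.
    head, sep, rest = title.partition(',')
    if not sep:
        # No comma: not the expected shape, so leave the title alone.
        return title
    # head looks like 'Part Iii'; uppercase only the part after the space.
    label, space, numeral = head.partition(' ')
    return label + space + numeral.upper() + sep + rest

l = ['Part I,   Where I’M Coming From',
     'Part Ii,  Life Principles',
     'Part Iii, Work Principles']
preserved = [upper_part_numeral(s) for s in l]
print(preserved)
```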
