Python & regex to pull out headers from a docx with roman numerals

Question

I have a large docx file that has the below interspersed throughout:

PART I
PART II
PART III
PART IIIA  # part 3, section A
PART IV
PART V
PART VI
PART VII

I'm trying to write a regex in Python that will pull these out, with re.match, re.findall, or re.search. I can't figure out the correct regex syntax to pull out only the above, and nothing on SO or anywhere else gives an example of correctly pulling out roman numerals.

Many examples on how to convert / validate, but nothing on simple regex matching. I was going off of this:

[PART].*\s[I]|[II]|[III]|[IIIA]|[IV]|[V]|[VI]|[VII]

or

[PART].*\s(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$

But that doesn't work - I'm messing up the "or" part. Best other SO article I could find

Did you check https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression like for example https://regex101.com/r/yhEbuV/2 — The fourth bird, Mar 04 '19 at 20:08

blhsing · Accepted Answer · 2019-03-04T20:03:52.677

Characters inside square brackets match just one of the listed characters, so in your case you shouldn't put PART inside square brackets. You also don't need $ at the end because you're trying to match a substring in a bigger string.
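To make the point about square brackets concrete, here is a minimal sketch. The sample string and variable names are made up for illustration and are not from the question.

```python
import re

# Illustrative sample, not taken from the original document.
sample = "PART IV"

# A character class matches ONE character from the set, so [PART]
# finds each letter separately rather than the word "PART".
class_matches = re.findall(r'[PART]', sample)

# Without brackets, PART is matched as a literal four-letter sequence.
literal_matches = re.findall(r'PART', sample)

# Likewise, [IV] means "I or V" (a single character), not the numeral IV.
numeral_class = re.findall(r'[IV]', sample)

print(class_matches)    # ['P', 'A', 'R', 'T']
print(literal_matches)  # ['PART']
print(numeral_class)    # ['I', 'V']
```

This is why the first attempt in the question behaves unexpectedly: each bracketed piece there matches a single letter.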

Assuming your input string is stored in variable s, the following call to re.findall should return all such occurrences in a list:

re.findall(r'PART\s+(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', s)

Demo: https://regex101.com/r/NGdyw3/2
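For reference, a small runnable sketch of the call above against the headers listed in the question. The sample string `s` and the names `ROMAN`, `headers` and `headers_with_suffix` are assumptions for illustration. Note that the pattern as written stops at the numeral, so PART IIIA comes back as PART III; the second call is one possible (hypothetical) extension that allows a single trailing capital letter.

```python
import re

# Sample text standing in for the document contents (an assumption).
s = "PART I\nPART II\nPART III\nPART IIIA\nPART IV\nPART V\nPART VI\nPART VII"

# The roman numeral part of the pattern from the answer, unchanged.
ROMAN = r'(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})'

# Every group is non-capturing, so findall returns the whole match,
# including the word PART.
headers = re.findall(r'PART\s+' + ROMAN, s)

# Hypothetical extension: allow one trailing capital letter (as in IIIA)
# and require a word boundary so the match cannot stop mid-word.
headers_with_suffix = re.findall(r'PART\s+' + ROMAN + r'[A-Z]?\b', s)

print(headers)
# ['PART I', 'PART II', 'PART III', 'PART III', 'PART IV', 'PART V', 'PART VI', 'PART VII']
print(headers_with_suffix)
# ['PART I', 'PART II', 'PART III', 'PART IIIA', 'PART IV', 'PART V', 'PART VI', 'PART VII']
```

Whether a suffix letter should be accepted depends on how the headers actually appear in your file, so treat the second pattern as a starting point rather than a fixed rule.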

edited Mar 04 '19 at 20:03

answered Mar 04 '19 at 19:51


Awesome - but how do I get it to return `PART` along with the roman numerals? I'm using `re.findall` and it only shows the roman numerals. Curious as to how the query would be edited – papelr Mar 04 '19 at 20:01
I see. Try using `re.findall` and change all capturing groups to non-capturing ones. I've updated my answer accordingly then. – blhsing Mar 04 '19 at 20:04
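The behaviour discussed in these comments can be reproduced with a shortened pattern. This is a sketch only: the sample string and the simplified numeral part (`IV|V?I{0,3}`) are assumptions chosen to keep the example small.

```python
import re

# Illustrative sample, not taken from the original document.
s = "PART IV\nPART VII"

# With a capturing group, findall returns only the group's text.
capturing = re.findall(r'PART\s+(IV|V?I{0,3})', s)

# With a non-capturing group (?:...), findall returns the whole match.
non_capturing = re.findall(r'PART\s+(?:IV|V?I{0,3})', s)

# Alternative: keep the capturing group and read the full match
# from each match object via finditer.
full_matches = [m.group(0) for m in re.finditer(r'PART\s+(IV|V?I{0,3})', s)]

print(capturing)      # ['IV', 'VII']
print(non_capturing)  # ['PART IV', 'PART VII']
print(full_matches)   # ['PART IV', 'PART VII']
```

The `finditer` variant is handy when you want both the full header and the numeral on its own, since `m.group(1)` is still available.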
