
I have a large docx file that has the below interspersed throughout:

PART I
PART II
PART III
PART IIIA  # part 3, section A
PART IV
PART V
PART VI
PART VII

I'm trying to write a regex in Python that will pull these out with re.match, re.findall, or re.search. I can't figure out the correct regex syntax to pull out only the above, and nothing on SO or anywhere else gives an example of correctly matching Roman numerals.

Many examples on how to convert / validate, but nothing on simple regex matching. I was going off of this:

[PART].*\s[I]|[II]|[III]|[IIIA]|[IV]|[V]|[VI]|[VII] 

or

[PART].*\s(?=[MDCLXVI])M*(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$

But that doesn't work - I'm messing up the "or" part. Best other SO article I could find

Barbaros Özhan
papelr

1 Answer


Square brackets define a character class, which matches exactly one of the listed characters, so in your case you shouldn't put PART inside square brackets. For the same reason, [III] means the same thing as [I]. You also don't need $ at the end, because you're trying to match a substring within a larger string.
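To make the character-class point concrete, here is a small sketch. The sample string and variable names are made up for illustration and are not from the question.

```python
import re

sample = 'PART IV'  # illustrative input, not from the original document

# [PART] is a character class: it matches ONE character that is P, A, R or T.
single_chars = re.findall(r'[PART]', sample)
print(single_chars)  # ['P', 'A', 'R', 'T']

# Written without brackets, PART matches the literal four-letter word.
literal = re.findall(r'PART', sample)
print(literal)  # ['PART']

# Repeating a character inside a class adds nothing: [III] behaves like [I].
repeated = re.findall(r'[III]', 'III')
print(repeated)  # ['I', 'I', 'I']
```

This is why the first attempt in the question never sees the numerals as units: each bracketed alternative only ever consumes a single character.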

Assuming your input string is stored in variable s, the following call to re.findall should return all such occurrences in a list:

re.findall(r'PART\s+(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})', s)

Demo: https://regex101.com/r/NGdyw3/2
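As a quick check, the pattern above can be run against a short made-up string. The contents of `s` below are an assumption standing in for the text extracted from the docx file.

```python
import re

# Pattern copied from the answer above.
PATTERN = (
    r'PART\s+(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})'
    r'(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})'
)

# Hypothetical sample text standing in for the document body.
s = """Intro text
PART I
some body text
PART II
PART IIIA
more text
PART IV
PART VII
"""

matches = re.findall(PATTERN, s)
print(matches)
# ['PART I', 'PART II', 'PART III', 'PART IV', 'PART VII']
```

Note that `PART IIIA` comes back as `PART III`: the pattern covers only the Roman numeral, so the trailing section letter is left out.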

blhsing
  • Awesome - but how do I get it to return `PART` along with the roman numerals? I'm using `re.findall` and it only shows the roman numerals. Curious as to how the query would be edited – papelr Mar 04 '19 at 20:01
  • I see. Try using `re.findall` and change all capturing groups to non-capturing ones. I've updated my answer accordingly then. – blhsing Mar 04 '19 at 20:04
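The behaviour discussed in these comments can be reproduced with a simplified pattern. The text and the shortened `(IV|VI)` alternation are illustrative assumptions, not the full pattern from the answer.

```python
import re

text = 'PART IV and PART VI'  # illustrative input

# With a capturing group, re.findall returns only the group's contents.
captured = re.findall(r'PART\s+(IV|VI)', text)
print(captured)  # ['IV', 'VI']

# With a non-capturing group (?:...), re.findall returns the whole match.
whole = re.findall(r'PART\s+(?:IV|VI)', text)
print(whole)  # ['PART IV', 'PART VI']

# Alternative: keep the capturing group and read the full match via finditer.
via_iter = [m.group(0) for m in re.finditer(r'PART\s+(IV|VI)', text)]
print(via_iter)  # ['PART IV', 'PART VI']
```

`re.finditer` is handy when both the full match and the captured numeral are needed, since each match object exposes `group(0)` and `group(1)`.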