I am trying to identify Roman numerals in text with the following regex:
>>> import re
>>> Title = "LXXXIV XC, XCII XXX LXII"
>>> RomanNum = re.findall(r'[\s,]+M{0,4}[CM|CD|D?C{0,3}]?[XC|XL|L?X{0,3}]?[IX|IV|V?I{0,3}]?[\s,]+', Title, re.M|re.I)
>>> RomanNum
[' \t']
I want something like:
['LXXXIV', 'XC', 'XCII', 'XXX', 'LXII']
As far as I understand regular expressions, at least XC should have been matched. XC should match the [XC|XL|L?X{0,3}] part of the regular expression above, and it has whitespace before it and a comma after it, both of which the regex accounts for. What am I missing?
Apart from that, I can achieve the desired result as follows (but with greater complexity, which I want to avoid):
>>> RomanNum = [re.search(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$', TitleElem, re.M|re.I) for TitleElem in re.split(',| ', Title)]
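For reference, here is a self-contained sketch of that workaround that returns the matched strings rather than match objects. The names `title`, `ROMAN`, `tokens` and `roman_nums` are only illustrative. It also makes one change to the line above: it splits on runs of commas and whitespace (`[\s,]+`) and drops empty tokens, because the pattern can match an empty string.

```python
import re

title = "LXXXIV XC, XCII XXX LXII"

# Same pattern as in the workaround: parenthesised groups, anchored to the whole token.
ROMAN = re.compile(
    r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$',
    re.I,
)

# Split on runs of commas/whitespace and discard empty tokens,
# since every part of the pattern is optional and would match ''.
tokens = [t for t in re.split(r'[\s,]+', title) if t]

# Keep only the tokens that are well-formed Roman numerals.
roman_nums = [t for t in tokens if ROMAN.match(t)]
print(roman_nums)
```

This still takes two passes (split, then match), which is the extra complexity I would like to avoid.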
Any help appreciated.