
I am trying to identify Roman numerals in text with the following regex:

>>> Title = "LXXXIV XC, XCII      XXX     LXII"
>>> RomanNum = re.findall(r'[\s,]+M{0,4}[CM|CD|D?C{0,3}]?[XC|XL|L?X{0,3}]?[IX|IV|V?I{0,3}]?[\s,]+', Title, re.M|re.I)
>>> RomanNum
[' \t']

I want something like:

['LXXXIV', 'XC', 'XCII', 'XXX', 'LXII']

As far as I understand regular expressions, at least XC should have matched: it should match the [XC|XL|L?X{0,3}] part of the regex, and the whitespace before it and the comma after it are covered by the [\s,]+ parts. What am I missing?

I can get the desired result as follows, but at the cost of extra complexity that I would like to avoid:

>>> RomanNum = [re.search(r'^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$', TitleElem, re.M|re.I) for TitleElem in re.split(',| ', Title)]

Any help appreciated.

Aman Deep Gautam
  • Is there any reason you can't use `[IVXLCDM]+`? – jonrsharpe Jul 20 '14 at 15:49
  • @jonrsharpe They would not form a valid roman numeral – Aman Deep Gautam Jul 20 '14 at 15:52
  • And have you tried any of the other regexes for matching Roman numerals (e.g. see the Related questions in the right-hand sidebar)? – jonrsharpe Jul 20 '14 at 15:53
  • @jonrsharpe I took the regex from here: http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression. I understood it and tried to modify it for my own use. It does not work (which means I didn't understand it properly, hence the question). – Aman Deep Gautam Jul 20 '14 at 15:59
  • So what exactly is the difference between what that regex does and what you want? The answer provides a step-by-step explanation of how it works - what is the issue? – jonrsharpe Jul 20 '14 at 16:03
  • For what strange reason did you change the parentheses to brackets? – Casimir et Hippolyte Jul 20 '14 at 16:07
  • @CasimiretHippolyte realized the mistake now. – Aman Deep Gautam Jul 20 '14 at 16:14
  • @jonrsharpe I want it to match anywhere in the string (sentence), so instead of the `^` and `$` anchors I need word boundaries (`\b`), which do not seem to work. So while this expression (copied from the other answer) works: `^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$`, this one doesn't: `\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b`. – Aman Deep Gautam Jul 21 '14 at 06:02

3 Answers


Your regex syntax is off at this point:

XC should match [XC|XL|L?X{0,3}]

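because you used square brackets where you meant parentheses. Square brackets define a character class, which matches a single character from the listed set, so the |, ? and {0,3} inside them are literal characters rather than operators. Parentheses are what group alternatives. Change the square brackets to parentheses to correct it.

The same error is repeated in the other bracketed parts of your full regex.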
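
To make the difference concrete, here is a minimal sketch comparing the bracketed fragment from the question with its parenthesized form. The names `char_class` and `group` are only illustrative, and a non-capturing group is used so that nothing depends on capture numbering.

```python
import re

# Square brackets: a character class that matches exactly ONE character
# out of the set X, C, |, L, ?, {, 0, comma, 3, }.
char_class = re.compile(r'[XC|XL|L?X{0,3}]')

# Parentheses: a group of alternatives (XC, or XL, or L?X{0,3}).
group = re.compile(r'(?:XC|XL|L?X{0,3})')

print(char_class.fullmatch('XC'))    # None: a class cannot match two characters
print(char_class.fullmatch('|'))     # a match: '|' is a literal member of the class
print(group.fullmatch('XC'))         # a match for the whole of 'XC'
print(group.fullmatch('LXXX'))       # a match via the L?X{0,3} alternative
```

Note that the comma from `{0,3}` also ends up inside the class, which is one reason the bracketed version can swallow separators unexpectedly.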

Jongware

If you want to find several Roman numerals in a string with the findall or finditer method, one possible pattern is:

(?=[MDCXLVI])(?<![MDCXLVI])M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})(?![MDCXLVI])

It's a bit long and I will explain why I think it is efficient:

(?=[MDCXLVI]) is a lookahead that checks if the position is followed by one of these characters. This lookahead has two functions:

  • The first is to emulate a kind of first-character discrimination that quickly skips every position that doesn't start with one of these characters (this way, the regex engine doesn't need to test every possible starting position with M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})).

  • The second is to check that there is at least one character, since M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}) can match an empty string.

(?<![MDCXLVI]) and (?![MDCXLVI]) act as boundaries that ensure no other "Roman characters" surround the match (otherwise a malformed substring like ILVIII would yield LVIII as a result instead of being skipped entirely). Other kinds of boundaries are possible, such as \b or (?<![^\s,]) and (?![^\s,]), depending on the string format. Note also that the left boundary is placed after (?=[MDCXLVI]) so that it does not defeat the first-character discrimination.

Alternations like CM|CD are reduced to C[MD].

The pattern uses only non-capturing groups (?:...) to save memory and avoid unneeded storage work.
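
As a quick check, the pattern above can be tried against the title from the question. This is only a sketch: the names `ROMAN`, `title` and `numerals` are illustrative, and the pattern is split across several raw string literals purely for readability.

```python
import re

ROMAN = re.compile(
    r'(?=[MDCXLVI])(?<![MDCXLVI])'      # first-character check, then left boundary
    r'M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})'
    r'(?![MDCXLVI])'                    # right boundary
)

title = "LXXXIV XC, XCII      XXX     LXII"
numerals = ROMAN.findall(title)
print(numerals)  # ['LXXXIV', 'XC', 'XCII', 'XXX', 'LXII']

# A malformed run of Roman characters is skipped entirely.
print(ROMAN.findall("ILVIII"))  # []
```

Because the groups are non-capturing, findall returns the full matches directly. The pattern is case-sensitive as written; passing re.I would be needed to mirror the flags used in the question.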

Casimir et Hippolyte
  • For `string="XXXII, LXIII LXV LXVII ABC RAM, ED"` the following: `re.findall('(?:\b)M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})(?:\b)', string)` returns `[]`. Why? – Aman Deep Gautam Jul 21 '14 at 06:03
  • @AmanDeepGautam: because when you don't use a raw string (i.e. `r'....'`), all the backslashes must be escaped: `re.findall('(?=[MDCXLVI])\\bM{0,4...\\b', string)` or `re.findall(r'(?=[MDCXLVI])\bM{0,4...\b', string)`. As an aside, you don't need to put `\b` in a group. – Casimir et Hippolyte Jul 21 '14 at 11:45
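
The raw-string point from the comments above can be reproduced with a short sketch. The variable names (`body`, `not_raw`, `raw`) are illustrative; the first call uses the pattern from the comment as written, and the second uses the raw-string form with the lookahead.

```python
import re

string = "XXXII, LXIII LXV LXVII ABC RAM, ED"
body = r'M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})'

# In a normal string literal, '\b' is the backspace character (U+0008),
# not the two-character regex escape for a word boundary.
print('\b' == '\x08', len(r'\b'))  # True 2

# Non-raw: the regex looks for literal backspace characters, so nothing matches.
not_raw = re.findall('(?:\b)' + body + '(?:\b)', string)
print(not_raw)  # []

# Raw: \b reaches the regex engine as a word boundary.
raw = re.findall(r'(?=[MDCXLVI])\b' + body + r'\b', string)
print(raw)  # ['XXXII', 'LXIII', 'LXV', 'LXVII']
```

The C in ABC, the M in RAM and the D in ED are not reported because no word boundary precedes them.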

Dive Into Python provides a nice regex for detecting Roman numerals, along with a sample script that you can use as a starting point. The script below comes from section 7.5 of my first link.

import re

# Define pattern to detect valid Roman numerals
romanNumeralPattern = re.compile("""
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """, re.VERBOSE)
Andy