I have a text that contains Roman numerals, like I, II, etc., which are sometimes followed by a dot (I.) and sometimes not (I). I want to remove them using regular expressions in Python. I can define the following function, but it seems quite basic and takes a lot of time. Is there another way to remove them?

def clean(text):
  text = text.replace("Ι.", '&')
  text = text.replace("II.", '&')
  text = text.replace("III.", '&')
  text = text.replace("IV.", '&')
  text = text.replace("V.", '&')
  text = text.replace("VI.", '&')
  text = text.replace("VII.", '&')
  text = text.replace("VIII.", '&')
  text = text.replace("IX.", '&')
  text = text.replace("X.", '&')
  text = text.replace("XI.", '&')
  text = text.replace("XII.", '&')
  text = text.replace("XIII", '&')
  text = text.replace("XIV.", '&')

  return text
Wiktor Stribiżew
  • 2
    you could use re.sub with a pattern from this page followed by an optional dot https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression – The fourth bird Jun 19 '21 at 16:24
  • 1
    Since the dot is optional, your actual text will be important here. How will you distinguish between `I` for the number one and `I` the personal pronoun? – Mark Jun 19 '21 at 16:25
  • 1
    Just do `text = text.replace('Ι', '&').replace('V', '&').replace('X', '&')` will replace all occurrences of I, V or X with & –  Jun 19 '21 at 16:36
  • 1
    George, did you use the first `Ι` example deliberately, or was it copied and pasted from some source? This is not the ASCII `I`; it is U+0399 GREEK CAPITAL LETTER IOTA. Now the question becomes rather unclear: do you want to match all possible variations of letters similar to those used in Roman numerals? Actually, [this](https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression) will work for ASCII Roman numerals. – Wiktor Stribiżew Jun 19 '21 at 18:39
  • 1
    @Viktor, thank you so much for your note. I guess this came up because I was doing copy paste. –  Jun 20 '21 at 14:26

2 Answers

Use

import re

def clean(text):
    # Optional thousands, hundreds, tens and units, then an optional trailing dot
    pattern = r"\b(?=[MDCLXVIΙ])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})([IΙ]X|[IΙ]V|V?[IΙ]{0,3})\b\.?"
    return re.sub(pattern, '&', text)

See regex proof. Add more non-standard letters like Ι if necessary.

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [MDCLXVIΙ]               any character of: 'M', 'D', 'C', 'L',
                             'X', 'V', 'I', 'Ι' (U+0399 GREEK CAPITAL
                             LETTER IOTA)
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  M{0,4}                   'M' (between 0 and 4 times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    CM                       'CM'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    CD                       'CD'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    D?                       'D' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    C{0,3}                   'C' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    XC                       'XC'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    XL                       'XL'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    L?                       'L' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    X{0,3}                   'X' (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (U+0399)
--------------------------------------------------------------------------------
    X                        'X'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    [IΙ]                     any character of: 'I', 'Ι' (U+0399)
--------------------------------------------------------------------------------
    V                        'V'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    V?                       'V' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    [IΙ]{0,3}                any character of: 'I', 'Ι' (U+0399)
                             (between 0 and 3 times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \.?                      '.' (optional (matching the most amount
                           possible))
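
One caveat worth checking against your own data: every part of the pattern is optional, so at the start of a word that begins with a numeral letter but is not a valid numeral (MADXXXL, MILLI) it can still succeed with an empty match, and re.sub then inserts & in front of that word. Below is a minimal sketch of one possible guard. The lookbehind before the closing \b and the name ROMAN are my additions, not part of the pattern above, and the groups are made non-capturing only because the captures are not used.

```python
import re

# Same numeral grammar as above. Assumption of this sketch: the lookbehind
# (?<=[MDCLXVIΙ]) requires that the character just before the closing \b is
# a numeral letter, which rules out empty matches at the start of a word.
ROMAN = re.compile(
    r"\b(?=[MDCLXVIΙ])"
    r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:[IΙ]X|[IΙ]V|V?[IΙ]{0,3})"
    r"(?<=[MDCLXVIΙ])\b\.?"
)

def clean(text):
    # Replace each valid numeral (and its optional trailing dot) with '&'
    return ROMAN.sub("&", text)

test = "roman MCMLXXXVIII. roman XVII notroman MADXXXL invalid MILLI"
print(clean(test))  # roman & roman & notroman MADXXXL invalid MILLI
```

Note that a standalone capital I (the pronoun) is still a valid numeral and will be replaced, as pointed out in the comments on the question.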
Ryszard Czech
  • 1
    Thank you so much for your help! Your code is very good, but I ran into a problem with this case: test2 = "roman MCMLXXXVIII. roman XVII notroman MADXXXL invalid MILLI". The output is 'roman & roman & notroman &ADXXXL invalid &LLI'. However, VPfB's code handles this case. Could I adapt your algorithm to avoid it? –  Jun 20 '21 at 14:05
  • @George I think I missed the closing word boundary. I updated the answer. – Ryszard Czech Jun 20 '21 at 20:29

Please read this first: How do you match only valid roman numerals with a regular expression?

If any regexp shown there is good enough, please use that. If not, read on.

Hope this helps, but it is not complete. You have to write the test for valid Roman numerals yourself, because the regexp below finds any combination of Roman numeral letters. Related: Check if an input is a valid roman numeral

import re

MAYBE_ROMAN = re.compile(r'(\b[MDCLXVI]+\b)(\.)?', re.I)  # I = ignore case (optional)

def is_roman(num):
    # TODO!
    return True

def replace_roman(match):
    roman = match.group(1)
    if is_roman(roman):
        return '&' # replacement
    return match.group(0) # unchanged, including the optional dot

test = "roman MCMLXXXVIII. roman XVII notroman MADXXXL invalid MILLI"
result = re.sub(MAYBE_ROMAN, replace_roman, test)
print(result)
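
For the missing is_roman, here is a minimal sketch of one way to fill it in. It assumes the strict grammar for 1 to 4999 that the other answer here also uses, and the name VALID_ROMAN is hypothetical. Because MAYBE_ROMAN is compiled with re.I, the candidate is upper-cased before it is checked.

```python
import re

# Candidate finder as in the answer above (re.I is optional).
MAYBE_ROMAN = re.compile(r'(\b[MDCLXVI]+\b)(\.)?', re.I)

# Assumption: strict grammar for 1..4999 (thousands, hundreds, tens, units).
VALID_ROMAN = re.compile(
    r'M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})'
)

def is_roman(num):
    # Every part of the grammar is optional, so fullmatch would accept "";
    # reject empty input explicitly.
    return bool(num) and VALID_ROMAN.fullmatch(num.upper()) is not None

def replace_roman(match):
    if is_roman(match.group(1)):
        return '&'  # replacement
    return match.group(0)  # unchanged, including any trailing dot

test = "roman MCMLXXXVIII. roman XVII notroman MADXXXL invalid MILLI"
result = MAYBE_ROMAN.sub(replace_roman, test)
print(result)  # roman & roman & notroman MADXXXL invalid MILLI
```

Keep in mind that with re.I ordinary lower-case words made only of numeral letters, such as "mix", also pass the check, so drop the flag if that matters for your text.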
VPfB
  • 2
    thank you so much for your help!!! In MAYBE_ROMAN I don't understand why we need 're.I', is it possible to explain me more? –  Jun 20 '21 at 14:24
  • @George It is an optional "ignore case". You decide, if you need the `re.I` to match also the lower case like `(iii.)` or not. – VPfB Jun 20 '21 at 19:30