
I am trying to split a string so that multi-word proper nouns are recognized as one token. For example, the following code needs to be changed

import re

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
out = re.compile(r"\s").split(s)

print(out)

in order to get this desired outcome:

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

I have found this, but I am not able to incorporate it into my code correctly.

Thanks in advance!

user2864740
Walter
    You could get the matches instead of splitting: `[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+` https://regex101.com/r/iUFDFf/1 but this relies on the preceding words each starting with an uppercase char and on only uppercase chars appearing within the parentheses. See https://ideone.com/SBzheb – The fourth bird Oct 21 '20 at 06:54

1 Answer


You could match consecutive words starting with an uppercase char followed by 1+ lowercase chars with either a space or - in between to get a single match for Multi-Criteria Decision Making.

To match the other words, you can use an alternation | to match 1 or more word characters.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+

Regex demo
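As a quick check of the pattern above, here is a minimal sketch that runs it with `re.findall` on the sentence from the question. The second sentence, `s2`, is made up for illustration and is not from the question; it shows that this pattern merges any run of capitalized words, including a sentence-initial word.

```python
import re

pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+'

# Sentence from the question.
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
print(re.findall(pattern, s))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

# Hypothetical sentence: every adjacent capitalized word is merged into one token.
s2 = 'In Germany Multi-Criteria Decision Making is used.'
print(re.findall(pattern, s2))
# ['In Germany Multi-Criteria Decision Making', 'is', 'used']
```

If that over-merging is a problem for your data, a stricter condition on what follows the phrase can help.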


If the phrase must be followed by 2 or more uppercase chars between parentheses, you could use a positive lookahead.

Note that the lookahead only checks for the presence of uppercase chars; it does not verify that they are the same uppercase chars as the initials of the preceding words.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+

Regex demo | Python demo

import re
 
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+'
print(re.findall(pattern, s))

Output

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']
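Since the lookahead does not compare the abbreviation with the preceding words, one possible way to tighten this is to do the comparison in Python after matching. The following is only a sketch under assumptions: the helper names `initials` and `tokenize` are made up for illustration, and the rule "abbreviation equals the first letters of the words" is an assumption about the data, not something the pattern itself enforces.

```python
import re

PHRASE = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+'
# Group 1: the capitalized phrase; group 2: the abbreviation seen by the lookahead.
TOKEN = re.compile('(' + PHRASE + r')(?= \(([A-Z]{2,})\))|\w+')

def initials(phrase):
    # First letter of each word, splitting on spaces and hyphens.
    return ''.join(word[0] for word in re.split(r'[ -]', phrase))

def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        phrase, abbr = m.group(1), m.group(2)
        if phrase is not None and initials(phrase) != abbr:
            # Abbreviation does not match the initials: fall back to single words.
            tokens.extend(re.findall(r'\w+', phrase))
        else:
            tokens.append(m.group(0))
    return tokens

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
print(tokenize(s))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

print(tokenize('Big Data (XY) grows'))
# ['Big', 'Data', 'XY', 'grows']
```

A limitation of this sketch: if an unrelated capitalized word directly precedes the phrase, it becomes part of the match, the initials no longer agree, and the whole phrase is split into single words.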
The fourth bird
  • Maybe https://ideone.com/XuUop0 will be safer. Based on [how to match abbreviations with their meaning with regex?](https://stackoverflow.com/a/63634264/3832970) – Wiktor Stribiżew Oct 21 '20 at 08:43
  • @WiktorStribiżew Yes, using a backreference to match at least the first uppercase char will be a bit safer indeed. Nice solution :-) ++ – The fourth bird Oct 21 '20 at 09:00