Split sentence by “words”, treating multiple capital words (assumed to be proper nouns) as one

Question

I am trying to split a string into words so that multi-word proper nouns are recognized as one token. For example, the following code needs to be changed

import re

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
out = re.compile(r"\s").split(s)

print(out)

in order to get this desired outcome:

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

I have found this, but I am not able to incorporate it into my code correctly.

Thanks in advance!

You could get the matches instead of splitting: `[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+` https://regex101.com/r/iUFDFf/1 but this relies on repeated capitalized words coming first, followed by only uppercase chars within the parentheses. See https://ideone.com/SBzheb — The fourth bird, Oct 21 '20 at 06:54

score 1 · Accepted Answer · answered Oct 21 '20 at 08:34

You could match consecutive words starting with an uppercase char followed by 1+ lowercase chars with either a space or - in between to get a single match for Multi-Criteria Decision Making.

To match the other words, you can use an alternation | to match 1 or more word characters.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+

Regex demo
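As a quick check, the pattern above can be run against the sentence from the question. This is only a minimal sketch: the names `PATTERN` and `tokenize` and the second sample sentence are made up for illustration.

```python
import re

# Pattern from the answer: capitalized words joined by a space or a hyphen
# form one token; anything else falls back to a run of word characters.
PATTERN = re.compile(r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+')


def tokenize(text):
    """Return all matches; adjacent capitalized words come back as one token."""
    return PATTERN.findall(text)


s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
print(tokenize(s))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

# Hypothetical second sentence: no acronym is needed for the grouping to happen.
print(tokenize('She moved to New York last year.'))
# ['She', 'moved', 'to', 'New York', 'last', 'year']
```

Note that a capitalized word at the start of a sentence is also merged with a capitalized word right after it, for example `'The United States'` becomes a single token.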

If the capitalized words must be followed by 2 or more uppercase chars between parentheses, you could use a positive lookahead.

Note that the lookahead only checks for the presence of uppercase chars; it does not check that they are the initials of the preceding words.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+

Regex demo | Python demo

import re
 
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+'
print(re.findall(pattern, s))

Output

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']
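To make the difference between the two patterns concrete, here is a small comparison on two extra sentences. It is a sketch only: the names `LOOSE` and `STRICT` and both sample sentences are invented for illustration.

```python
import re

# LOOSE: pattern without the lookahead; STRICT: pattern with the lookahead.
LOOSE = re.compile(r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+')
STRICT = re.compile(r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+')

# No parenthesized acronym follows "New York", so STRICT keeps the words apart.
no_acronym = 'She moved to New York last year.'
print(LOOSE.findall(no_acronym))   # ['She', 'moved', 'to', 'New York', 'last', 'year']
print(STRICT.findall(no_acronym))  # ['She', 'moved', 'to', 'New', 'York', 'last', 'year']

# The lookahead accepts any uppercase run in parentheses, even an unrelated one.
wrong_acronym = 'Decision Making (XYZ) helps.'
print(STRICT.findall(wrong_acronym))  # ['Decision Making', 'XYZ', 'helps']
```

The last line illustrates the caveat from the note above: `XYZ` has nothing to do with "Decision Making", yet the words are still grouped.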

Maybe https://ideone.com/XuUop0 will be safer. Based on [how to match abbreviations with their meaning with regex?](https://stackoverflow.com/a/63634264/3832970) — Wiktor Stribiżew, Oct 21 '20 at 08:43
@WiktorStribiżew Yes, using a backreference to match at least the first uppercase char will indeed be a bit safer. Nice solution :-) ++ — The fourth bird, Oct 21 '20 at 09:00
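The backreference idea from these comments could look roughly like the sketch below. This is not the code behind the ideone link, only one possible reading of the idea: the first uppercase char of the phrase is captured and must also be the first char inside the parentheses. The names `CHECKED` and `tokenize_checked` and the second sample sentence are assumptions made for illustration.

```python
import re

# Group 1 captures the first uppercase letter of the phrase; the lookahead
# then requires the parenthesized acronym to start with that same letter (\1).
CHECKED = re.compile(
    r'\b([A-Z])[a-z]+(?:[ -][A-Z][a-z]+)+(?= \(\1[A-Z]+\))|\w+'
)


def tokenize_checked(text):
    """Return the full matches; findall would return only the capture group."""
    return [m.group(0) for m in CHECKED.finditer(text)]


s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
print(tokenize_checked(s))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

# An acronym starting with a different letter no longer triggers the grouping.
print(tokenize_checked('Decision Making (XYZ) helps.'))
# ['Decision', 'Making', 'XYZ', 'helps']
```

Only the first letter is compared here, so an acronym such as `(MXYZ)` would still be accepted for "Multi-Criteria Decision Making".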
