TL;DR: Split on a space using look-arounds: re.compile(r'(?<!,) (?=[A-Z])').split(text). Details are in section 4.
0. Try your regex
Try your regex on regex101 or regexplanet (with Python flavour).
See the demo:
\b(?<!,)[A-Z]
Uses:
\b
word boundary: a zero-width position between a word character and a non-word character (whitespace, comma, dot, etc.) or the start/end of the text
(?<!,)
negative lookbehind: what follows must not be preceded by a comma
[A-Z]
character-range of uppercase letters (A-Z)
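To check which positions this pattern actually selects, a minimal sketch using re.finditer on the sample text from section 1 may help. The variable names here are illustrative only.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Same pattern as in the demo: an uppercase letter at a word boundary
# that is not directly preceded by a comma.
pattern = re.compile(r'\b(?<!,)[A-Z]')

matches = list(pattern.finditer(text))
letters = [m.group() for m in matches]
starts = [m.start() for m in matches]

print(letters)
print(starts)
```

The M of Magistrale does not appear in the matches because the lookbehind rejects it.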
1. Split
In Python:
import re
text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
words = re.split(r'\b(?<!,)[A-Z]', text)
print(words)
Prints:
['', 'arie ', "utto l'anno ", 'essuna ', 'riennale,Magistrale ', 'talia ', 'lta']
OK, there is an empty string at the start of the result.
Whoops, the capital letters got swallowed too... let's fix the splitter. It should not consume the [A-Z] as part of the delimiter:
In Python, how do I split a string and keep the separators?
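One possible way to keep the capitals, sketched here as an assumption rather than the only fix: put [A-Z] in a capturing group. re.split then returns the captured delimiters as part of the list, and each one can be glued back onto the piece that follows it.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# The capturing group makes re.split return the matched capital as well.
parts = re.split(r'\b(?<!,)([A-Z])', text)

# parts[0] is the (empty) text before the first capital; after that the list
# alternates between a captured capital and the text that follows it.
words = [cap + rest for cap, rest in zip(parts[1::2], parts[2::2])]
print(words)
```

The trailing spaces are still part of each item, so a strip() may be needed afterwards.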
2. Find all
re.findall(r'\b(?<!,)[A-Z][^A-Z]*', text)
Prints:
['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,', 'Italia ', 'Alta']
But this swallows everything after the comma: Magistrale is missing, so the result contains 'Triennale,' instead of the expected 'Triennale,Magistrale'.
See also: Split a string at uppercase letters
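If findall is preferred, the pattern could be extended so that an uppercase letter directly after a comma is allowed inside a match. This is only a sketch; the extra alternative (?<=,)[A-Z] is an addition for illustration and not part of the pattern above.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Start at a capital that is not preceded by a comma, then keep consuming
# either non-capitals or a capital that directly follows a comma.
pattern = r'\b(?<!,)[A-Z](?:[^A-Z]|(?<=,)[A-Z])*'

words = re.findall(pattern, text)
print(words)
```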
3. Substitute and split
Use both a lookbehind and a lookahead to mark the split positions, with the NUL character \0 as a temporary delimiter.
re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')
Prints:
['', 'Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
There is still an empty string at the start of the result, because the pattern also matches at the very beginning of the text.
See also: Python regex lookbehind and lookahead
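The leading empty string can simply be filtered out afterwards. A minimal sketch, reusing the substitution from this section:

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Insert a NUL before every capital that starts a word and is not preceded
# by a comma, then split on that NUL.
parts = re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

# Drop empty strings such as the one produced by the match at position 0.
words = [p for p in parts if p]
print(words)
```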
4. Split as intended
re.compile(r'(?<!,) (?=[A-Z])').split(text)
Prints:
['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']
Now the separating spaces are swallowed (e.g. the result contains 'Varie' instead of 'Varie ') because the space itself was the delimiter.
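If the trailing spaces should be kept, one option is to split on a zero-width position instead of on the space itself. The sketch below assumes Python 3.7 or later, where re.split also splits on empty matches. The pattern is a variation for illustration and not one of the patterns above.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Split at the position that comes right after "<non-comma><space>" and
# right before a capital letter. Nothing is consumed, so the space stays
# attached to the preceding item.
pattern = re.compile(r'(?<=[^,] )(?=[A-Z])')

words = pattern.split(text)
print(words)
```

Because the lookbehind needs two characters before the split position, it cannot match at the start of the text, so no empty string is produced here.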