Python splitting a phrase into words if there is a capital letter, but not if there is a comma between them

Question

I am trying to split a phrase into words.

Example

Input is:

Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta

Expected output is:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale, Magistrale ', 'Italia ', 'Alta']

Disclaimer

I managed to split the phrase into words at the capital letters. However, some of the words are joined by a comma, and I need to keep those together, so I would like to avoid splitting in that case. I am just starting with this kind of thing, and the re.split function is really confusing.

What I tried

I have tried the following:

re.split('(?=[A-Z])+|[,]',text )

However, this doesn't do what I want: it also splits wherever it encounters a comma, which is the opposite of what I am trying to do.

How can I do that?

input is [Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta], expected output is: ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale, Magistrale ', 'Italia ', 'Alta'] — mlp, Feb 15 '22 at 20:50

hc_dev · Answer 1 · 2022-02-15T22:16:07.507

TL;DR: Split by space with look-arounds: re.compile(r'(?<!,) (?=[A-Z])').split(text), details in section 4.

0. Try your regex

Try your regex on regex101 or regexplanet (with Python flavour).

See the demo:

\b(?<!,)[A-Z]

Uses:

\b word boundary: a zero-width position between a word character and a non-word character (whitespace, comma, dot, etc.), or at the start or end of the text
(?<!,) negative lookbehind: the position must not be preceded by a comma
[A-Z] character-range of uppercase letters (A-Z)
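
To see what this pattern actually matches, the small sketch below lists the letter found at each match. The variable names are only illustrative and are not part of the original demo.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# A capital letter at a word boundary that is not directly preceded by a comma.
pattern = re.compile(r'\b(?<!,)[A-Z]')

# Collect the letter found at each match position.
matched_letters = [m.group() for m in pattern.finditer(text)]
print(matched_letters)
# ['V', 'T', 'N', 'T', 'I', 'A']  (the 'M' right after the comma is skipped)
```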

1. Split

In Python:

import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

words = re.split(r'\b(?<!,)[A-Z]', text)
print(words)

Prints:

['', 'arie ', "utto l'anno ", 'essuna ', 'riennale,Magistrale ', 'talia ', 'lta']

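OK, there is an empty string at the start of the result, because the first match sits at the very beginning of the text.

Whoops, the capital letters got swallowed too ... let's fix the splitter. It should not consume the [A-Z] as part of the delimiter.

See also: In Python, how do I split a string and keep the separators?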
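
One way to keep the separators, sketched here as an illustration rather than as part of the original answer, is to put the capital letter in a capturing group. re.split then returns the captured letters as well, and each letter can be glued back onto the text that follows it. The names parts and words are assumptions made for this example.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# With a capturing group, re.split also returns the matched capital letters:
# ['', 'V', 'arie ', 'T', "utto l'anno ", ...]
parts = re.split(r'\b(?<!,)([A-Z])', text)

# parts[0] is the text before the first capital (empty here) and is dropped;
# after that, letters and remainders alternate, so pair them up again.
words = [letter + rest for letter, rest in zip(parts[1::2], parts[2::2])]
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```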

2. Find all

re.findall(r'\b(?<!,)[A-Z][^A-Z]*', text)

Prints:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,', 'Italia ', 'Alta']

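But this drops everything after the comma: Magistrale is missing, so the result has Triennale, where Triennale,Magistrale was expected. The [^A-Z]* stops before the M, and the lookbehind then rejects a match starting at M.

See also: Split a string at uppercase letters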
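
A possible repair of the findall variant, offered as a sketch and not taken from the original answer, is to exclude the comma from the character class and then explicitly allow further comma-joined capitalised chunks. It assumes every comma is followed by a capitalised word; text after a comma that starts with a lowercase letter would be dropped.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# A capitalised chunk, optionally followed by more chunks joined by a comma.
# Assumption: every comma is followed (after optional spaces) by a capital letter.
pattern = r'[A-Z][^A-Z,]*(?:,\s*[A-Z][^A-Z,]*)*'

words = re.findall(pattern, text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```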

3. Substitute and split

Use both a lookbehind and a lookahead to insert the NUL character \0 as a delimiter at each split position, then split on it.

re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

Prints:

['', 'Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']

There is still an empty string at the start of the result, because a delimiter is also inserted before the first word.

See also: Python regex lookbehind and lookahead
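
The substitute-and-split idea can be wrapped in a small helper that drops the empty strings and checks that the delimiter does not already occur in the text. This is a minimal sketch; the function name split_before_capitals and its delimiter parameter are hypothetical and chosen only for this example.

```python
import re

def split_before_capitals(text, delimiter='\0'):
    """Hypothetical helper: split before each capital not preceded by a comma."""
    # The approach only works if the delimiter cannot be confused with real text.
    if delimiter in text:
        raise ValueError("delimiter already occurs in the text")
    # A function as replacement avoids any escape processing of the delimiter.
    marked = re.sub(r'(?<!,)\b(?=[A-Z])', lambda match: delimiter, text)
    # Drop the empty string produced by the delimiter before the first word.
    return [part for part in marked.split(delimiter) if part]

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
words = split_before_capitals(text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```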

4. Split as intended

re.compile(r'(?<!,) (?=[A-Z])').split(text)

Prints:

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']

Now the spaces are swallowed (e.g. the result has 'Varie' instead of 'Varie ') because the space itself is the delimiter.
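
If the trailing spaces should be kept, as in the question's expected output, one option is to split on a zero-width position instead of on the space itself. The sketch below assumes Python 3.7 or later, where re.split can split on empty matches. It leaves the text unchanged, so no space is added after the comma.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Zero-width split point: preceded by whitespace (but not by a comma plus
# whitespace) and followed by a capital letter. Nothing is consumed,
# so each word keeps its trailing space.
splitter = re.compile(r'(?<=\s)(?<!,\s)(?=[A-Z])')

words = splitter.split(text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```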

score 0 · Answer 2 · answered Feb 15 '22 at 21:21

I doubt this is the best approach, but it does solve your problem. This is what I came up with:

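import re

# Named "text" rather than "input" to avoid shadowing the built-in function.
text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
# Wrap capitalised chunks delimited by whitespace (or the string edges) in newlines.
output = re.sub(r'(\s|^)([A-Z][^A-Z]*)($|\s)', r'\n\2\n', text)
# Split on the inserted newlines and drop the empty strings.
output = re.split("\n", output)
output = list(filter(None, output))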

output

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']
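
To follow what the substitution does, it can help to look at the intermediate string before it is split. The sketch below only re-runs the re.sub step from above and prints the result; the name marked is an assumption for this example. Each match consumes the whitespace on both sides, so only some chunks are matched directly, and the others end up between the inserted newlines.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Same substitution as above: surround matched chunks with newlines.
marked = re.sub(r'(\s|^)([A-Z][^A-Z]*)($|\s)', r'\n\2\n', text)
print(repr(marked))
# "\nVarie\nTutto l'anno\nNessuna\nTriennale,Magistrale\nItalia\nAlta"
```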

TheFaultInOurStars

The "substitute and split" is a solid and robust approach. – hc_dev Feb 15 '22 at 21:25
@hc_dev Yeah, and so lengthy! There should be a shorter approach, I guess. – TheFaultInOurStars Feb 15 '22 at 21:27
