Split string at capital letter but only if no whitespace

Question

Set-up

I've got a string of names which need to be separated into a list.

Following this answer, I have,

import re

string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)

where the last line gives me,

['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']

Problems

1) Whitespace is ignored

'Prenzlauer Berg' is actually one name, but the code splits it at the capital 'B' because it follows the 'split-at-capital-letter' rule.

How do I prevent a split at a capital letter when the preceding character is whitespace?

2) Special characters not handled well

The code used cannot handle 'ö'. How do I include such 'German' characters?

I.e. I want to obtain,

['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

score 3 · Accepted Answer · answered Nov 27 '17 at 10:50

You can use positive and negative lookbehind and just list the Umlauts explicitly:

>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

(?<!\s)...: matches ... that is not preceded by \s

(?<=\s)...: matches ... that is preceded by \s

(?:...): non-capturing group so as to not mess with the findall results
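
Since the goal is to split rather than to match, the same lookaround idea can also be sketched with re.split. This is only a sketch and not part of the answer above: it assumes Python 3.7 or later (where re.split can split on zero-width matches), and the names UPPER, BOUNDARY and split_names are made up for illustration.

```python
import re

# Hypothetical helper names. The uppercase set is listed explicitly,
# as in the answer above (A-Z plus the German umlauts).
UPPER = 'A-ZÄÖÜ'

# Zero-width split point: preceded by a non-whitespace character
# and followed by an uppercase letter.
BOUNDARY = re.compile(r'(?<=\S)(?=[' + UPPER + r'])')

def split_names(text):
    """Split text before each capital that does not follow whitespace."""
    return BOUNDARY.split(text)

print(split_names('KreuzbergLichtenbergNeuköllnPrenzlauer Berg'))
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

Unlike the findall version, nothing is dropped here: characters the pattern does not list simply stay in whichever piece they fall in.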

score 0 · Answer 2 · answered Nov 27 '17 at 10:56

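This works for the example. The pattern treats a name as either two capitalized words separated by one whitespace character, or a single capitalized word: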

import re

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
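
One thing worth checking before relying on the [a-ü] range: it covers every code point from 'a' (U+0061) through 'ü' (U+00FC), which is more than the lowercase letters. The sketch below probes a few characters; the helper in_range and the test string 'KreuzbergÄhrenfeld' are invented for illustration and do not come from the question.

```python
import re

# The character class from the pattern above, on its own.
RANGE = re.compile(r'[a-ü]')

def in_range(ch):
    """Return True if the single character ch falls inside [a-ü]."""
    return RANGE.fullmatch(ch) is not None

# Lowercase letters match, but so do 'Ä' and '{'; 'Z' does not.
for ch in ['ö', 'ß', 'Ä', '{', 'Z']:
    print(repr(ch), in_range(ch))

# Consequence: a name starting with an uppercase umlaut is not split off,
# because 'Ä' is consumed as if it were a lowercase letter.
pattern = r'[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+'
result = re.findall(pattern, 'KreuzbergÄhrenfeld')
print(result)
# ['KreuzbergÄhrenfeld']
```

If names like that can occur in the data, listing the allowed letters explicitly (as in the accepted answer) is probably the safer choice.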

optimalic
