
Set-up

I've got a string of names which need to be separated into a list.

Following this answer, I have,

import re

string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)

where the last line gives me,

['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']

Problems

1) Whitespace is ignored

'Prenzlauer Berg' is actually one name, but the code splits it according to the 'split-at-capital-letter' rule.

How do I prevent a split at a capital letter when the preceding character is whitespace?

2) Special characters not handled well

The code used cannot handle 'ö'. How do I include such 'German' characters?

I.e. I want to obtain,

['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
LucSpan

2 Answers


You can use positive and negative lookbehinds and list the umlauts explicitly:

>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

(?<!\s)...: matches ... that is not preceded by \s

(?<=\s)...: matches ... that is preceded by \s

(?:...): non-capturing group so as to not mess with the findall results
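
For readability, the same pattern can be laid out with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the names NAME_PATTERN and split_names are illustrative and not part of the answer.

```python
import re

# Hypothetical names; the regex itself is the one from the answer, spelled out verbosely.
NAME_PATTERN = re.compile(
    r"""
    (?<!\s)              # start only where the previous character is not whitespace
    [A-ZÄÖÜ]             # a capital letter opens a name
    (?:
        [a-zäöüß\s]      # lowercase letters or whitespace continue the name
      | (?<=\s)[A-ZÄÖÜ]  # a capital continues it only directly after whitespace
    )*
    """,
    re.VERBOSE,
)


def split_names(text):
    """Return the names found in a concatenated string."""
    return NAME_PATTERN.findall(text)


print(split_names('KreuzbergLichtenbergNeuköllnPrenzlauer Berg'))
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

Because \s is allowed inside a name, trailing whitespace in the input would be kept as part of the last match, so calling .strip() on each result may be worthwhile.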

user2390182

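This works for the example string. Note that the range [a-ü] spans every code point from U+0061 to U+00FC, so it also matches characters such as { and ~ and accented capitals like Ä.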

import re

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# Raw string so that \s reaches the regex engine unchanged
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
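
How much the [a-ü] range covers can be checked directly. The sketch below also tries an explicit character class instead. That class is an assumption about the intent (German letters only), not something the answer states, and the variable names are illustrative.

```python
import re

# [a-ü] covers every code point from U+0061 to U+00FC, not only lowercase letters.
wide = re.compile(r'[a-ü]')
matched = [ch for ch in '~{Äö' if wide.fullmatch(ch)]
print(matched)  # ['~', '{', 'Ä', 'ö']

# Explicit classes limited to German letters (assumed intent).
explicit = r'[A-ZÄÖÜ][a-zäöüß]+\s[A-ZÄÖÜ][a-zäöüß]+|[A-ZÄÖÜ][a-zäöüß]+'
names = re.findall(explicit, 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg')
print(names)  # ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

Like the original pattern, this variant joins at most two words, so a three-word name would still be split.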
optimalic