
Hello, I have a string of full names run together with no separator between them:

string='Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'

I would like to split it into full names (first and last name together) to get output like this:

['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']

I tried using this code:

splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', string)).split()

but that returns this result:

['Christof', 'Koch', 'Jonathan', 'Harel', 'Moran', 'Cerf', 'Wolfgang', 'Einhaeuser']

Instead, I would like each full name to be a single item.

Any suggestions? Thanks

leena

2 Answers


You can use a lookahead after a lowercase letter to check whether it is immediately followed by an uppercase letter or the end of the string, such as `[a-zA-Z\s]+?[a-z](?=[A-Z]|$)` (more specific) or `.+?[a-z](?=[A-Z]|$)` (broader).

import re

string = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'

print(re.findall(r".+?[a-z](?=[A-Z]|$)", string)) 
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
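
To see how the two patterns differ, here is a small comparison sketch. The second sample string (with a middle initial) is taken from the longer list in the comments; the names `SPECIFIC` and `BROAD` are just labels chosen for this illustration.

```python
import re

# Labels chosen for this illustration only.
SPECIFIC = r"[a-zA-Z\s]+?[a-z](?=[A-Z]|$)"
BROAD = r".+?[a-z](?=[A-Z]|$)"

plain = 'Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'
with_initial = 'David A. ForsythDuan Tran'

# Both patterns agree when names contain only letters and spaces.
print(re.findall(SPECIFIC, plain) == re.findall(BROAD, plain))
# -> True

# The specific pattern cannot match across the '.', so "David A." is lost.
print(re.findall(SPECIFIC, with_initial))
# -> [' Forsyth', 'Duan Tran']

# The broad pattern accepts any character before the final lowercase letter.
print(re.findall(BROAD, with_initial))
# -> ['David A. Forsyth', 'Duan Tran']
```

So the broader pattern is the safer default here, at the cost of accepting anything (digits, punctuation) inside a name.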

That said, definitely check out Falsehoods Programmers Believe About Names; depending on your data, the assumption that a lowercase letter followed by an uppercase letter marks a name boundary may not hold.
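
As a rough illustration of where the lower->upper assumption breaks, here is a sketch with two made-up inputs (they are not from the original data): one name with an internal capital and one name that starts with a lowercase letter.

```python
import re

PATTERN = r".+?[a-z](?=[A-Z]|$)"

# Hypothetical input: "McDonald" contains a lowercase->uppercase
# transition, so the regex splits in the middle of the surname.
print(re.findall(PATTERN, 'Ronald McDonaldJohn Smith'))
# -> ['Ronald Mc', 'Donald', 'John Smith']

# Hypothetical input: "bell hooks" starts with a lowercase letter,
# so no boundary is detected and the two names are merged.
print(re.findall(PATTERN, 'John Smithbell hooks'))
# -> ['John Smithbell hooks']
```

Neither case can be fixed by the regex alone; it would need extra knowledge about the names (for example, a list of known authors).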


For your list of strings in this format from the comments, just add a list comprehension. The regex I provided above happens to be robust to the middle initials without modification (but I have to emphasize that if your dataset is enormous, that might not hold).

import re

names = ['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Za?d HarchaouiC?line Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']

result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]

for name in result:
    print(name)

Output:

['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']
['Za?d Harchaoui', 'C?line Levy-leduc']
['David A. Forsyth', 'Duan Tran']
['Arnold Smeulders', 'Sennay Ghebreab', 'Pieter Adriaans']
['Peter L. Bartlett', 'Ambuj Tewari']
['Javier R. Movellan', 'Paul L. Ruvolo', 'Ian Fasel']
['Deli Zhao', 'Xiaoou Tang']

And if you want all of these names in one list, add

flattened = [x for y in result for x in y]
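
If the nested comprehension is hard to read, `itertools.chain.from_iterable` from the standard library does the same flattening. This is only an alternative sketch; the shortened `names` list below reuses two entries from the list above.

```python
import re
from itertools import chain

names = ['Peter L. BartlettAmbuj Tewari', 'Deli ZhaoXiaoou Tang']
result = [re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]

# chain.from_iterable walks each inner list in order and yields its items.
flattened = list(chain.from_iterable(result))
print(flattened)
# -> ['Peter L. Bartlett', 'Ambuj Tewari', 'Deli Zhao', 'Xiaoou Tang']
```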
ggorlen
  • Thank you, is there a way to apply this to a long list of names like this one? `['Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser', 'Za?d HarchaouiC?line Levy-leduc', 'David A. ForsythDuan Tran', 'Arnold SmeuldersSennay GhebreabPieter Adriaans', 'Peter L. BartlettAmbuj Tewari', 'Javier R. MovellanPaul L. RuvoloIan Fasel', 'Deli ZhaoXiaoou Tang']` – leena Sep 13 '19 at 01:24
  • 1
    OK, that list has a lot of additional complexity not mentioned in your original post. Can you update the post to reflect all of the edge cases you're going to encounter? See the link in my post--this is a slippery slope. Having said that, I think `[re.findall(r".+?[a-z](?=[A-Z]|$)", x) for x in names]` still works for this (same regex, add a list comprehension). – ggorlen Sep 13 '19 at 01:25
  • Yes, `Alcide d'Orbigny` for example. https://en.wikipedia.org/wiki/Alcide_d%27Orbigny The `d` is capitalized only at the beginning of a sentence it seems. – Mark Sep 13 '19 at 01:40

It will most likely produce false positives and false negatives, but it may be OK to start with:

[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*

Test

import re

# Exclude \r and \n as well, so a match cannot run past the end of a line
expression = r"[A-Z][^A-Z\r\n]*\s+[A-Z][^A-Z\r\n]*"

string = """

Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser
"""

print(re.findall(expression, string))

Output

 ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch at this link how it matches against some sample inputs.
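
For readers who prefer to stay in Python, the same expression can be written with `re.VERBOSE` so that each part carries a comment. This is a sketch of one way to read the pattern, not output from regex101; the second input is borrowed from the list in the comments on the other answer and shows one of the failure cases mentioned above.

```python
import re

expression = re.compile(r"""
    [A-Z]          # an uppercase letter starts the first word
    [^A-Z\r\n]*    # then anything except uppercase letters or line breaks
    \s+            # whitespace between the two words
    [A-Z]          # an uppercase letter starts the second word
    [^A-Z\r\n]*    # the rest of the second word
""", re.VERBOSE)

print(expression.findall('Christof KochJonathan HarelMoran CerfWolfgang Einhaeuser'))
# -> ['Christof Koch', 'Jonathan Harel', 'Moran Cerf', 'Wolfgang Einhaeuser']

# A middle initial counts as the "second word", so the surname is dropped.
print(expression.findall('David A. ForsythDuan Tran'))
# -> ['David A. ', 'Duan Tran']
```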


Emma