0

I'm trying to figure how to split name from surname, into two new dataframe columns.

name is always in UPPERCASE, whilst surname is in title case. Without losing information.

There are a number of Stack Overflow questions, but I'm not certain how to use them with a pandas dataframe column:

for example:

data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

       Naam aanvrager
0     DREGGHE Joannes
1          MAHIEU Leo
2       NIEUWENHUIJSE
3             COPPENS
4  VERBURGHT Cornelis
5    NUYTTENS Adriaen
6  DE LARUELLE Pieter
7          VAN VIJVER
8      SILBO Martinus
9   STEEMAERE Anthone

the wanted output (two extra columns "Name" and "Surname"):

name surname
DREGGHE Joannes
MAHIEU Leo
NIEUWENHUIJSE
COPPENS
VERBURGHT Cornelis
NUYTTENS Adriaen
DE LAURELLE Pieter
VAN VIJVER
SILBO Martinus
STEEMAERE Anthone
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158

2 Answers2

2
  • Use pandas.Series.str.extractall, which will extract multiple capture groups in the regex pattern, as new columns.
  • The pattern may also extract extra whitespace, which must be removed with .str.strip()
    • Without strip: df.iloc[0, 2] → 'DREGGHE '
import pandas as pd

# sample dataframe
data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

# extract names
df[['First Name','Last Name']] = df['Naam aanvrager'].str.extractall(r'(\b[A-Z ]+\b)(\w+)*').reset_index()[[1,0]]

# the pattern to extract the Last Name may include extra whitespace, which can be removed as follows
df['Last Name'] = df['Last Name'].str.strip()

# display(df)
       Naam aanvrager First Name      Last Name
0     DREGGHE Joannes    Joannes        DREGGHE
1          MAHIEU Leo        Leo         MAHIEU
2       NIEUWENHUIJSE        NaN  NIEUWENHUIJSE
3             COPPENS        NaN        COPPENS
4  VERBURGHT Cornelis   Cornelis      VERBURGHT
5    NUYTTENS Adriaen    Adriaen       NUYTTENS
6  DE LARUELLE Pieter     Pieter    DE LARUELLE
7          VAN VIJVER        NaN     VAN VIJVER
8      SILBO Martinus   Martinus          SILBO
9   STEEMAERE Anthone    Anthone      STEEMAERE
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
rhug123
  • 7,893
  • 1
  • 9
  • 24
-1

I would use Regex for this:

import re
regex = r"[A-Z]{2,}"
last_name = re.findall(regex, name_line) # This will return an array you need to join
last_name = ' '.join(last_name) # Remove that from your line
name_line.replace(last_name, '').strip() # Remove any trailing spaces
dengar81
  • 2,485
  • 3
  • 18
  • 23
  • You may have hyphenated names, I guess. You can add the `-` to the regex: `r"[A-Z\-]{2,}` And yes, can't be a single character last name :) – dengar81 Feb 02 '21 at 17:02