How to extract uppercase and title case string sections into separate columns

Question

I'm trying to figure how to split name from surname, into two new dataframe columns.

name is always in UPPERCASE, whilst surname is in title case. Without losing information.

There are a number of Stack Overflow questions, but I'm not certain how to use them with a pandas dataframe column:

for example:

data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

       Naam aanvrager
0     DREGGHE Joannes
1          MAHIEU Leo
2       NIEUWENHUIJSE
3             COPPENS
4  VERBURGHT Cornelis
5    NUYTTENS Adriaen
6  DE LARUELLE Pieter
7          VAN VIJVER
8      SILBO Martinus
9   STEEMAERE Anthone

the wanted output (two extra columns "Name" and "Surname"):

name	surname
DREGGHE	Joannes
MAHIEU	Leo
NIEUWENHUIJSE
COPPENS
VERBURGHT	Cornelis
NUYTTENS	Adriaen
DE LAURELLE	Pieter
VAN VIJVER
SILBO	Martinus
STEEMAERE	Anthone

score 2 · Accepted Answer · edited Feb 02 '21 at 17:51

Use pandas.Series.str.extractall, which will extract multiple capture groups in the regex pattern, as new columns.
The pattern may also extract extra whitespace, which must be removed with .str.strip()
- Without strip: df.iloc[0, 2] → 'DREGGHE '

import pandas as pd

# sample dataframe
data = {'Naam aanvrager': ['DREGGHE Joannes', 'MAHIEU Leo', 'NIEUWENHUIJSE', 'COPPENS', 'VERBURGHT Cornelis', 'NUYTTENS Adriaen', 'DE LARUELLE Pieter', 'VAN VIJVER', 'SILBO Martinus', 'STEEMAERE Anthone']}
df = pd.DataFrame(data)

# extract names
df[['First Name','Last Name']] = df['Naam aanvrager'].str.extractall(r'(\b[A-Z ]+\b)(\w+)*').reset_index()[[1,0]]

# the pattern to extract the Last Name may include extra whitespace, which can be removed as follows
df['Last Name'] = df['Last Name'].str.strip()

# display(df)
       Naam aanvrager First Name      Last Name
0     DREGGHE Joannes    Joannes        DREGGHE
1          MAHIEU Leo        Leo         MAHIEU
2       NIEUWENHUIJSE        NaN  NIEUWENHUIJSE
3             COPPENS        NaN        COPPENS
4  VERBURGHT Cornelis   Cornelis      VERBURGHT
5    NUYTTENS Adriaen    Adriaen       NUYTTENS
6  DE LARUELLE Pieter     Pieter    DE LARUELLE
7          VAN VIJVER        NaN     VAN VIJVER
8      SILBO Martinus   Martinus          SILBO
9   STEEMAERE Anthone    Anthone      STEEMAERE

score -1 · Answer 2 · answered Feb 02 '21 at 17:01

-1

I would use Regex for this:

import re
regex = r"[A-Z]{2,}"
last_name = re.findall(regex, name_line) # This will return an array you need to join
last_name = ' '.join(last_name) # Remove that from your line
name_line.replace(last_name, '').strip() # Remove any trailing spaces

answered Feb 02 '21 at 17:01

dengar81

2,485
3
18
23

You may have hyphenated names, I guess. You can add the `-` to the regex: `r"[A-Z\-]{2,}` And yes, can't be a single character last name :) – dengar81 Feb 02 '21 at 17:02

How to extract uppercase and title case string sections into separate columns

2 Answers2