How to split a string address column using regex in pandas

Question

I have the following data frame, which contains an address column:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(10))
df["address"] = "Iso Omena 8 a 2"

I need to split it into separate columns so that the resulting data frame looks like this:

address          street_name  building_number door_number_letter appartment_numner
Iso Omena 8 a 2  Iso Omena    8                  a                2

What makes it tricky is that:

1. Street names may or may not contain a space, as in the example above.

2. door_number_letter is sometimes a number rather than a letter (e.g. "Iso Omena 8 5 2").

The most complete form of a row is: [address, street_name, building_number, door_number_letter, appartment_numner]

Also, sometimes the building number and door number may appear before the street name. Or do you have a completely fixed pattern for how they appear? — Ankur Sinha, May 15 '18 at 09:06
@trollster Unfortunately I don't have any fixed pattern, but the form above is the most complete format in which an address can appear. It always starts with the street name, but as I mentioned, door_number_letter is sometimes a number rather than a letter. — chessosapiens, May 15 '18 at 09:09
But street names do not contain numbers, right? So a greedy search for everything that consists of lowercase or uppercase letters and whitespace would return the street, and the rest is the space-separated building/door/apartment, right? — SpghttCd, May 15 '18 at 09:12

SpghttCd · Accepted Answer · 2019-06-11T05:48:33.667

Assuming the street name consists of letters and spaces only, the rest is space-separated, and the building number always starts with a digit, this can be achieved as follows:

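import re

s = ['Iso Omena 8 a 2', 'Xstreet 2', 'Isö Ømenå 8 a 2']
for addr in s:
    # The street is the leading run of non-digit characters.
    street = re.findall(r'[^\d]*', addr)[0].strip()
    # Everything after the street is space-separated: building, door, apartment.
    rest = addr[len(street):].strip().split(' ')
    print(street, rest)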

# Iso Omena ['8', 'a', '2']
# Xstreet ['2']
# Isö Ømenå ['8', 'a', '2']

Or if you want to have everything in one dataframe:

import re
import pandas as pd

df = pd.DataFrame()

df['address'] = ['Iso Omena 8 a 2', 'Xstreet 2', 'Asdf 7 c', 'Isö Ømenå 8 a 2']
# Pre-create the target columns so that missing parts stay None.
for col in ['street', 'building', 'door', 'appartment']:
    df[col] = None
for i, s in enumerate(df['address']):
    street = re.findall(r'[^\d]*', s)[0].strip()
    df.loc[i,('street')] = street
    # zip stops at the shorter sequence, so absent parts are simply skipped.
    for col, val in zip(['building', 'door', 'appartment'], s[len(street):].strip().split(' ')):
        df.loc[i,(col)] = val

#            address     street building  door appartment
# 0  Iso Omena 8 a 2  Iso Omena        8     a          2
# 1        Xstreet 2    Xstreet        2  None       None
# 2         Asdf 7 c       Asdf        7     c       None
# 3  Isö Ømenå 8 a 2  Isö Ømenå        8     a          2
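
The row-by-row loop above can also be written with pandas string methods. The following is only a sketch under the same assumptions (the street contains no digits and the building number starts with one); the names `parts`, `tokens` and `out` are made up for illustration, and the column spelling follows the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"address": ["Iso Omena 8 a 2", "Xstreet 2", "Asdf 7 c", "Isö Ømenå 8 a 2"]}
)

# street: the non-digit text before the first digit; rest: everything from that digit on.
parts = df["address"].str.extract(r"^(?P<street>\D+?)\s+(?P<rest>\d.*)$")

# Split the remainder on whitespace; reindex pads or trims to exactly three columns.
tokens = parts["rest"].str.split(expand=True).reindex(columns=range(3))
tokens.columns = ["building", "door", "appartment"]

out = pd.concat([df, parts[["street"]], tokens], axis=1)
print(out)
```

Rows that do not match the pattern at all (for example an address without any number) come back with missing values instead of raising an error.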

EDIT: To keep only the part of the building number to the left of a '-' sign:

You could replace df.loc[i,(col)] = val with

df.loc[i,(col)] = re.findall('[^-]*', val)[0]

if this also suits door and appartment. Otherwise you would have to test for col == 'building' and use this version only in that case.
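
The `col == 'building'` test mentioned above might look like the sketch below. `split_address` is a hypothetical helper, not part of the original answer, and it assumes the address has no leading whitespace.

```python
import re

def split_address(addr):
    """Hypothetical helper: split an address into street, building, door, appartment."""
    # The street is the leading run of non-digit characters.
    street = re.match(r"\D*", addr).group().strip()
    tokens = addr[len(street):].split()
    result = {"street": street, "building": None, "door": None, "appartment": None}
    for col, val in zip(["building", "door", "appartment"], tokens):
        if col == "building":
            # Keep only the part left of '-', so '1-3' becomes '1'.
            val = val.split("-")[0]
        result[col] = val
    return result

print(split_address("Xstreet 1-3"))
print(split_address("Iso Omena 8 a 2"))
```

Only the building token is trimmed here; door and appartment values are stored unchanged.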

Is it possible to change the code so that it accepts different addresses? For example, sometimes we might have only a street name and a building number: `Xstreet 2` — chessosapiens, May 16 '18 at 08:10
Perfect, thank you! In case we sometimes see the pattern `Xstreet 1-3`, how can we split the building number so that we throw away the number after the `-`? In this case, drop `3` and keep only `1`. — chessosapiens, May 17 '18 at 08:05
There are also some Scandinavian characters like ä and ö in the addresses. The above approach does not work properly with these characters, for example: `Isö Omenä 8 a 1` — chessosapiens, May 17 '18 at 10:02

score 2 · Answer 2 · answered May 15 '18 at 09:38

You can use:

In [116]: s1 = df.address.str.findall(r'([\w ]+?) +(\d+) +([\d\w]+) +(\d+)').map(lambda s: s[0])

In [117]: s1
Out[117]: 
0    (Iso Omena, 8, a, 2)
1    (Iso Omena, 8, a, 2)
2    (Iso Omena, 8, a, 2)
3    (Iso Omena, 8, a, 2)
4    (Iso Omena, 8, a, 2)
5    (Iso Omena, 8, a, 2)
6    (Iso Omena, 8, a, 2)
7    (Iso Omena, 8, a, 2)
8    (Iso Omena, 8, a, 2)
9    (Iso Omena, 8, a, 2)
Name: address, dtype: object

Then construct a dataframe based on these columns:

In [118]: pd.DataFrame(s1.values.tolist(), index=s1.index, columns=['street_name', 'building_number', 'door_number_letter', 'appartment_numner'])
Out[118]: 
  street_name building_number door_number_letter appartment_numner
0   Iso Omena               8                  a                 2
1   Iso Omena               8                  a                 2
2   Iso Omena               8                  a                 2
3   Iso Omena               8                  a                 2
4   Iso Omena               8                  a                 2
5   Iso Omena               8                  a                 2
6   Iso Omena               8                  a                 2
7   Iso Omena               8                  a                 2
8   Iso Omena               8                  a                 2
9   Iso Omena               8                  a                 2
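
One caveat with the `findall(...).map(lambda s: s[0])` step: for a row that does not match, `findall` returns an empty list and `s[0]` raises an `IndexError`. A possible alternative, sketched here with the same pattern rewritten as named groups (the sample rows are made up for illustration), is `str.extract`, which returns missing values for such rows.

```python
import pandas as pd

df = pd.DataFrame({"address": ["Iso Omena 8 a 2", "Iso Omena 8 5 2", "Xstreet 2"]})

# Same four-part pattern as above; group names follow the question's column spelling.
pattern = (
    r"(?P<street_name>[\w ]+?) +"
    r"(?P<building_number>\d+) +"
    r"(?P<door_number_letter>\w+) +"
    r"(?P<appartment_numner>\d+)"
)
out = df["address"].str.extract(pattern)
print(out)
```

The third row has only two parts, so all four of its columns are missing rather than causing an exception.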

score 2 · Answer 3 · answered May 15 '18 at 10:16

Taking some inspiration from this answer I came up with this regex+extract solution:

In [77]: df.loc[1, 'address'] = 'Big Apple 19 21 7'

In [78]: df.address.str.extract('(?P<street>^[^0-9]*) (?P<building>.+?) (?P<door>.+?) (?P<apartment>.+?$)')

Out[78]: 
      street building door apartment
0  Iso Omena        8    a         2
1  Big Apple       19   21         7
2  Iso Omena        8    a         2
3  Iso Omena        8    a         2
4  Iso Omena        8    a         2
5  Iso Omena        8    a         2
6  Iso Omena        8    a         2
7  Iso Omena        8    a         2
8  Iso Omena        8    a         2
9  Iso Omena        8    a         2
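
The pattern above requires all four parts. If door and apartment can be missing, one option is to make those groups optional. This is a sketch, not the answerer's code; it assumes the building number is always present and starts with a digit, and the sample addresses are illustrative.

```python
import pandas as pd

addresses = pd.Series(
    ["Iso Omena 8 a 2", "Big Apple 19 21 7", "Xstreet 2", "Asdf 7 c", "Xstret 2 a 1 77"]
)

pattern = (
    r"^(?P<street>\D+?) +"          # non-digit text before the building number
    r"(?P<building>\d\S*)"          # required, must start with a digit
    r"(?: +(?P<door>\S+))?"         # optional
    r"(?: +(?P<apartment>\S+))?$"   # optional
)
out = addresses.str.extract(pattern)
print(out)
```

An address with a fifth element, such as the last row, does not match this pattern and yields missing values in every column, so those rows would still need separate handling.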

Actually, we do not always have the most complete format. Sometimes there is only a street name and a building number, for example `Xstreet 2`, or very rarely a fifth element as well, for example `Xstret 2 a 1 77`. So what I am trying to say is that we do not always have four address elements. — chessosapiens, May 16 '18 at 08:06

score 1 · Answer 4 · answered May 15 '18 at 09:33

Something like this?

import re

addr = "Iso Omena 8 a 2"

pattern = r'[a-öA-Ö]{3,100} *[a-öA-Ö]{3,100}'
street = re.findall(pattern, addr)[0]

bda = addr[len(street):].split()
print(street, bda,addr[len(street):])
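
Two things to watch with this pattern: it needs two runs of at least three characters each, so a short one-word street such as "Asdf 7 c" gives no match and `[0]` raises an `IndexError`, and the ranges `a-ö` and `A-Ö` also cover some non-letter characters such as `{` and `|`. A possible variant, sketched below with a hypothetical helper `split_street`, uses `[^\W\d_]` to match Unicode letters and accepts any number of words.

```python
import re

# [^\W\d_] matches any Unicode letter: a word character that is neither a digit nor "_".
STREET = re.compile(r"[^\W\d_]+(?: +[^\W\d_]+)*")

def split_street(addr):
    """Hypothetical helper: return (street, list of remaining tokens)."""
    m = STREET.match(addr)
    street = m.group() if m else ""
    return street, addr[len(street):].split()

for addr in ["Iso Omena 8 a 2", "Asdf 7 c", "Isö Ømenå 8 a 2"]:
    print(split_street(addr))
```

Because the match stops at the first token that is not purely letters, the door letter after the building number is not absorbed into the street name.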

Mika72
