How to extract city,state from a column in pandas dataframe?

Question

I am working on a raw data and trying to extract city,state from an address column.

Address
xxx Richardson, TX
yyy Plano, TX
xxyy Wylie, TX WO-65758
zzz Waxahachie, TX WO-999786

I used splitting the last two elements in the column but how to go about for data like the 3rd and 4th row in a huge dataset containing 30k records?

Any number of ways. Having worked with (fought) addresses, and without and decent example, there probably isn't a reasonable way to answer this as it is (real-world addresses, especially if free-text, are sent to test us). If your example is reasonably simple, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit accordingly. — roganjosh, Feb 07 '18 at 23:36

score 1 · Answer 1 · answered Feb 08 '18 at 15:36

1

Rather than reinvent the wheel, I'd look into using an existing address-parsing library. There are more than one, so you may need to do some comparisons. https://github.com/datamade/usaddress is one I've used in the past.

answered Feb 08 '18 at 15:36

Silenced Temporarily

944
9
21

pault · Answer 2 · 2018-02-08T14:49:13.430

Option 1: Split on comma

Could it be as simple as splitting the string on a comma, and then taking the last token/first token around the split?

addresses = ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758"]
for a in addresses:
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX

Example

If you had the following DataFrame:

df = pd.DataFrame(
    {
        'Address': [
            'xxx Richardson, TX',
            'yyy Plano, TX',
            'xxyy Wylie, TX WO-65758',
            'zzz Waxahachie, TX WO-999786'
        ]
    }
)

You can define the split function:

def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    return city, state

Then apply() it to the address column, which would return two new columns, and join() this back to the original DataFrame:

df.join(
    df['Address'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX

Option 2: Use Regex

If that doesn't work, how about matching using a regex pattern?

This one should work:

import re    
pattern = r"[A-Z][a-z]+,\s[A-Z]{2}"
for a in addresses:
  matches = re.finditer(pattern, a, re.MULTILINE)
  for match in matches:
      city, state = match.group().replace(",", "").split()
      print(", ".join([city, state])) 
#Richardson, TX
#Wylie, TX

Which matches:

[A-Z]: One capital letter
[a-z]+: Any number of lower case letters
,\s: Comma followed by a space
[A-Z]{2}: 2 Capital Letters

Demo on Regex101

Example

df.join(
    df['Address'].str.extract(
        r"((?P<City>[A-Z][a-z]+),\s(?P<State>[A-Z]{2}))",
        expand=False
    )[["City", "State"]]
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX

Notes

This won't work for city names that have a space in them, i.e. "San Antonio, TX".

Lambda · Accepted Answer · 2018-02-09T03:44:07.347

0

I don't quite understand what you want to get, just split the column and get the last two elements as the city and state? Maybe this code below can help you.

df["Address"].apply(lambda x: "".join(x.split()[1:]))

update: (I changed the data of row 2 to make it contain a space)

df2 = df["Address"].apply(lambda x: x.split(","))
city = df2.apply(lambda x: " ".join(x[0].split()[1:]))
state = df2.apply(lambda x: x[1].split()[0])
result = pd.DataFrame(zip(city, state), columns=["city", "state"])

result:

Out[13]:
     city state
0  Richardson    TX
1   Pla Plano    TX
2       Wylie    TX
3  Waxahachie    TX

edited Feb 09 '18 at 03:44

answered Feb 08 '18 at 07:00

Lambda

1,392
1
9
11

This doesn't work on the last 2 examples where there is extraneous data at the end of the string. OP specifically mentioned that they tried this but it didn't work. – pault Feb 08 '18 at 14:46
How do you define extraneous data? – Lambda Feb 09 '18 at 03:12
OP wants city and state. Your solution on the last string `"zzz Waxahachie, TX WO-999786"` returns `"Waxahachie,TXWO-999786"`. – pault Feb 09 '18 at 03:15
I mean you should tell us what you want to get, such as "TX WO-999786" should return "TX". I update my answer, hope it can help you. – Lambda Feb 09 '18 at 03:48
Sure this works now but it's essentially the same thing as my answer. I think _"trying to extract city,state"_ was pretty clear. – pault Feb 09 '18 at 13:58

How to extract city,state from a column in pandas dataframe?

3 Answers3

Option 1: Split on comma

Option 2: Use Regex