3

I am working on a raw data and trying to extract city,state from an address column.

Address
xxx Richardson, TX
yyy Plano, TX
xxyy Wylie, TX WO-65758
zzz Waxahachie, TX WO-999786

I used splitting the last two elements in the column but how to go about for data like the 3rd and 4th row in a huge dataset containing 30k records?

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
Aruna J
  • 33
  • 1
  • 3
  • 4
    Any number of ways. Having worked with (fought) addresses, and without and decent example, there probably isn't a reasonable way to answer this as it is (real-world addresses, especially if free-text, are sent to test us). If your example is reasonably simple, please see [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit accordingly. – roganjosh Feb 07 '18 at 23:36
  • what's the difference between 3rd and 4th type of address? – MaxU - stand with Ukraine Feb 07 '18 at 23:39

3 Answers3

1

Rather than reinvent the wheel, I'd look into using an existing address-parsing library. There are more than one, so you may need to do some comparisons. https://github.com/datamade/usaddress is one I've used in the past.

0

Option 1: Split on comma

Could it be as simple as splitting the string on a comma, and then taking the last token/first token around the split?

addresses = ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758"]
for a in addresses:
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX

Example

If you had the following DataFrame:

df = pd.DataFrame(
    {
        'Address': [
            'xxx Richardson, TX',
            'yyy Plano, TX',
            'xxyy Wylie, TX WO-65758',
            'zzz Waxahachie, TX WO-999786'
        ]
    }
)

You can define the split function:

def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[0].split()[-1]
    state = asplit[1].split()[0]
    return city, state

Then apply() it to the address column, which would return two new columns, and join() this back to the original DataFrame:

df.join(
    df['Address'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX

Option 2: Use Regex

If that doesn't work, how about matching using a regex pattern?

This one should work:

import re    
pattern = r"[A-Z][a-z]+,\s[A-Z]{2}"
for a in addresses:
  matches = re.finditer(pattern, a, re.MULTILINE)
  for match in matches:
      city, state = match.group().replace(",", "").split()
      print(", ".join([city, state])) 
#Richardson, TX
#Wylie, TX

Which matches:

  • [A-Z]: One capital letter
  • [a-z]+: Any number of lower case letters
  • ,\s: Comma followed by a space
  • [A-Z]{2}: 2 Capital Letters

Demo on Regex101


Example

df.join(
    df['Address'].str.extract(
        r"((?P<City>[A-Z][a-z]+),\s(?P<State>[A-Z]{2}))",
        expand=False
    )[["City", "State"]]
)
#                        Address        City State
#0            xxx Richardson, TX  Richardson    TX
#1                 yyy Plano, TX       Plano    TX
#2       xxyy Wylie, TX WO-65758       Wylie    TX
#3  zzz Waxahachie, TX WO-999786  Waxahachie    TX

Notes

  • This won't work for city names that have a space in them, i.e. "San Antonio, TX".
pault
  • 41,343
  • 15
  • 107
  • 149
0

I don't quite understand what you want to get, just split the column and get the last two elements as the city and state? Maybe this code below can help you.

df["Address"].apply(lambda x: "".join(x.split()[1:]))

update: (I changed the data of row 2 to make it contain a space)

df2 = df["Address"].apply(lambda x: x.split(","))
city = df2.apply(lambda x: " ".join(x[0].split()[1:]))
state = df2.apply(lambda x: x[1].split()[0])
result = pd.DataFrame(zip(city, state), columns=["city", "state"])

result:

Out[13]:
     city state
0  Richardson    TX
1   Pla Plano    TX
2       Wylie    TX
3  Waxahachie    TX
Lambda
  • 1,392
  • 1
  • 9
  • 11
  • This doesn't work on the last 2 examples where there is extraneous data at the end of the string. OP specifically mentioned that they tried this but it didn't work. – pault Feb 08 '18 at 14:46
  • How do you define extraneous data? – Lambda Feb 09 '18 at 03:12
  • OP wants city and state. Your solution on the last string `"zzz Waxahachie, TX WO-999786"` returns `"Waxahachie,TXWO-999786"`. – pault Feb 09 '18 at 03:15
  • I mean you should tell us what you want to get, such as "TX WO-999786" should return "TX". I update my answer, hope it can help you. – Lambda Feb 09 '18 at 03:48
  • Sure this works now but it's essentially the same thing as my answer. I think _"trying to extract city,state"_ was pretty clear. – pault Feb 09 '18 at 13:58