Option 1: Split on comma
Could it be as simple as splitting the string on a comma, and then taking the last token/first token around the split?
addresses = ["xxx Richardson, TX", "xxyy Wylie, TX WO-65758"]
for a in addresses:
asplit = a.split(",")
city = asplit[0].split()[-1]
state = asplit[1].split()[0]
print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
Example
If you had the following DataFrame:
df = pd.DataFrame(
{
'Address': [
'xxx Richardson, TX',
'yyy Plano, TX',
'xxyy Wylie, TX WO-65758',
'zzz Waxahachie, TX WO-999786'
]
}
)
You can define the split function:
def extract_city_state(a):
asplit = a.split(",")
city = asplit[0].split()[-1]
state = asplit[1].split()[0]
return city, state
Then apply()
it to the address column, which would return two new columns, and join()
this back to the original DataFrame:
df.join(
df['Address'].apply(
lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
)
)
# Address City State
#0 xxx Richardson, TX Richardson TX
#1 yyy Plano, TX Plano TX
#2 xxyy Wylie, TX WO-65758 Wylie TX
#3 zzz Waxahachie, TX WO-999786 Waxahachie TX
Option 2: Use Regex
If that doesn't work, how about matching using a regex pattern?
This one should work:
import re
pattern = r"[A-Z][a-z]+,\s[A-Z]{2}"
for a in addresses:
matches = re.finditer(pattern, a, re.MULTILINE)
for match in matches:
city, state = match.group().replace(",", "").split()
print(", ".join([city, state]))
#Richardson, TX
#Wylie, TX
Which matches:
[A-Z]
: One capital letter
[a-z]+
: Any number of lower case letters
,\s
: Comma followed by a space
[A-Z]{2}
: 2 Capital Letters
Demo on Regex101
Example
df.join(
df['Address'].str.extract(
r"((?P<City>[A-Z][a-z]+),\s(?P<State>[A-Z]{2}))",
expand=False
)[["City", "State"]]
)
# Address City State
#0 xxx Richardson, TX Richardson TX
#1 yyy Plano, TX Plano TX
#2 xxyy Wylie, TX WO-65758 Wylie TX
#3 zzz Waxahachie, TX WO-999786 Waxahachie TX
Notes
- This won't work for city names that have a space in them, i.e. "San Antonio, TX".