2

I have two location-based pandas DataFrames.

df1: Which has a column that consists of a full address, such as "Avon Road, Ealing, London, UK". The address varies in format.

df1.address[0] --> "Avon Road, Ealing, London, UK"

df2: Which just has cities of UK, such as "London".

df2.city[5] --> "London"

I want to locate the city of the first dataframe, given the full address. This would go on my first dataframe as such.

df1.city[0] --> "London"

Approach 1: For each city in df2, check if df1 has those cities and stores the indexs of df1 and the city of df2 in a list.

I am not certain how i would go about doing this, but I assume i would use this code to figure out if there is a partial string match and locate the index's:

df1['address'].str.contains("London",na=False).index.values  

Approach 2: For each df1 address, check if any of the words match the cities in df2 and store the value of df2 in a list.

I would assume this approach is more intuitive, but would it be computationally more expensive? Assume df1 has millions of addresses.

Apologies if this is a stupid or easy problem! Any direction to the most efficient code would be helpful :)

Meruem
  • 33
  • 4
  • See https://stackoverflow.com/questions/54756025/how-to-merge-pandas-on-string-contains – Tarik Jun 21 '20 at 20:00
  • for the second approah, see this [answer](https://stackoverflow.com/a/53043350/9274732) with `str.extract`, in which L would be for you `L=df2.city.tolist()` or something like this – Ben.T Jun 21 '20 at 20:07
  • How many rows `df1` and `df2` contains approximately? – Jérôme Richard Jun 21 '20 at 20:09
  • @JérômeRichard df1 contains around 300,000 rows (growing) and df2 contains 240 rows (fixed). – Meruem Jun 21 '20 at 20:24

1 Answers1

2

Approach 2 is indeed a good start. However, using a Python dictionary rather than a list should be much faster. Here is an example code:

cityIndex = set(df2.city)

addressLocations = []
for address in df1.address:
    location = None
    # Warning: ignore characters like '-' in the cities
    for word in re.findall(r'[a-zA-Z0-9]+', address):
        if word in cityIndex:
            location = word
            break
    addressLocations.append(location)
df1['city'] = addressLocations
Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59