I have a list of all US cities (150k+) in a pandas Series like:

import pandas as pd
master_city = pd.Series(['Lake Ketchum', 'Arletta', 'Glenoma', ..., 'Curlew'])

I have another Series that contains a list of addresses like:

addresses = pd.Series(['Headquarters 1120 N Street Lake Ketchum 916-654-5266', 'District 1 1656 Union Street Glenoma 707-445-6600', '1657 Riverside Drive Redding, CA 96001'])

I want to check whether each address in the addresses Series contains an exact match for any of the cities in the master_city Series. The goal is to validate that the city name is correct in every address. In this case, addresses 1 and 2 should match, as they contain exact matches for Lake Ketchum and Glenoma.

Can this be done with any Series string method in a vectorised way?

Mithun Manohar
  • Can you paste the code you are trying? Maybe that way you will get an exact answer. – Karn Kumar Oct 28 '18 at 05:51
  • Since performance is your main concern, I suggest you consider specialised libraries such as [this answer](https://stackoverflow.com/a/48600345/9209546) in the marked duplicate. – jpp Oct 28 '18 at 17:42

1 Answer

I think that for an exact match in a not-so-complicated situation, you can try something like:

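for item in master_city:
    # regex=False treats the city name as a literal substring, not as a regex pattern
    matches = addresses[addresses.str.contains(item, regex=False)]
    # matches is a pd.Series of the addresses (with their original indices) that contain `item`
    # do whatever with matches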
khan
  • I have a series of 5k addresses, and looping over 150k cities is a time-consuming process. So I want to know whether any vectorised operation can be used. – Mithun Manohar Oct 28 '18 at 06:06
  • You have to at least provide the data if you want an exact answer! – Karn Kumar Oct 28 '18 at 06:44
  • I am not able to share the data as it's confidential. But the master_city and addresses in the question are representative of my exact data. – Mithun Manohar Oct 28 '18 at 07:43
  • Afaik as long as your addresses are monolithic strings, i.e. you don't have access to/can't create a version, which provides street names, city names, zip codes etc. in separate columns of a dataset, you'll have to stick with looping... – SpghttCd Oct 28 '18 at 08:37
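
One way to avoid scanning all 150k cities for every address might be to turn the lookup around: split each address into words and check runs of consecutive words against a set of city names. This is only a sketch, not a truly vectorised pandas operation. The helper names `build_city_index` and `find_city` are hypothetical, plain lists stand in for the two Series, and it assumes that words are separated by whitespace or commas and that matching is case-sensitive.

```python
import re

def build_city_index(cities):
    """Store each city as a tuple of its words; also return the longest word count."""
    index = {tuple(city.split()) for city in cities}
    max_words = max((len(key) for key in index), default=0)
    return index, max_words

def find_city(address, index, max_words):
    """Return the first run of consecutive words that is a known city, else None."""
    # Assumption: whitespace and commas are the only separators in an address.
    tokens = [t for t in re.split(r"[\s,]+", address) if t]
    # Try longer runs first so a multi-word city such as 'Lake Ketchum'
    # is preferred over any single-word city it might contain.
    for n in range(max_words, 0, -1):
        for i in range(len(tokens) - n + 1):
            candidate = tuple(tokens[i:i + n])
            if candidate in index:
                return " ".join(candidate)
    return None

# Small stand-ins for the real data (the full city list is not reproduced here).
master_city = ['Lake Ketchum', 'Arletta', 'Glenoma', 'Curlew']
addresses = [
    'Headquarters 1120 N Street Lake Ketchum 916-654-5266',
    'District 1 1656 Union Street Glenoma 707-445-6600',
    '1657 Riverside Drive Redding, CA 96001',
]

index, max_words = build_city_index(master_city)
matches = [find_city(address, index, max_words) for address in addresses]
print(matches)
```

With real pandas data, the same function could be applied as `addresses.map(lambda a: find_city(a, index, max_words))`, and `.notna()` on the result would give a validation mask. The work per address then depends on the number of words in the address and the longest city name, not on the size of the city list.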