0

Let's say I have some customer data. The data is generated by the customer and it's messy, so they put their city into either the city or county field or both! That means I may need to check both columns to find out which city they are from.

mydf = pd.DataFrame({'name':['jim','jon'],
                  'city':['new york',''],
                    'county':['','los angeles']})

print(mydf)

  name      city       county
0  jim  new york             
1  jon            los angeles

And I am using an api to get their zipcode. There is a different api function for each city, and it returns the zipcode for the customer's address, e.g. 123 main stret, new york. I haven't included the full address here to save time.

# api for new york addresses
def get_NY_zipcode_api():
    return 'abc123'

# api for chicago addresses
def get_CH_zipcode_api():
    return 'abc124'

# api for los angeles addresses
def get_LA_zipcode_api():
    return 'abc125'

# api for miami addresses
def get_MI_zipcode_api():
    return 'abc126'

Depending on the city, I will call a different api. So for now, I am checking if city == x or county ==x, call api_x:

def myfunc(row):

    city = row['city']
    county = row['county']

    if city == 'chicago' or county == 'chicago':

        # call chicago api
        zipcode = get_CH_zipcode_api()

        return zipcode

    elif city == 'new york' or county == 'new york':

        # call new york api
        zipcode = get_NY_zipcode_api()

        return zipcode

    elif city == 'los angeles' or county == 'los angeles':

        # call los angeles api        
        zipcode = get_LA_zipcode_api()

        return zipcode

    elif city == 'miami' or county == 'miami':

        # call miami api        
        zipcode = get_MI_zipcode_api()

        return zipcode       

And I apply() this to the df and get my results:

mydf['result'] = mydf.apply(myfunc,axis=1)

print(mydf)

  name      city       county  result
0  jim  new york               abc123
1  jon            los angeles  abc125

I actually have about 30 cities and therefore 30 conditions to check, so I want to avoid a long list of elif statments. What would be the most efficient way to do this?

I found some suggestions from a similar stack overflow question. Such as creating a dictionary with key:city and value:function and calling it based on city:

operationFuncs = {
    'chicago': get_CH_zipcode_api,
    'new york': get_NY_zipcode_api,
    'los angeles': get_LA_zipcode_api,
    'miami': get_MI_zipcode_api
}

But as far as I can see this only works if I am checking a single column / single condition. I can't see how it can work with if city == x or county == x

SCool
  • 3,104
  • 4
  • 21
  • 49
  • What about first creating a column merging the city and county and then use this new column to use the suggested dict? – Adirio Sep 04 '19 at 13:48
  • 2
    You can also use something like this: `func = operationFuncs[city] if city in operationFuncs.keys() else operationFuncs[county]` – Artog Sep 04 '19 at 13:50
  • or `key = city if city != "" else county` – Artog Sep 04 '19 at 13:53

5 Answers5

0
mydf['result'] = mydf.apply(lambda row : operationFuncs.get(row['county']) or operationFuncs.get(row['city']),axis=1)
0

I think you are referring to this. You can just perform this operation twice for city and county and save the result in two different variables, for each zipcode respectively. You can then compare the results and decide what to do if they differ (I am not sure if this can be the case with your dataset).

Since the dictionary-lookup is in O(1) and I assume your get_MI_zipcode_api() isn't any more expensive, this will have no real performance-drawbacks.

3ch0
  • 173
  • 1
  • 7
0

Maybe not the most elegant solution but you could use the dict approach and just call it twice, once on city and once on county. The second would overwrite the first but the same is true of your if block, and this would only be a problem if you had city='New York' county ='Chicago' for example which I assume cannot occur.

Or you could use the dict and iterate through it, this seems unnecessary though.

For key, fn in fdict:
    if key in (city,county):
        fn()

I'd do this join in SQL before reading in the data, I'm sure there's a way to do the same in Pandas, but I was trying to make suggestions that build on your existing research even if they are not the best.

Simon Notley
  • 2,070
  • 3
  • 12
  • 18
0

If it's guaranteed that the value will either be present in city or country & not in both, then you can merge both the columns together into one.

df['region'] = df['City'] + '' + df['Country']

Then create a mapping of region and pincode, instead of creating a mapping of city with api function. Since, there are only 30 unique values, you can once store the value of city with zipcodes rather than calling the zipcode functions each time, as making an api call is expensive.

mappings = {
'chicago': 'abc123',
'new york': 'abc234',
'los angeles': 'abc345',
'miami': 'abc456'}

Create a dataframe using this dictionary & then merge with the original dataframe

mappings_df = pd.DataFrame(list(mappings.items()), columns=['region', 'zipcode'])
df.merge(mappings_df, how='left', on='region')

Hope this helps!!

Aditya Mishra
  • 1,687
  • 2
  • 15
  • 24
  • "If it's guaranteed that the value will either be present in city or country & not in both" - unfortunately it could be in both also. – SCool Sep 04 '19 at 14:10
0

You need a relation table which can be represented by a dict.

df = pd.DataFrame({'name':['jim','jon'],
                  'city':['new york',''],
                    'county':['','los angeles']})

df['region'] = df['city']  + df['county']
table = {'new york': 'abc123', 'chicago': 'abc124', 'los angeles': 'abc125', 'miami': 'abc126'}
df['region'] = df.region.apply(lambda row: table[row])
print(df)

Output

  name      city       county  region
0  jim  new york               abc123
1  jon            los angeles  abc125
ComplicatedPhenomenon
  • 4,055
  • 2
  • 18
  • 45
  • the zipcode abc123 is an example. if i have 200 new york addresses they can't all have zipcode abc123. maybe i didn't explain clearly sorry. the api will search the customer's address, and return their unique zipcode. i didn't include a full address here `123 main street, xyz, abc etc.` just to keep it simple. – SCool Sep 04 '19 at 14:14
  • @SCool you need to edit your question, for now, the question is solved literally. – ComplicatedPhenomenon Sep 04 '19 at 14:19
  • @SCool if `city` and `county` are both filled, is it guaranteed that they have the same value? – ComplicatedPhenomenon Sep 04 '19 at 14:38
  • unfortunately not. This is messy user generated data. They could even put their phone number in there. However I have a list of roughly 30 location strings I am looking for. `if string x, do api_x ... if string y, do api_y`. I just need to search those two columns. – SCool Sep 04 '19 at 15:07
  • I'll add a new question with more detail tomorrow. – SCool Sep 04 '19 at 15:07