I want to check whether a column entry matches a city in a list of cities (region). If there is a match, I want to write the region's zip code string (region_name) into another column; if there is no match, I want to keep the current column value.

A review of cases

I tried a new library (modin) and made a few changes (including installing pylint as prompted by a popup) and afterward, replace() no longer worked with a column.

import re
import pandas as pd

df = pd.DataFrame({'city_nm': ['Cupertino', 'Mountain View', 'Palo Alto'], 'zip_cd': ['95014', False, '94306']})
region_name = '99999'
region = ['Cupertino', 'Mountain View', 'Palo Alto']

def InferZipcodeFromCityName(df, region, region_name):
    PATTERN_CITY = '|'.join(region)
    # True where the zip code is missing (False) and the city is in the region
    foundZipbyCity = (
        (df['zip_cd'] == False) &
        (df['city_nm'].str.contains(PATTERN_CITY, flags=re.IGNORECASE))
        )
    # this is the line that raises the AssertionError
    df['zip_cd'] = foundZipbyCity.replace( (True,False), (region_name, df['zip_cd']) )
    return df

#this is what I want
In[1]: df = InferZipcodeFromCityName(df, region, region_name)
Out[1]: 
   city_nm          zip_cd
0  'Cupertino'      '95014'
1  'Mountain View'  '99999'
2  'Palo Alto'      '94306'

#this is what I get --> AssertionError

try 1: df['zip_cd'] = foundZipbyCity.replace( (True,False), (region_name, df['zip_cd']), regex = False )  #AssertionError
try 2: df['zip_cd'] = foundZipbyCity.replace( (True,False), (region_name, region_name) ) #changed to (string,string) and it runs without error; however, it does nothing useful

EDIT: On a second and a third laptop, I installed Anaconda and VS Code and the code works fine. On the first laptop, I uninstalled Anaconda and VS Code and reinstalled them, with no effect. (This laptop ran this code without problems for a year, up until I tried the modin library. That is probably a coincidence, but still.)

forest.peterson
  • Try setting `regex=False` explicitly in replace, e.g. `df['zip_cd'] = foundZipbyCity.replace((True, False), (region_name, df['zip_cd']), regex=False)` – forgetso Nov 15 '20 at 10:16
  • Do you have a bigger stack trace than this? It might give details of the specific pandas internal function that went wrong. – forgetso Nov 15 '20 at 10:25
  • The fact that you can't even see pandas in the stack trace is suspicious. Maybe pylint is somehow killing the process before pandas is even run. – forgetso Nov 15 '20 at 10:51

1 Answer

The problem is that you expect this statement to take the replacement for each False value from the matching row of df["zip_cd"]:

df['zip_cd'] = foundZipbyCity.replace( (True, False), (region_name, df['zip_cd']) )

However, that is not what happens. replace() tries to map the scalar False to the whole Series (False -> df["zip_cd"]), and pandas seems to fail when asked to replace a scalar with a Series.
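For illustration, one way to express "use region_name where the mask is True, otherwise keep the existing value" row by row is Series.mask. This is only a minimal sketch built on the data from the question; the names pattern and found are introduced here for the example and are not from the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'city_nm': ['Cupertino', 'Mountain View', 'Palo Alto'],
    'zip_cd': ['95014', False, '94306'],
})
region_name = '99999'
region = ['Cupertino', 'Mountain View', 'Palo Alto']

# Boolean mask: rows whose zip code is missing (False) and whose city is in the region.
# `pattern` and `found` are illustrative names, not part of the question's code.
pattern = '|'.join(region)
found = (df['zip_cd'] == False) & df['city_nm'].str.contains(pattern, case=False)

# Where the mask is True use region_name; everywhere else keep the current value.
df['zip_cd'] = df['zip_cd'].mask(found, region_name)
print(df)
```

Unlike replace(), mask() aligns the boolean Series with df['zip_cd'] by index, so each row keeps its own value when the condition is False.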

What you probably want to do here is set every value in df["zip_cd"] that satisfies the foundZipbyCity mask to region_name:

df["zip_cd"][foundZipbyCity] = region_name

I ran your code with this change and it produced the expected result.
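Chained indexing such as df["zip_cd"][foundZipbyCity] = ... can trigger pandas' SettingWithCopyWarning. A possible variant that does the same assignment in a single .loc call is sketched below. The function name infer_zipcode_from_city_name is a hypothetical rename for this example, and the re.escape call is an added assumption to guard against city names containing regex characters.

```python
import re
import pandas as pd

def infer_zipcode_from_city_name(df, region, region_name):
    """Fill missing zip codes (stored as False) for cities in `region`."""
    # Escape each city name so it is matched literally inside the regex.
    pattern = '|'.join(re.escape(city) for city in region)
    found = (
        (df['zip_cd'] == False)
        & df['city_nm'].str.contains(pattern, flags=re.IGNORECASE)
    )
    # One .loc call selects rows and column together, avoiding chained indexing.
    df.loc[found, 'zip_cd'] = region_name
    return df

df = pd.DataFrame({
    'city_nm': ['Cupertino', 'Mountain View', 'Palo Alto'],
    'zip_cd': ['95014', False, '94306'],
})
result = infer_zipcode_from_city_name(
    df, ['Cupertino', 'Mountain View', 'Palo Alto'], '99999'
)
print(result)
```

Note that this sketch modifies df in place and also returns it, matching the calling convention used in the question.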

Dmitry Chigarev
  • Thank you. I'm not sure what the runtime of this line of code was, however, I used it a dozen times and your version should be 2x faster--I still wonder why my code worked on my 'left side' MS Surface but not on my 'right side' MS Surface: luke the spook at work – forest.peterson Nov 17 '20 at 05:31
  • This warning is shown each time: See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy – forest.peterson Nov 17 '20 at 06:17