I am trying to write a for loop that loops through a data frame and assigns either 0 or the first three digits of the given zip code depending on the population. My TA says I need to fix the second line to loop through the index rather than the length of the data frame but I am unsure how to move forward. Here's the question and my code.

"In this part, you should write a for loop, updating the df_users dataframe. Go through each user, and update their zip code to Safe Harbor specifications: If the user is from a zip code for which the "Geographic Subdivision" population is less than or equal to 20,000, change the zip code in df_users to '0' (as a string). Otherwise, zip should be only the first 3 numbers of the full zip code. Do all this by directly updating the zip column of the df_users DataFrame."

for item in range(0, len(df_users)):

    population = zip_dict[df_zip.loc[item, 'population']]
    if population <= 20000:
        df_users.loc[item, 'zip'] = '0'
    else: 
        new_zip = (df_users.loc[item, 'zip'])[:3]
        df_users.loc[item, 'zip'] = new_zip
Trenton McKinney
  • Can you please provide a sample of your input? What df_users and df_zip looks like. – NYC Coder May 12 '20 at 00:53
  • Please [provide a reproducible copy of the DataFrame with `df.to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard) – Trenton McKinney May 12 '20 at 01:10

1 Answer

Use .apply and np.where

  • Iterating over a pandas dataframe row by row with a for-loop is not advised; it is much slower than column-wise operations.
import pandas as pd
import numpy as np

# dataframe example
df = pd.DataFrame({'pop': [10000, 20000, 30000], 'zip': [12345, 97000, 87390]})

   pop    zip
 10000  12345
 20000  97000
 30000  87390

# update zip based on pop; the assignment says "less than or equal to 20,000", so use <=
df['zip'] = df.apply(lambda x: np.where(x['pop'] <= 20000, '0', str(x['zip'])[:3]), axis=1)

   pop  zip
 10000    0
 20000    0
 30000  873
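
The same update can also be written without .apply, by handing whole columns to np.where. This is only a sketch on the example dataframe above, using the "less than or equal to 20,000" threshold from the assignment; the name prefix is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# same example dataframe as above
df = pd.DataFrame({'pop': [10000, 20000, 30000], 'zip': [12345, 97000, 87390]})

# first three characters of every zip code, as strings
prefix = df['zip'].astype(str).str[:3]

# '0' where the population is at most 20,000, otherwise the three-digit prefix
df['zip'] = np.where(df['pop'] <= 20000, '0', prefix)
```

Note that if the zip codes are stored as integers, any leading zeros are already lost before the conversion to strings, so zip codes are usually better read in as strings.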

If you have to use a for-loop

  • You shouldn't; this is a pandas anti-pattern.
  • range(0, len(df_users)) yields the integers 0, 1, ..., len(df_users) - 1. However, the index labels may not be 0, 1, ... in order (for example, after rows have been filtered out). Because .loc looks rows up by label, a number that is not a label raises a KeyError. This is probably why you were instructed to iterate over df.index instead.
  • Presumably, the zip codes are numeric. (df_users.loc[item, 'zip'])[:3] can't be used with an int because integers cannot be sliced, which is why str(df.loc[i, 'zip'])[:3] is used. If the zip column already holds strings (object or str dtype), then df.loc[i, 'zip'][:3] works directly.
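# cast to str first: recent pandas versions warn or raise when a string is written into an integer column
df['zip'] = df['zip'].astype(str)

for i in df.index:  # iterate over the index labels, not range(len(df))
    pop = df.loc[i, 'pop']
    if pop <= 20000:  # "less than or equal to 20,000"
        df.loc[i, 'zip'] = '0'
    else:
        df.loc[i, 'zip'] = str(df.loc[i, 'zip'])[:3]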
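
To make the index point concrete, here is a minimal sketch closer to the question's setup. The data is made up, and zip_dict is assumed (this is a guess, not something shown in the question) to map the first three zip digits to the population of that subdivision. The index labels are deliberately not 0, 1, 2, so a positional counter fails with .loc while df_users.index works.

```python
import pandas as pd

# Hypothetical users whose index labels are 3, 7, 9 (e.g. rows were filtered earlier)
df_users = pd.DataFrame({'zip': ['12345', '97000', '87390']}, index=[3, 7, 9])

# Hypothetical lookup: first three zip digits -> population of that subdivision
zip_dict = {'123': 10000, '970': 20000, '873': 30000}

# A counter from range(len(df_users)) starts at 0, which is not a label here
try:
    df_users.loc[0, 'zip']
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Iterating over the index yields labels that .loc can always resolve
for i in df_users.index:
    prefix = str(df_users.loc[i, 'zip'])[:3]
    population = zip_dict[prefix]
    if population <= 20000:
        df_users.loc[i, 'zip'] = '0'
    else:
        df_users.loc[i, 'zip'] = prefix
```

If your zip_dict is keyed differently (for example by the full zip code), only the lookup line needs to change.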
Trenton McKinney