1

I have some addresses that I would like to clean.

You can see that in column address1, we have some entries that are just numbers, where they should be numbers and street names like the first three rows.

df = pd.DataFrame({'address1':['15 Main Street','10 High Street','5 Other Street',np.nan,'15','12'],
                  'address2':['New York','LA','London','Tokyo','Grove Street','Garden Street']})

print(df)

         address1       address2
0  15 Main Street       New York
1  10 High Street             LA
2  5 Other Street         London
3             NaN          Tokyo
4              15   Grove Street
5              12  Garden Street

I'm trying to create a function that will check if address1 is a number, and if so, concat address1 and street name from address2, then delete address2.

My expected output is this. We can see index 4 and 5 now have complete address1 entries:

           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN <---
5  12 Garden Street       NaN <---

What I have tried with the .apply() function:

def f(x):

    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):

            # create new address using address1 + address 2
            newaddress = str(x['address1']) +' '+ str(x['address2'])

            # delete address2
            x['address2'] = np.nan

            # return newaddress to address1 column
            return newadress

    except:
        pass

Applying the function:

df['address1'] = df.apply(f,axis=1)

However, the column address1 is now all None.

I've tried a few variations of this function but can't get it to work. Would appreciate advice.

SCool
  • 3,104
  • 4
  • 21
  • 49

3 Answers3

1

You can create a mask and update:

mask = pd.to_numeric(df.address1, errors='coerce').notna()
df.loc[mask, 'address1'] = df.loc[mask, 'address1'] + ' ' +df.loc[mask,'address2']
df.loc[mask, 'address2'] = np.nan

output:

           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

Try this

apply try except and convert address1 in int

def test(row):
    try:
        address = int(row['address1'])
        return 1
    except:
        return 0


df['address1'] = np.where(df['test']==1,df['address1']+ ' '+df['address2'],df['address1'])
df['address2'] = np.where(df['test']==1,np.nan,df['address2'])
df.drop(['test'],axis=1,inplace=True)
        address1    address2
0   15 Main Street    New York
1   10 High Street    LA
2   5 Other Street    London
3   NaN               Tokyo
4   15 Grove Street   NaN
5   12 Garden Street  NaN
tawab_shakeel
  • 3,701
  • 10
  • 26
1

You may avoid apply by using str.isdigit to pick exact rows need to modify. Create a mask m to identify these rows. Use agg on these rows and construct a sub-dataframe for these rows. Finally append back to original df

m = df.address1.astype(str).str.isdigit()
df1 = df[m].agg(' '.join, axis=1).to_frame('address1').assign(address2=np.nan)

Out[179]:
           address1  address2
4   15 Grove Street       NaN
5  12 Garden Street       NaN

Finally, append it back to df

df[~m].append(df1)

Out[200]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN

If you still insist to use apply, you need modify f to return outside of if to return non-modify rows together with modified rows

def f(x):
    y = x.copy()
    try:
        #if address1 is int
        if isinstance(int(x['address1']), int):

            # create new address using address1 + address 2
            y['address1'] = str(x['address1']) +' '+ str(x['address2'])

            # delete address2
            y['address2'] = np.nan
    except:
        pass

    return y


df.apply(f, axis=1)

Out[213]:
           address1  address2
0    15 Main Street  New York
1    10 High Street        LA
2    5 Other Street    London
3               NaN     Tokyo
4   15 Grove Street       NaN
5  12 Garden Street       NaN

Note: it is reccommended that apply should not modify the passed object, so I do y = x.copy() and modify and return y

Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • Thanks. Whyis it recommended that apply should not modify the passed object? Do you have some info I can read up on? – SCool Jul 18 '19 at 21:38
  • 1
    @SCool: it may sometimes return unpredictable result as in this thread: https://stackoverflow.com/questions/56961451/pandas-groupby-apply-has-different-behaviour-with-int-and-string-types – Andy L. Jul 18 '19 at 22:10