0

df with imperfect data:

df = pd.DataFrame({'A Surname' : ['Smith', 'Longshore', 'Jones'], 
                       'A Title': ['Mr', 'Miss', np.nan],
                       'B Surname' : ['Smith', np.nan, 'Nguyen'], 
                       'B Title': ['Mrs', np.nan, np.nan]})

I'm after a column containing a string appropriate to address both A and B where possible. Combining fields returns np.nan if any are np.nan, and it needs to be logical (eg. don't use 'B Title' if 'B Surname' is np.nan), so I need a series of rules to determine the most appropriate combination. My unsuccessful approach:

def combined(x):
    full = df['A Title'] + ' ' & df['A Surname'] & ' & ' & df['B Title'] & ' ' & df['B Surname']
    no_title = df['A Surname'] & ' & ' & df['B Surname']
    # more combinations
    if full != np.nan:
        return full
    elif no_title != np.nan:
        return no_title
    # more elifs
    else:
        return df['A Surname']
        
df['combined string'] = np.nan
df['combined string'] = df['combined string'].apply(combined)

The desired output would look like:

desired_df = pd.DataFrame({'A Surname' : ['Smith', 'Longshore', 'Jones'], 
                       'A Title': ['Mr', 'Miss', 'Mr'],
                       'B Surname' : ['Smith', np.nan, 'Whatever'], 
                       'B Title': ['Mrs', np.nan, np.nan],
                       'combined string': ['Mr Smith & Mrs Smith', 'Miss Longshore', 'Jones & Whatever']})

What's a practical way to do this?

  • Check this [stackoverflow question about this `ValueError`](https://stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o) . Also, you should add what would be the expected output, following [stackoverflow's guideline](https://stackoverflow.com/help/how-to-ask) – Felipe Whitaker Oct 22 '20 at 03:26

2 Answers2

3

Use Series.str.cat with Series.str.strip here:

a = df['A Title'].str.cat(df['A Surname'], sep=' ', na_rep='').str.strip()
b = df['B Title'].str.cat(df['B Surname'], sep=' ', na_rep='').str.strip()
df['combined string'] = a.str.cat(b, sep=' & ').str.strip(' &')
print (df)
   A Surname A Title B Surname B Title       combined string
0      Smith      Mr     Smith     Mrs  Mr Smith & Mrs Smith
1  Longshore    Miss       NaN     NaN        Miss Longshore
2      Jones     NaN    Nguyen     NaN        Jones & Nguyen
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Figured out a solution that applies the function with axis=1, allowing references to other columns in the function.

def combined(x):
    if x['B Surname'] == '':
        if x['A Title'] == '':
            return x['A Surname']
        else:
            return x['A Title'] + ' ' + x['A Surname']
    else:
        if (x['A Title'] == '') or (x['B Title'] == ''):
            return x['A Surname'] + ' & ' + x['B Surname']
        else:
            return x['A Title'] + ' ' +  x['A Surname'] + ' & ' + x['B Title'] + ' ' + x['B Surname']


df = df.fillna('')
df['combined string'] = df.apply(combined, axis=1)

np.nan values had to be converted to empty strings, as I couldn't figure out how to avoid turning np.nans into 'nan' strings or reassign column values within the function.

Note df = df.fillna('') applies to entire df. All empty strings can then be turned into nan through df = df.replace(r'^\s*$', np.nan, regex=True)