3

Here is a dataframe of financial instrument identifiers

import pandas as pd
import numpy as np

df = pd.DataFrame([["ISIN1", "CUSIP1", "SEDOL1"], 
                  ["ISIN2", "CUSIP2", "SEDOL2"], 
                  ["ISIN3", "CUSIP3", "SEDOL3"], 
                  ["ISIN4", "CUSIP4", "SEDOL4"]], 
                  columns=["ISIN", "CUSIP", "SEDOL"])

df

    ISIN    CUSIP   SEDOL
0   ISIN1   CUSIP1  SEDOL1
1   ISIN2   CUSIP2  SEDOL2
2   ISIN3   CUSIP3  SEDOL3
3   ISIN4   CUSIP4  SEDOL4

Imagine that several entries are missing

df.iloc[(1,1)]  = np.nan
df.iloc[(1,2)]  = np.nan
df.iloc[(2,0)]  = np.nan
df.iloc[(3,0)]  = np.nan
df.iloc[(3,1)]  = np.nan
df

    ISIN    CUSIP   SEDOL
0   ISIN1   CUSIP1  SEDOL3
1   ISIN2   NaN     NaN
2   NaN     CUSIP3  SEDOL3
3   NaN     NaN     SEDOL4

In column ID I want to capture a single variable based on this hierarchy : If ISIN is missing, I want to populate CUSIP. If CUSIP is also missing I want to populate SEDOL.

I tried this nested if statement:

def identifier(row):

    if ~pd.isnull(row['ISIN']):
        return row['ISIN']
    elif pd.isnull(row['ISIN']) & ~pd.isnull(row['CUSIP']):
        return row['CUSIP']
    elif pd.isnull(row['ISIN']) & pd.isnull(row['CUSIP']) & ~pd.isnull(row['SEDOL']):
        return row['SEDOL']

df['ID'] = df[['SEDOL', 'CUSIP', 'ISIN']].apply(identifier, axis=1)

Which returned bad output in the ID column for the last 2 entries.

    ISIN    CUSIP   SEDOL   ID
0   ISIN1   CUSIP1  SEDOL1  ISIN1
1   ISIN2   NaN     NaN     ISIN2
2   NaN     CUSIP3  SEDOL3  NaN
3   NaN     NaN     SEDOL4  NaN

My expected output is this:

    ISIN    CUSIP   SEDOL   ID
0   ISIN1   CUSIP1  SEDOL1  ISIN1
1   ISIN2   NaN     NaN     ISIN2
2   NaN     CUSIP3  SEDOL3  CUSIP3
3   NaN     NaN     SEDOL4  SEDOL4

Hope i've explained that well. Note that 'ISIN' is a string. There is no use of the .isin function in my code. Thank you in advance.

alanindublin
  • 101
  • 6

3 Answers3

3

In general, you can implement elif logic using np.select, which is detailed here and here.

In this case you can do this succinctly with lookup + notnull().idxmax to find the first non-null value in each row. I added an extra all NaN row to show how that is treated.

df['ID'] = df.lookup(df.index, df.notnull().idxmax(1))

#    ISIN   CUSIP   SEDOL      ID
#0  ISIN1  CUSIP1  SEDOL1   ISIN1
#1  ISIN2     NaN     NaN   ISIN2
#2    NaN  CUSIP3  SEDOL3  CUSIP3
#3    NaN     NaN  SEDOL4  SEDOL4
#4    NaN     NaN     NaN     NaN

To explain the issue with your original problem, it's the use of ~ with pd.isnull.

df['ISIN'].apply(lambda x: ~pd.isnull(x))
0   -1
1   -1
2   -2
3   -2
4   -2
Name: ISIN, dtype: int64

These are not 0, so they all evaluate to True, meaning for every row you take the 'ISIN' column. You should use pd.notnull (not pd.isnull also works) instead of ~pd.isnull

df['ISIN'].apply(lambda x: pd.notnull(x))
0     True
1     True
2    False
3    False
4    False
Name: ISIN, dtype: bool
ALollz
  • 57,915
  • 7
  • 66
  • 89
1

IIUC using bfill

df['ID']=df.bfill(1).iloc[:,0]
df
Out[346]: 
    ISIN   CUSIP   SEDOL      ID
0  ISIN1  CUSIP1  SEDOL3   ISIN1
1  ISIN2     NaN     NaN   ISIN2
2    NaN  CUSIP3  SEDOL3  CUSIP3
3    NaN     NaN  SEDOL4  SEDOL4
BENY
  • 317,841
  • 20
  • 164
  • 234
1
from functools import reduce
df.loc[:, 'ID'] = reduce(lambda c1, c2: c1.combine_first(c2), [df[c] for c in df])
Out[68]: 
    ISIN   CUSIP   SEDOL      ID
0  ISIN1  CUSIP1  SEDOL1   ISIN1
1  ISIN2     NaN     NaN   ISIN2
2    NaN  CUSIP3  SEDOL3  CUSIP3
3    NaN     NaN  SEDOL4  SEDOL4