Replace Nan in DataFrame based on value in Second DataFrame

Question

Seems like there are many DF questions with conditions from another DF but I can't find any that do what I need. Both dataframes are small samples. They are each thousands of columns wide. I have a DataFrame (df1) that looks like this

             IBM    BA      CAT     IBM EARN    BA EARN   CAT EARN
Date
1/22/2018   163.13  65.94   76.50     NaN        NaN       NaN
1/23/2018   163.17  65.94   76.51     NaN        NaN       NaN
1/24/2018   167.26  67.43   79.23     NaN        NaN       NaN
1/25/2018   166.28  67.77   80.57     NaN        NaN       NaN
1/26/2018   166.58  68.37   80.87     NaN        NaN       NaN
1/27/2018   166.77  68.87   81.07     NaN        NaN       NaN
1/28/2018   167.98  68.57   81.07     NaN        NaN       NaN
2/1/2018    167.98  68.77   81.59     NaN        NaN       NaN
2/2/2018    167.98  69.07   81.87     NaN        NaN       NaN

I have another dataframe (df2) with identical columns as the last three in df1 but with specific dates

    IBM EARN    BA EARN     CAT EARN
0   1/22/2018   2/1/2018    1/26/2018
1   10/19/2017  10/26/2017  10/25/2017
2   7/20/2017   7/27/2017   7/26/2017
3   4/20/2017   4/27/2017   4/26/2017
4   1/23/2017   1/26/2017   1/27/2017
5   10/19/2016  10/27/2016  10/26/2016
6   7/20/2016   7/28/2016   7/27/2016

I want to place a 1 in df1 where there is a corresponding date in df2. So the (partial) result would look like this but would continue for all dates list in df2.

             IBM     BA     CAT     IBM EARN    BA EARN   CAT EARN
Date
1/22/2018   163.13  65.94   76.50    **1**       NaN       NaN
1/23/2018   163.17  65.94   76.51     NaN        NaN       NaN
1/24/2018   167.26  67.43   79.23     NaN        NaN       NaN
1/25/2018   166.28  67.77   80.57     NaN        NaN       NaN
1/26/2018   166.58  68.37   80.87     NaN        NaN        **1**
1/27/2018   166.77  68.87   81.07     NaN        NaN       NaN
1/28/2018   167.98  68.57   81.07     NaN        NaN       NaN
2/1/2018    167.98  68.77   81.59     NaN        **1**     NaN
2/2/2018    167.98  69.07   81.87     NaN        NaN       NaN

Please let me know if you can help with a solution

I dont have the entire output unfortunately because it is more than 1000 lines but df1 has daily dates (as the index) and stock prices and I want to put a 1 in each instance from df2 where there is a date. So for IBM it would be 1/22/2018 (as shown) and then 10/19/17 then 7/20/17,etc. Same for the other 2 columns with the dates — J Westwood, Apr 12 '18 at 05:15
@JWestwood - I add also non loop solution, if you are interested. — jezrael, Apr 12 '18 at 05:51

jezrael · Accepted Answer · 2018-04-12T05:54:11.487

For each column of second DaatFrame check membership by Index.isin and with numpy.where replace values:

for col in df2.columns:
    df1[col] = np.where(df1.index.isin(df2[col]),1,np.nan)
print (df1)
              IBM     BA    CAT  IBM EARN  BA EARN  CAT EARN
Date                                                        
1/22/2018  163.13  65.94  76.50       1.0      NaN       NaN
1/23/2018  163.17  65.94  76.51       NaN      NaN       NaN
1/24/2018  167.26  67.43  79.23       NaN      NaN       NaN
1/25/2018  166.28  67.77  80.57       NaN      NaN       NaN
1/26/2018  166.58  68.37  80.87       NaN      NaN       1.0
1/27/2018  166.77  68.87  81.07       NaN      NaN       NaN
1/28/2018  167.98  68.57  81.07       NaN      NaN       NaN
2/1/2018   167.98  68.77  81.59       NaN      1.0       NaN
2/2/2018   167.98  69.07  81.87       NaN      NaN       NaN

EDIT:

Non loop solution with DataFrame.isin by dictionary of list created by df2 with cast boolean mask to integers:

#first create DataFrame by repeat index of df1
#https://stackoverflow.com/a/45118399
arr = np.broadcast_to(df1.index[:, None], (len(df1), len(df2.columns)))
df3 = pd.DataFrame(arr, columns=df2.columns, index=df1.index)

df3 = df3.isin(df2.to_dict('l')).astype(int)
print (df3)
           IBM EARN  BA EARN  CAT EARN
Date                                  
1/22/2018         1        0         0
1/23/2018         0        0         0
1/24/2018         0        0         0
1/25/2018         0        0         0
1/26/2018         0        0         1
1/27/2018         0        0         0
1/28/2018         0        0         0
2/1/2018          0        1         0
2/2/2018          0        0         0

df1 = df1.drop(df2.columns, 1).join(df3)
print (df1)
              IBM     BA    CAT  IBM EARN  BA EARN  CAT EARN
Date                                                        
1/22/2018  163.13  65.94  76.50         1        0         0
1/23/2018  163.17  65.94  76.51         0        0         0
1/24/2018  167.26  67.43  79.23         0        0         0
1/25/2018  166.28  67.77  80.57         0        0         0
1/26/2018  166.58  68.37  80.87         0        0         1
1/27/2018  166.77  68.87  81.07         0        0         0
1/28/2018  167.98  68.57  81.07         0        0         0
2/1/2018   167.98  68.77  81.59         0        1         0
2/2/2018   167.98  69.07  81.87         0        0         0

very nice solution without loops. – shivsn Apr 12 '18 at 05:56 — shivsn, Apr 12 '18 at 05:56

score 1 · Answer 2 · answered Apr 12 '18 at 05:16

you can try this as date is your index:

In [18]: df1['IBMEARN'] = np.where(df1.index.isin(df2.IBMEARN),1,0)

In [19]: df1['BAEARN'] = np.where(df1.index.isin(df2.BAEARN),1,0)

In [21]: df1['CATEARN'] = np.where(df1.index.isin(df2.CATEARN),1,0)
In [22]: df1
Out[22]: 
              IBM     BA    CAT  IBMEARN  BAEARN  CATEARN
DATE                                                     
1/22/2018  163.13  65.94  76.50        1       0        0
1/23/2018  163.17  65.94  76.51        0       0        0
1/24/2018  167.26  67.43  79.23        0       0        0
1/25/2018  166.28  67.77  80.57        0       0        0
1/26/2018  166.58  68.37  80.87        0       0        1
1/27/2018  166.77  68.87  81.07        0       0        0
1/28/2018  167.98  68.57  81.07        0       0        0
2/1/2018   167.98  68.77  81.59        0       1        0
2/2/2018   167.98  69.07  81.87        0       0        0

Yes that does work so it is correct but I have more than 3000 columns (stocks) so I guess I could loop through them and do it but was wondering if there was a more pythonic way to do all those columns rather than a loop. let me know if you can think of anything. ty — J Westwood, Apr 12 '18 at 05:24
I see loop posted from @jezreal. That seems simple enough. I think both answer are correct. I will mark as such if I can credit 2 answers — J Westwood, Apr 12 '18 at 05:28
@JWestwood I think jezreal answer is better you should accept that. — shivsn, Apr 12 '18 at 05:55

Replace Nan in DataFrame based on value in Second DataFrame

2 Answers2