3

I have two columns of dates need to be compared, date1 is a list of certain dates, date2 is random date (dob). I need to compare month and day by some conditon to make a flag. sample like:

df_sample = DataFrame({'date1':('2015-01-15','2015-01-15','2015-03-15','2015-04-15','2015-05-15'),
                       'dob':('1999-01-25','1987-12-12','1965-03-02','2000-08-02','1992-05-15')}

I create a function based on condition below

def eligible(date1,dob):
  if date1.month - dob.month==0 and date1.day <= dob.day:
    return 'Y'
  elif date1.month - dob.month==1 and date1.day > dob.day:
    return 'Y'
  else:
    return 'N'

I want to apply this function to orginal df which has more than 5M rows, hence for loop is not efficiency, is there any way to achieve this?

Datatype is date, not datetime

DYZ
  • 55,249
  • 10
  • 64
  • 93
Baiii
  • 99
  • 3
  • 8

2 Answers2

3

I think you need numpy.where with conditions chained by | (or):

df_sample['date1'] = pd.to_datetime(df_sample['date1'])
df_sample['dob'] = pd.to_datetime(df_sample['dob'])

months_diff = df_sample.date1.dt.month  - df_sample.dob.dt.month
days_date1 = df_sample.date1.dt.day
days_dob = df_sample.dob.dt.day

m1 = (months_diff==0) & (days_date1 <= days_dob)
m2 = (months_diff==1) & (days_date1 > days_dob)

df_sample['out'] = np.where(m1 | m2 ,'Y','N')
print (df_sample)
       date1        dob out
0 2015-01-15 1999-01-25   Y
1 2015-01-15 1987-12-12   N
2 2015-03-15 1965-03-02   N
3 2015-04-15 2000-08-02   N
4 2015-05-15 1992-05-15   Y
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks, if I prefer date datatype, is there any way to achieve this? – Baiii Sep 08 '17 at 07:04
  • Yes, it is possible. but pandas native type is datetime, so date can be slowier and not possible use `dt.*` functions. But if need date as last step use `df_sample['date1'] = df_sample['date1'].dt.date` – jezrael Sep 08 '17 at 07:06
0

Using datetime is certainly beneficial:

df_sample['dob'] = pd.to_datetime(df_sample['dob'])
df_sample['date1'] = pd.to_datetime(df_sample['date1'])

Once you have it, your formula can be literally applied to all rows:

df_sample['eligible'] = 
              (  (df_sample.date1.dt.month == df_sample.dob.dt.month)\
               & (df_sample.date1.dt.day <= df_sample.dob.dt.day)) |\
              (  (df_sample.date1.dt.month - df_sample.dob.dt.month == 1)\
               & (df_sample.date1.dt.day > df_sample.dob.dt.day))

The result is boolean (True/False), but you can easily convert it to "Y"/"N", if you want.

DYZ
  • 55,249
  • 10
  • 64
  • 93