2

I have a dataframe(df1) as following:

       datetime     m  d    1d    2d   3d
       2014-01-01   1  1     2     2   3
       2014-01-02   1  2     3     4   3
       2014-01-03   1  3     1     2   3
       ...........
       2014-12-01  12  1      2     2   3
       2014-12-31  12  31     2     2   3

Also I have another dataframe(df2) as following:

       datetime     m  d       
       2015-01-02   1  2     
       2015-01-03   1  3     
       ...........
       2015-12-01  12  1      
       2015-12-31  12  31     

I want to merge the 1d 2d 3d columns value of df1 to df2. There are two conditions: (1) only m and d are the same in both df1 and df2 can merge. (2) if the index of df2 index % 30 ==0 don't merge, the value of 1d 2d 3d of these index can be Nan.

I mean I want the new dataframe of df2 like as following:

       datetime     m  d    1d    2d   3d
       2015-01-02   1  2   Nan     Nan   Nan
       2015-01-03   1  3     1     2   3
       ...........
       2015-12-01  12  1      2     2   3
       2015-12-31  12  31     2     2   3

Thanks in advance!

tktktk0711
  • 1,656
  • 7
  • 32
  • 59

3 Answers3

3

I think you need add NaNs by loc and then merge with left join:

np.random.seed(10)
N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())

N = 366
rng = pd.date_range('2016-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.tail())
df2 = df_te.copy()
df1 = df_tr_2014.copy()

df1 = df1.set_index('datetime')
df1.index += pd.offsets.DateOffset(years=1)

#correct 29 February
y = df1.index[0].year
df1 = df1.reindex(pd.date_range(pd.datetime(y,1,1), pd.datetime(y,12,31)))
idx = df1.index[(df1.index.month == 2) & (df1.index.day == 29)]

df1.loc[idx, :] = df1.loc[idx - pd.Timedelta(1, unit='d'), :].values
df1.loc[idx, 'day_m'] = idx.day
df1[['month','day_m']] = df1[['month','day_m']].astype(int)

df1[['7d','15d', '20d']] = df1[['7d','15d', '20d']].astype(float)

df1.loc[np.arange(len(df1.index))  % 30 == 0, ['7d','15d','20d']] = 0
df1 = df1.reset_index()
print (df1.iloc[57:62])
        index  month  day_m   7d  15d  20d
57 2016-02-27      2     27  2.0  0.0  1.0
58 2016-02-28      2     28  2.0  3.0  5.0
59 2016-02-29      2     29  2.0  3.0  5.0
60 2016-03-01      3      1  0.0  0.0  0.0
61 2016-03-02      3      2  7.0  6.0  9.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks, how about the leap year, if there is 29 in Feb 2016, there is no 29 in Feb 2015. If there is no 29 in Feb 2015. the 28 and 29 in Feb 2016 are the same – tktktk0711 Jun 27 '17 at 01:33
  • Hmmm, so you need 30 for each month without february (28,29)? Or need nans for last day of each month?(28,29,30,31)? – jezrael Jun 27 '17 at 03:42
  • thanks, I mean the year 2016 is leap year , the February have 29, but there is no Feb 29 in 2015. In this case, if there is no 29 in the last year, the data of 29 Feb is the same as data of 28 Feb. – tktktk0711 Jun 27 '17 at 05:29
  • OK, sorry I forget for your comment. January has 31 days, do you need select 30 day for it also? February understand. March is too 31... So need select 30 also? thanks. – jezrael Jun 27 '17 at 08:36
  • thanks for your kind comments. I mean consider the leap year(just consider Feb). For example 2016 is a leap year, if we want to get the data of a year ago of Feb 29 2016, namely Feb 29 of 2015. But there is no data for Feb 29 of 2015 since it is not a leap year. In this case, the data of a year ago of Feb 29 ,2016 is same with Feb 28 ,2016. – tktktk0711 Jun 27 '17 at 08:51
  • thanks, I have solved by f_tr2.index[i]- pd.offsets.DateOffset(years=1) which you have answered before. Anyway, thanks! – tktktk0711 Jun 29 '17 at 05:29
  • Thank you for comment. My main problem is I dont understand it, why is necessary substract one year and one day? Can you explain it? – jezrael Jun 29 '17 at 05:30
  • thanks I haven't explained in detail. But after you have checked my code in answer(I will post my answer), you will understand it. – tktktk0711 Jun 29 '17 at 06:05
  • Also maybe for your code is necessary use larger dataframe, be free use my data sample and change it if necessary. – jezrael Jun 29 '17 at 06:07
  • thanks,my code take more time . Could you optimalize it – tktktk0711 Jun 29 '17 at 06:09
  • But there are no 30 days what is in question. Can you explain it more? – jezrael Jun 29 '17 at 06:12
  • I have uploaded my new answers about this question, but it take a lot of more, if possible, you can optimize it since I used the loop – tktktk0711 Jun 29 '17 at 06:21
  • Can you test if sample data can be used for your solution - I add it to your answer. – jezrael Jun 29 '17 at 07:01
  • hi @jezrael, there is question:https://stackoverflow.com/questions/44946555/python2-7-dataframe-date-format-y-m-d-doesnt-work – tktktk0711 Jul 06 '17 at 10:46
  • hi @jezrael, there is question: https://stackoverflow.com/questions/45564867/python-2-7-shift-a-dataframe-by-day-and-a-column-value thanks! – tktktk0711 Aug 08 '17 at 09:42
  • Do you need `NaN`s? `df = df.shift(1, freq='D')` does not work? – jezrael Aug 08 '17 at 10:08
  • @cᴏʟᴅsᴘᴇᴇᴅ - Thanks :) – jezrael Mar 28 '18 at 07:30
1

Why don't you just remove the rows in df1 that don't match (m, d) pairs in df2?

df_new = df2.drop(df2[(not ((df2.m == df1.m) & (df2.n == df1.n)).any()) or (df2.index % 30 == 0)].index)

Or something along those lines.

Link to a related answer.

I'm not enormously familiar with Pandas and have not tested the above example.

Max von Hippel
  • 2,856
  • 3
  • 29
  • 46
0

df_te is df2 df_tr_2014 is df1 7d 15d 20 is 1d 2d 3d respectively in question. size_df_te is the length of df_te, month and day_m are m, d in df2

df_te['7d'] = 0
df_te['15d'] = 0
df_te['20d'] = 0
mj = 0
dj = 0
for i in range(size_df_te):
    if i%30 != 0:
        m = df_te.loc[i,'month']
        d = df_te.loc[i,'day_m']
        if (m== 2) & (d == 29):
            m = 2
            d = 28
        dk_7 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['7d']
        dk_15 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['15d']
        dk_20 = df_tr_2014.loc[(df_tr_2014['month']==m) & (df_tr_2014['day_m']==d)]['20d']
        df_te.loc[i,'7d'] = float(dk_7)
        df_te.loc[i,'15d'] = float(dk_15)
        df_te.loc[i,'20d'] = float(dk_20)

EDIT:

Sample data:

np.random.seed(10)
N = 365
rng = pd.date_range('2014-01-01', periods=N)
df_tr_2014 = pd.DataFrame(np.random.randint(10, size=(N, 3)), index=rng).reset_index()
df_tr_2014.columns = ['datetime','7d','15d','20d']
df_tr_2014.insert(1,'month', df_tr_2014['datetime'].dt.month)
df_tr_2014.insert(2,'day_m', df_tr_2014['datetime'].dt.day)
#print (df_tr_2014.head())

N = 365
rng = pd.date_range('2015-01-01', periods=N)
df_te = pd.DataFrame(index=rng)
df_te['month'] = df_te.index.month
df_te['day_m'] = df_te.index.day
df_te = df_te.reset_index()
#print (df_te.head())
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
tktktk0711
  • 1,656
  • 7
  • 32
  • 59