0

I have 2 pandas dataframes, both of them having the same columns but different line numbers depending on missing rows, one of the columns is the Date having the following format 29/09/2020 13.22.57 the day month year omitted sometimes below for simplicity and irrelevance The Dates might be an exact match in df as df_2 or there might be an acceptable delay of a threshold that we preset, in this case it's 2s.

Sample data of df['Date']:

13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
13.26.54

Sample data of df_2['Date']:

13.24.19    
13.24.35                        
13.25.23                        
13.26.13    
13.26.38

expected

df['Date']:

13.22.57    
13.23.13    
13.23.44    
13.24.02    
13.24.19    
13.24.35
0                       
13.25.23
0                       
13.26.13    
13.26.38



df_2['Date']:


13.24.19
13.24.35
13.25.07
13.25.23
13.26.00
13.26.13
0
13.26.54

the increment can happen for either df or df_2 depends which has a bigger time for the missing column and at the end both should have the same number of rows as the rows that are not matching will have a 0 value now and an increment will happen to the one under.

Dataframes:

d = {'Date': ['13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')

d2 = {'Date': ['13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')
sara
  • 109
  • 1
  • 7
  • can you provide the input data as DataFrame constructors for reproducibility? – mozway Mar 28 '22 at 15:11
  • @mozway sample DataFrame constructors added in the Question, the other columns don't matter, only Date does for my operation hopefully it would be useful. – sara Mar 28 '22 at 15:44
  • the test code of creating the data is not reproducible – Ran A Mar 28 '22 at 15:50
  • https://stackoverflow.com/questions/15888648/is-it-possible-to-insert-a-row-at-an-arbitrary-position-in-a-dataframe-using-pan check this answer it might help – Ran A Mar 28 '22 at 15:53
  • thanks @sara for the update, I provided an [answer](https://stackoverflow.com/a/71651834/16343464) – mozway Mar 28 '22 at 18:10

3 Answers3

1

This works for me: Note, Assumption is that len(df)>len(df_2)

a={"Date": [
"29-09-2020 13:22:57",
"29-09-2020 13:23:12",
"29-09-2020 13:23:44",
"29-09-2020 13:24:01",
"29-09-2020 13:24:19",
"29-09-2020 13:24:35",
"29-09-2020 13:25:07",
"29-09-2020 13:25:23",
"29-09-2020 13:26:00",
"29-09-2020 13:26:13",
"29-09-2020 13:26:54",
]}
b={"Date":[
    "29-09-2020 13:22:57",    
    "29-09-2020 13:23:13",    
    "29-09-2020 13:23:44",    
    "29-09-2020 13:24:02",    
    "29-09-2020 13:24:19",    
    "29-09-2020 13:24:35",                        
    "29-09-2020 13:25:23",                        
    "29-09-2020 13:26:13",    
    "29-09-2020 13:26:38",
]
}
df=pd.DataFrame(a)
df["Date"]=pd.to_datetime(df["Date"])
df_2=pd.DataFrame(b)
df_2["Date"]=pd.to_datetime(df_2["Date"])


def add_zero(dataframe,index,increment):
    dataframe.loc[index+increment]=0
    dataframe = dataframe.sort_index().reset_index(drop=True)
    return dataframe

flag=True
idx=0
while flag==True:
    if idx >= len(df_2["Date"]):
        df_2=add_zero(df_2,idx,0.5)
        break
    if idx >= len(df["Date"]):
        df=add_zero(df,idx,0.5)
        break
    print(idx)
    print(df['Date'][idx])
    print(df_2['Date'][idx])
    diff=datetime.timedelta.total_seconds(df['Date'][idx] - df_2['Date'][idx])
    print(f"Diff: {diff}")
    if diff > 2:
        df=add_zero(df,idx,-0.5)
        print("greater")
    elif diff < -2:
        df_2=add_zero(df_2,idx,-0.5)
        print("smaller")
    else:
        print("Acceptable")

    idx=idx+1

    if idx>=max(len(df_2),len(df)):
        flag=False

Output

    Date                Date2
0   2020-09-29 13:22:57 2020-09-29 13:22:57
1   2020-09-29 13:23:12 2020-09-29 13:23:13
2   2020-09-29 13:23:44 2020-09-29 13:23:44
3   2020-09-29 13:24:01 2020-09-29 13:24:02
4   2020-09-29 13:24:19 2020-09-29 13:24:19
5   2020-09-29 13:24:35 2020-09-29 13:24:35
6   2020-09-29 13:25:07 0
7   2020-09-29 13:25:23 2020-09-29 13:25:23
8   2020-09-29 13:26:00 0
9   2020-09-29 13:26:13 2020-09-29 13:26:13
10  0                   2020-09-29 13:26:38
11  2020-09-29 13:26:54 0
0

This should work for every dataframe:

df.loc[df.shape[0]] = [None for _ in range(len(df.columns))]
pbartkow
  • 126
  • 1
  • 10
  • Do you want to do that in real time or to apply on df? – pbartkow Mar 28 '22 at 16:21
  • Then you should use the [solution](https://stackoverflow.com/questions/15888648/is-it-possible-to-insert-a-row-at-an-arbitrary-position-in-a-dataframe-using-pan) mentioned above – pbartkow Mar 28 '22 at 16:44
0

IIUC, you could perform a double merge.

First a merge_asof with direction='nearest' and a tolerance of 2s, to align in the values of the second dataframe relative to the first one.

Then a classical outer merge to fill in the missing values from the second dataframe.

Finally, sort by Date using bfill to get a single column as reference.

NB. merge_asof required prior sorting of the two dataframes by Date.

df2 = (pd.merge_asof(df, df_2.add_suffix('_2'),
              left_on='Date', right_on='Date_2',
              direction='nearest', tolerance=pd.Timedelta('2s'))
   .merge(df_2[['Date']].add_suffix('_2'), on='Date_2', how='outer')
)
df2 = df2.loc[df2[['Date', 'Date_2']].bfill(axis=1)['Date'].sort_values().index]

output:

                  Date  col2              Date_2  col2_2
0  1900-01-01 13:22:57   1.0 1900-01-01 13:22:57     1.0
1  1900-01-01 13:23:12   2.0 1900-01-01 13:23:12     2.0
2  1900-01-01 13:23:44   3.0 1900-01-01 13:23:44     3.0
3  1900-01-01 13:24:01   4.0 1900-01-01 13:24:01     4.0
4  1900-01-01 13:24:19   5.0 1900-01-01 13:24:19     5.0
5  1900-01-01 13:24:35   6.0 1900-01-01 13:24:35     6.0
6  1900-01-01 13:25:07   7.0                 NaT     NaN
9  1900-01-01 13:25:23   8.0 1900-01-01 13:25:23     7.0
7  1900-01-01 13:26:00   9.0                 NaT     NaN
10 1900-01-01 13:26:13  10.0 1900-01-01 13:26:13     8.0
11                 NaT   NaN 1900-01-01 13:26:38     NaN
8  1900-01-01 13:26:54  11.0                 NaT     NaN

used input:

d = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.07', '13.25.23','13.26.00', '13.26.13','13.26.54'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df['Date'], format='%H.%M.%S')

d2 = {'Date': ['13.22.57', '13.23.12','13.23.44', '13.24.01','13.24.19', '13.24.35','13.25.23', '13.26.13','13.26.38'], 'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9]}
df_2 = pd.DataFrame(data=d2)
df_2['Date'] = pd.to_datetime(df_2['Date'], format='%H.%M.%S')
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Strangely i got an error on this line ' direction='nearest', tolerance=pd.Timedelta('2s'))', raise MergeError("key must be integer, timestamp or float") pandas.errors.MergeError: key must be integer, timestamp or float. – sara Mar 28 '22 at 19:22
  • Make sure to first test that you can reproduce the example with the code from the "*used input*" section. What is your pandas version? – mozway Mar 28 '22 at 19:24