0

How to compare and match beginning and end of two ranges of timestamps in two different dataframes, when the frequency of timestamps varies, and it is not known which range starts earlies and finishes later. Then discard unmatched beginning and end, so the two ranges are the same. Easy to do it manually in a txt file, how to do it in python and pandas dataframes?

Sample first dataframe:

                         0                          1
0      2022-10-30 14:11:57
1      2022-10-30 14:11:57
2      2022-10-30 14:11:57
3      2022-10-30 14:11:58
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

\[149806 rows x 2 columns\]

Sample second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11
21067 2022-10-30 15:22:12
21068 2022-10-30 15:22:13
21069 2022-10-30 15:22:13

Column 1 filled with data

Comparing two timestamps in a specific row would look like:

if first_df[0].iloc[0] == second_df[0].iloc[0]:
    print('hit')
else:
    print('miss')

How to do it over full range, so it would be possible to discard unmatched beginning and end while preserving what's inside?

Sample match of those two ranges: First dataframe:

                         0                          1
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

Second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11

Edit:

Consider this code (note that frequency of timestamps in each dataframe is different):

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'val_1' : [10,11,12,13,14,15]}, 
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      '2022-11-12 09:04:01',
                                      '2022-11-12 09:04:02' 
                                      ] ])

df2 = pd.DataFrame({'val_2': [11,22,33,44]},
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:58',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      ] ])

What I would like as result is this:

                     val_1  val_2
2022-11-12 09:03:59     10    NaN
2022-11-12 09:03:59     11     22
2022-11-12 09:03:59     12     33
2022-11-12 09:04:00     13     44

or: df1:

2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

and df2:

2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44

Tried both join and merge with probably every combination of options and can't do that.

  • Please show us a [minimal reproducable example](https://stackoverflow.com/help/minimal-reproducible-example) and its desired output. In your example there no matches at all. What is you goal? Do you want to merge them or just compare them and cut the edges ? – Rabinzel Nov 26 '22 at 20:44
  • Edited to show the desired result. Not merging, compare and trim the edges falling out of common range. – chairman's_cat Nov 26 '22 at 21:04
  • 1
    You really should have another look at the link I provided, a MRE should be as the word says MINIMAL, provide example data for like 10 rows in each df (like my input) and also we need some data in your 1st column. The data you provided can't be run as it is. Since you want to get help, you should show some effort and provide the data as good as possible, so we can just take it and try to reproduce your problem. Here some more information on how to [ask a good question at SO](https://stackoverflow.com/help/how-to-ask) – Rabinzel Nov 26 '22 at 21:19

1 Answers1

0

New answer on the new example data:

The problem with merging here is that you have duplicated index Dates, so there can't be unambigous assignment done.

But you could do it seperately as you suggested in the beginning. You said you don't know which of both df's have start earlier or end later. Find the min value of both indexes and get the max value of these two. Same for the upper bound, get both max values and take the min value of these two values. Then you slice your df's with the lower and upper bound.

lower, upper = max(df1.index.min(), df2.index.min()), min(df1.index.max(), df2.index.max())

df1 = df1.loc[lower:upper]
print(df1)

                     val_1
2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

df2 = df2.loc[lower:upper]
print(df2)

                     val_2
2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44

OLD:
Since you didn't provide usable data, here my own example input data:

np.random.seed(42)
df1 = pd.DataFrame(
    {
        'A' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:00', periods=10, freq='10T')
)

df2 = pd.DataFrame(
    {
        'B' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:30', periods=10, freq='10T')
)

which creates this data:

#df1
                     A
2022-11-26 08:00:00  6
2022-11-26 08:10:00  3
2022-11-26 08:20:00  7
2022-11-26 08:30:00  4
2022-11-26 08:40:00  6
2022-11-26 08:50:00  9
2022-11-26 09:00:00  2
2022-11-26 09:10:00  6
2022-11-26 09:20:00  7
2022-11-26 09:30:00  4

#df2
                     B
2022-11-26 08:30:00  3
2022-11-26 08:40:00  7
2022-11-26 08:50:00  7
2022-11-26 09:00:00  2
2022-11-26 09:10:00  5
2022-11-26 09:20:00  4
2022-11-26 09:30:00  1
2022-11-26 09:40:00  7
2022-11-26 09:50:00  5
2022-11-26 10:00:00  1

I think a decent approach still would be to merge the data to find out the edges that are off. Just a offer, if you leave them merged, you could compare them directly like this:

combined = df1.merge(df2, how='inner', left_index=True, right_index=True)
combined['compare'] = np.where(combined['A']==combined['B'], 'hit', 'miss')
print(combined)

Output of combined:

                     A  B compare
2022-11-26 08:30:00  4  3    miss
2022-11-26 08:40:00  6  7    miss
2022-11-26 08:50:00  9  7    miss
2022-11-26 09:00:00  2  2     hit
2022-11-26 09:10:00  6  5    miss
2022-11-26 09:20:00  7  4    miss
2022-11-26 09:30:00  4  1    miss

If you really need them to stay seperated, just add:

df1_new = combined[['A']]
df2_new = combined[['B']]
Rabinzel
  • 7,757
  • 3
  • 10
  • 30
  • Been playing with join, should think about merge, then it does what I wanted. line: combined['compare'] gives an error too long to put in a comment, last three lines: File "/home/xyz/miniconda3/envs/spyder-env/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc raise KeyError(key) from err KeyError: 'B' But that line is not necessary for my needs. Thanks! – chairman's_cat Nov 26 '22 at 22:39
  • You get a key error because you don't have a column with the name 'B'. You need to change them to your column names – Rabinzel Nov 26 '22 at 22:44
  • Nope, this is running the above code as it is. Depending on which electron goes haywire it is A or B. – chairman's_cat Nov 26 '22 at 22:53
  • Oh, I just saw I forgot to update the code when creating df2. – Rabinzel Nov 26 '22 at 22:57
  • So merge works as it merges, but it doesn't trim my dataframes. Maybe indexing. More on that when I find out. – chairman's_cat Nov 26 '22 at 23:20
  • inner merge will trim the data to only the rows where the index is in both dataframes. Here is a very good post about different kinds of merging data: [merging101](https://stackoverflow.com/questions/53645882/pandas-merging-101). Maybe it will clarify things. – Rabinzel Nov 26 '22 at 23:24
  • Added more to the original question, hopefully it is more clear now. Both join and merge seem to work with the same frequency of the key, but not when the key is multiplied as they multiply occurrence of values. – chairman's_cat Nov 27 '22 at 11:41
  • I added a new answer for this task. – Rabinzel Nov 27 '22 at 13:07