I have two datasets. I want to return variables from the time interval dataset to time-series dataset based on specific ID and time interval.
I have already followed this link Merging two dataframes based on a date between two other dates without a common column
But it is a merged time series to time interval dataset without its own ID. I cannot adapt this code due to I have to combine ID and time at the same time! and also my poor knowledge.
So this is my data set example.
df1 is a time series consisting of a specific ID and timestamp every one hour.
df1:
data = {'ID': {0: 1111, 1: 1111, 2: 1111, 3: 1111, 4: 1111, 5: 1111, 6: 1111, 7: 1111, 8: 1111, 9: 1111, 10: 1111, 11: 1111, 12: 1111, 13: 1111, 14: 1111, 15: 1111, 16: 1111, 17: 1111, 18: 1111, 19: 1111, 20: 1111, 21: 1111, 22: 1111, 23: 1111, 24: 2222, 25: 2222, 26: 2222, 27: 2222, 28: 2222, 29: 2222, 30: 2222, 31: 2222, 32: 2222, 33: 2222, 34: 2222, 35: 2222, 36: 2222, 37: 2222, 38: 2222, 39: 2222, 40: 2222, 41: 2222, 42: 2222, 43: 2222, 44: 2222, 45: 2222, 46: 2222, 47: 2222, 48: 2222, 49: 2222, 50: 2222, 51: 2222, 52: 2222, 53: 2222, 54: 2222, 55: 2222, 56: 2222, 57: 2222, 58: 2222, 59: 2222, 60: 2222, 61: 2222, 62: 2222, 63: 2222, 64: 3333, 65: 3333, 66: 3333, 67: 3333, 68: 3333, 69: 3333, 70: 3333, 71: 3333, 72: 3333, 73: 3333, 74: 3333, 75: 3333, 76: 3333, 77: 3333, 78: 3333, 79: 3333, 80: 3333, 81: 3333, 82: 3333, 83: 3333, 84: 3333, 85: 3333, 86: 3333, 87: 3333, 88: 3333, 89: 3333, 90: 3333, 91: 3333, 92: 3333, 93: 3333, 94: 3333, 95: 3333, 96: 3333, 97: 3333, 98: 3333, 99: 3333, 100: 3333, 101: 3333}, 'date_time': {0: Timestamp('2019-04-01 00:46:00'), 1: Timestamp('2019-04-01 01:46:00'), 2: Timestamp('2019-04-01 02:46:00'), 3: Timestamp('2019-04-01 03:47:00'), 4: Timestamp('2019-04-01 04:46:00'), 5: Timestamp('2019-04-01 05:46:00'), 6: Timestamp('2019-04-01 06:46:00'), 7: Timestamp('2019-04-01 07:47:00'), 8: Timestamp('2019-04-01 08:46:00'), 9: Timestamp('2019-04-01 09:46:00'), 10: Timestamp('2019-04-01 10:46:00'), 11: Timestamp('2019-04-01 11:47:00'), 12: Timestamp('2019-04-01 12:46:00'), 13: Timestamp('2019-04-01 13:46:00'), 14: Timestamp('2019-04-01 14:46:00'), 15: Timestamp('2019-04-01 15:47:00'), 16: Timestamp('2019-04-01 16:46:00'), 17: Timestamp('2019-04-01 17:46:00'), 18: Timestamp('2019-04-01 18:46:00'), 19: Timestamp('2019-04-01 19:47:00'), 20: Timestamp('2019-04-01 20:46:00'), 21: Timestamp('2019-04-01 21:46:00'), 22: Timestamp('2019-04-01 22:46:00'), 23: Timestamp('2019-04-01 23:46:00'), 24: Timestamp('2019-01-12 16:58:00'), 25: Timestamp('2019-01-12 17:58:00'), 26: Timestamp('2019-01-12 18:58:00'), 27: Timestamp('2019-01-12 19:58:00'), 28: Timestamp('2019-01-12 20:58:00'), 29: Timestamp('2019-01-12 21:58:00'), 30: Timestamp('2019-01-12 22:58:00'), 31: Timestamp('2019-01-12 23:58:00'), 32: Timestamp('2019-01-13 00:58:00'), 33: Timestamp('2019-01-13 01:58:00'), 34: Timestamp('2019-01-13 02:58:00'), 35: Timestamp('2019-01-13 03:58:00'), 36: Timestamp('2019-01-13 04:58:00'), 37: Timestamp('2019-01-13 05:58:00'), 38: Timestamp('2019-01-13 06:58:00'), 39: Timestamp('2019-01-13 07:58:00'), 40: Timestamp('2019-01-13 08:58:00'), 41: Timestamp('2019-01-13 09:58:00'), 42: Timestamp('2019-01-13 10:58:00'), 43: Timestamp('2019-01-13 11:58:00'), 44: Timestamp('2019-01-13 12:58:00'), 45: Timestamp('2019-01-13 13:58:00'), 46: Timestamp('2019-01-13 14:58:00'), 47: Timestamp('2019-01-13 15:58:00'), 48: Timestamp('2019-01-13 16:58:00'), 49: Timestamp('2019-01-13 17:58:00'), 50: Timestamp('2019-01-13 18:58:00'), 51: Timestamp('2019-01-13 19:58:00'), 52: Timestamp('2019-01-13 20:58:00'), 53: Timestamp('2019-01-13 21:58:00'), 54: Timestamp('2019-01-13 22:58:00'), 55: Timestamp('2019-01-13 23:58:00'), 56: Timestamp('2019-01-14 00:58:00'), 57: Timestamp('2019-01-14 01:58:00'), 58: Timestamp('2019-01-14 02:58:00'), 59: Timestamp('2019-01-14 03:58:00'), 60: Timestamp('2019-01-14 04:58:00'), 61: Timestamp('2019-01-14 05:58:00'), 62: Timestamp('2019-01-14 06:58:00'), 63: Timestamp('2019-01-16 16:58:00'), 64: Timestamp('2019-07-06 13:24:00'), 65: Timestamp('2019-07-06 14:24:00'), 66: Timestamp('2019-07-06 15:24:00'), 67: Timestamp('2019-07-06 16:24:00'), 68: Timestamp('2019-07-06 17:24:00'), 69: Timestamp('2019-07-06 18:24:00'), 70: Timestamp('2019-07-06 19:24:00'), 71: Timestamp('2019-07-06 20:24:00'), 72: Timestamp('2019-07-06 21:24:00'), 73: Timestamp('2019-07-06 22:24:00'), 74: Timestamp('2019-07-06 23:24:00'), 75: Timestamp('2019-07-07 00:24:00'), 76: Timestamp('2019-07-07 01:24:00'), 77: Timestamp('2019-07-07 02:24:00'), 78: Timestamp('2019-07-07 03:24:00'), 79: Timestamp('2019-07-07 04:24:00'), 80: Timestamp('2019-07-07 05:24:00'), 81: Timestamp('2019-07-07 06:24:00'), 82: Timestamp('2019-07-07 07:24:00'), 83: Timestamp('2019-07-07 08:24:00'), 84: Timestamp('2019-07-07 09:24:00'), 85: Timestamp('2019-07-07 10:24:00'), 86: Timestamp('2019-07-07 11:24:00'), 87: Timestamp('2019-07-07 12:24:00'), 88: Timestamp('2019-07-07 13:24:00'), 89: Timestamp('2019-07-07 14:24:00'), 90: Timestamp('2019-07-07 15:24:00'), 91: Timestamp('2019-07-08 13:24:00'), 92: Timestamp('2019-07-08 14:24:00'), 93: Timestamp('2019-07-08 15:24:00'), 94: Timestamp('2019-07-08 16:24:00'), 95: Timestamp('2019-07-08 17:24:00'), 96: Timestamp('2019-07-08 18:24:00'), 97: Timestamp('2019-07-08 19:24:00'), 98: Timestamp('2019-07-08 20:24:00'), 99: Timestamp('2019-07-08 21:24:00'), 100: Timestamp('2019-07-08 22:24:00'), 101: Timestamp('2019-07-08 23:24:00')}}
df = pd.DataFrame(data)
ID date_time
0 1111 2019-04-01 00:46:00
1 1111 2019-04-01 01:46:00
2 1111 2019-04-01 02:46:00
3 1111 2019-04-01 03:47:00
4 1111 2019-04-01 04:46:00
.. ... ...
97 3333 2019-07-08 19:24:00
98 3333 2019-07-08 20:24:00
99 3333 2019-07-08 21:24:00
100 3333 2019-07-08 22:24:00
101 3333 2019-07-08 23:24:00
[102 rows x 2 columns]
df2 is time interval data with a specific ID (same ID as df1)including "id" "start" and "end" time and "variable_1".
df2:
data = {'id': {0: 1111, 1: 1111, 2: 2222, 3: 2222, 4: 3333, 5: 3333}, 'start': {0: Timestamp('2019-04-01 02:46:00'), 1: Timestamp('2019-04-01 13:46:00'), 2: Timestamp('2019-01-12 18:58:00'), 3: Timestamp('2019-01-13 05:58:00'), 4: Timestamp('2019-07-06 16:24:00'), 5: Timestamp('2019-07-07 03:24:00')}, 'end': {0: Timestamp('2019-04-01 05:46:00'), 1: Timestamp('2019-04-01 21:46:00'), 2: Timestamp('2019-01-13 00:58:00'), 3: Timestamp('2019-01-13 16:58:00'), 4: Timestamp('2019-07-06 23:24:00'), 5: Timestamp('2019-07-08 14:24:00')}, 'variable_1': {0: 102, 1: 103, 2: 104, 3: 105, 4: 106, 5: 107}}
df2 = pd.DataFrame(data)
id start end variable_1
0 1111 2019-04-01 02:46:00 2019-04-01 05:46:00 102
1 1111 2019-04-01 13:46:00 2019-04-01 21:46:00 103
2 2222 2019-01-12 18:58:00 2019-01-13 00:58:00 104
3 2222 2019-01-13 05:58:00 2019-01-13 16:58:00 105
4 3333 2019-07-06 16:24:00 2019-07-06 23:24:00 106
5 3333 2019-07-07 03:24:00 2019-07-08 14:24:00 107
I want to return variable_1 from df2 to df1 at every time within the interval of df2 matching using the same ID.
similar to this output:
I have tried using vlookup
function in excel.
There was some initial success, but I cannot match every ID and did lose some data.
Currently, I store my huge data (in millions) sets as CSV files.
I've seen people use pandas
, but I haven't found one that joins interval data to time series data using start and ends time together with a specific ID at the same time.