
I have two datasets: a time series and a set of time intervals. I want to bring variables from the interval dataset into the time-series dataset based on a matching ID and time interval.

I have already looked at this question: Merging two dataframes based on a date between two other dates without a common column

But that question merges a time series into a time-interval dataset that has no ID of its own. I cannot adapt that code, because I have to match on ID and time at the same time, and my pandas knowledge is limited.

Here is an example of my datasets.

df1 is a time series with an ID column and a timestamp every hour.

df1:

import pandas as pd
from pandas import Timestamp

data = {'ID': {0: 1111, 1: 1111, 2: 1111, 3: 1111, 4: 1111, 5: 1111, 6: 1111, 7: 1111, 8: 1111, 9: 1111, 10: 1111, 11: 1111, 12: 1111, 13: 1111, 14: 1111, 15: 1111, 16: 1111, 17: 1111, 18: 1111, 19: 1111, 20: 1111, 21: 1111, 22: 1111, 23: 1111, 24: 2222, 25: 2222, 26: 2222, 27: 2222, 28: 2222, 29: 2222, 30: 2222, 31: 2222, 32: 2222, 33: 2222, 34: 2222, 35: 2222, 36: 2222, 37: 2222, 38: 2222, 39: 2222, 40: 2222, 41: 2222, 42: 2222, 43: 2222, 44: 2222, 45: 2222, 46: 2222, 47: 2222, 48: 2222, 49: 2222, 50: 2222, 51: 2222, 52: 2222, 53: 2222, 54: 2222, 55: 2222, 56: 2222, 57: 2222, 58: 2222, 59: 2222, 60: 2222, 61: 2222, 62: 2222, 63: 2222, 64: 3333, 65: 3333, 66: 3333, 67: 3333, 68: 3333, 69: 3333, 70: 3333, 71: 3333, 72: 3333, 73: 3333, 74: 3333, 75: 3333, 76: 3333, 77: 3333, 78: 3333, 79: 3333, 80: 3333, 81: 3333, 82: 3333, 83: 3333, 84: 3333, 85: 3333, 86: 3333, 87: 3333, 88: 3333, 89: 3333, 90: 3333, 91: 3333, 92: 3333, 93: 3333, 94: 3333, 95: 3333, 96: 3333, 97: 3333, 98: 3333, 99: 3333, 100: 3333, 101: 3333}, 'date_time': {0: Timestamp('2019-04-01 00:46:00'), 1: Timestamp('2019-04-01 01:46:00'), 2: Timestamp('2019-04-01 02:46:00'), 3: Timestamp('2019-04-01 03:47:00'), 4: Timestamp('2019-04-01 04:46:00'), 5: Timestamp('2019-04-01 05:46:00'), 6: Timestamp('2019-04-01 06:46:00'), 7: Timestamp('2019-04-01 07:47:00'), 8: Timestamp('2019-04-01 08:46:00'), 9: Timestamp('2019-04-01 09:46:00'), 10: Timestamp('2019-04-01 10:46:00'), 11: Timestamp('2019-04-01 11:47:00'), 12: Timestamp('2019-04-01 12:46:00'), 13: Timestamp('2019-04-01 13:46:00'), 14: Timestamp('2019-04-01 14:46:00'), 15: Timestamp('2019-04-01 15:47:00'), 16: Timestamp('2019-04-01 16:46:00'), 17: Timestamp('2019-04-01 17:46:00'), 18: Timestamp('2019-04-01 18:46:00'), 19: Timestamp('2019-04-01 19:47:00'), 20: Timestamp('2019-04-01 20:46:00'), 21: Timestamp('2019-04-01 21:46:00'), 22: Timestamp('2019-04-01 22:46:00'), 23: Timestamp('2019-04-01 23:46:00'), 24: Timestamp('2019-01-12 16:58:00'), 25: 
Timestamp('2019-01-12 17:58:00'), 26: Timestamp('2019-01-12 18:58:00'), 27: Timestamp('2019-01-12 19:58:00'), 28: Timestamp('2019-01-12 20:58:00'), 29: Timestamp('2019-01-12 21:58:00'), 30: Timestamp('2019-01-12 22:58:00'), 31: Timestamp('2019-01-12 23:58:00'), 32: Timestamp('2019-01-13 00:58:00'), 33: Timestamp('2019-01-13 01:58:00'), 34: Timestamp('2019-01-13 02:58:00'), 35: Timestamp('2019-01-13 03:58:00'), 36: Timestamp('2019-01-13 04:58:00'), 37: Timestamp('2019-01-13 05:58:00'), 38: Timestamp('2019-01-13 06:58:00'), 39: Timestamp('2019-01-13 07:58:00'), 40: Timestamp('2019-01-13 08:58:00'), 41: Timestamp('2019-01-13 09:58:00'), 42: Timestamp('2019-01-13 10:58:00'), 43: Timestamp('2019-01-13 11:58:00'), 44: Timestamp('2019-01-13 12:58:00'), 45: Timestamp('2019-01-13 13:58:00'), 46: Timestamp('2019-01-13 14:58:00'), 47: Timestamp('2019-01-13 15:58:00'), 48: Timestamp('2019-01-13 16:58:00'), 49: Timestamp('2019-01-13 17:58:00'), 50: Timestamp('2019-01-13 18:58:00'), 51: Timestamp('2019-01-13 19:58:00'), 52: Timestamp('2019-01-13 20:58:00'), 53: Timestamp('2019-01-13 21:58:00'), 54: Timestamp('2019-01-13 22:58:00'), 55: Timestamp('2019-01-13 23:58:00'), 56: Timestamp('2019-01-14 00:58:00'), 57: Timestamp('2019-01-14 01:58:00'), 58: Timestamp('2019-01-14 02:58:00'), 59: Timestamp('2019-01-14 03:58:00'), 60: Timestamp('2019-01-14 04:58:00'), 61: Timestamp('2019-01-14 05:58:00'), 62: Timestamp('2019-01-14 06:58:00'), 63: Timestamp('2019-01-16 16:58:00'), 64: Timestamp('2019-07-06 13:24:00'), 65: Timestamp('2019-07-06 14:24:00'), 66: Timestamp('2019-07-06 15:24:00'), 67: Timestamp('2019-07-06 16:24:00'), 68: Timestamp('2019-07-06 17:24:00'), 69: Timestamp('2019-07-06 18:24:00'), 70: Timestamp('2019-07-06 19:24:00'), 71: Timestamp('2019-07-06 20:24:00'), 72: Timestamp('2019-07-06 21:24:00'), 73: Timestamp('2019-07-06 22:24:00'), 74: Timestamp('2019-07-06 23:24:00'), 75: Timestamp('2019-07-07 00:24:00'), 76: Timestamp('2019-07-07 01:24:00'), 77: Timestamp('2019-07-07 
02:24:00'), 78: Timestamp('2019-07-07 03:24:00'), 79: Timestamp('2019-07-07 04:24:00'), 80: Timestamp('2019-07-07 05:24:00'), 81: Timestamp('2019-07-07 06:24:00'), 82: Timestamp('2019-07-07 07:24:00'), 83: Timestamp('2019-07-07 08:24:00'), 84: Timestamp('2019-07-07 09:24:00'), 85: Timestamp('2019-07-07 10:24:00'), 86: Timestamp('2019-07-07 11:24:00'), 87: Timestamp('2019-07-07 12:24:00'), 88: Timestamp('2019-07-07 13:24:00'), 89: Timestamp('2019-07-07 14:24:00'), 90: Timestamp('2019-07-07 15:24:00'), 91: Timestamp('2019-07-08 13:24:00'), 92: Timestamp('2019-07-08 14:24:00'), 93: Timestamp('2019-07-08 15:24:00'), 94: Timestamp('2019-07-08 16:24:00'), 95: Timestamp('2019-07-08 17:24:00'), 96: Timestamp('2019-07-08 18:24:00'), 97: Timestamp('2019-07-08 19:24:00'), 98: Timestamp('2019-07-08 20:24:00'), 99: Timestamp('2019-07-08 21:24:00'), 100: Timestamp('2019-07-08 22:24:00'), 101: Timestamp('2019-07-08 23:24:00')}}

df1 = pd.DataFrame(data)

       ID           date_time
0    1111 2019-04-01 00:46:00
1    1111 2019-04-01 01:46:00
2    1111 2019-04-01 02:46:00
3    1111 2019-04-01 03:47:00
4    1111 2019-04-01 04:46:00
..    ...                 ...
97   3333 2019-07-08 19:24:00
98   3333 2019-07-08 20:24:00
99   3333 2019-07-08 21:24:00
100  3333 2019-07-08 22:24:00
101  3333 2019-07-08 23:24:00

[102 rows x 2 columns]

df2 is time-interval data with the same IDs as df1, containing "id", "start" and "end" times, and "variable_1".

df2:

data = {'id': {0: 1111, 1: 1111, 2: 2222, 3: 2222, 4: 3333, 5: 3333}, 'start': {0: Timestamp('2019-04-01 02:46:00'), 1: Timestamp('2019-04-01 13:46:00'), 2: Timestamp('2019-01-12 18:58:00'), 3: Timestamp('2019-01-13 05:58:00'), 4: Timestamp('2019-07-06 16:24:00'), 5: Timestamp('2019-07-07 03:24:00')}, 'end': {0: Timestamp('2019-04-01 05:46:00'), 1: Timestamp('2019-04-01 21:46:00'), 2: Timestamp('2019-01-13 00:58:00'), 3: Timestamp('2019-01-13 16:58:00'), 4: Timestamp('2019-07-06 23:24:00'), 5: Timestamp('2019-07-08 14:24:00')}, 'variable_1': {0: 102, 1: 103, 2: 104, 3: 105, 4: 106, 5: 107}}

df2 = pd.DataFrame(data)
   
     id               start                 end  variable_1
0  1111 2019-04-01 02:46:00 2019-04-01 05:46:00         102
1  1111 2019-04-01 13:46:00 2019-04-01 21:46:00         103
2  2222 2019-01-12 18:58:00 2019-01-13 00:58:00         104
3  2222 2019-01-13 05:58:00 2019-01-13 16:58:00         105
4  3333 2019-07-06 16:24:00 2019-07-06 23:24:00         106
5  3333 2019-07-07 03:24:00 2019-07-08 14:24:00         107

I want to bring variable_1 from df2 into df1 wherever a df1 timestamp falls within a df2 interval, matching on the same ID.

The desired output looks like this:

[Image: desired output — df1 with a variable_1 column attached]

I have tried Excel's VLOOKUP function with some initial success, but I could not match every ID and lost some data.

Currently, I store my huge datasets (millions of rows) as CSV files. I've seen people use pandas, but I haven't found an example that joins interval data to time-series data on start/end times and a specific ID at the same time.

Meeana C.
  • I have checked some solutions to the previous question, but the problem is that I need to match on both ID and time interval. This link https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common is quite similar, but I need to merge the time-interval dataset into the time-series dataset. – Meeana C. Jul 11 '22 at 12:58
  • Please add your sample dataframes in text format (`df.to_dict()`). Also, are there any intervals in df2 that might overlap within the same id? – Shubham Sharma Jul 11 '22 at 13:37
  • I have already posted the sample dataframes in text. As for "are there any intervals in df2 that might overlap within the same id?" — no, there is no time overlap within the same ID. – Meeana C. Jul 11 '22 at 14:38

1 Answer


Code

# Interval index from the start/end columns; 'both' makes endpoints inclusive
i = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
# Mapping series keyed by (id, interval)
s = df2.set_index(['id', i])['variable_1']

# Look up each (ID, date_time) pair; timestamps outside every interval become NaN
df1['variable_1'] = df1.set_index(['ID', 'date_time']).index.map(s)

How this works

  • Create an interval index from the start and end columns of df2.
  • Create a mapping series (s) by setting df2's index to id plus the interval index and selecting the column variable_1.
  • Use the mapping series with MultiIndex.map to substitute the values into df1; the interval level matches any timestamp that falls inside it.
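
The three steps can be seen on a miniature sketch of the question's df1/df2 (only two timestamps and one interval, for illustration):

```python
import pandas as pd

# Tiny stand-ins for the question's df1 and df2
df1 = pd.DataFrame({
    'ID': [1111, 1111],
    'date_time': pd.to_datetime(['2019-04-01 00:46', '2019-04-01 02:46']),
})
df2 = pd.DataFrame({
    'id': [1111],
    'start': pd.to_datetime(['2019-04-01 02:46']),
    'end': pd.to_datetime(['2019-04-01 05:46']),
    'variable_1': [102],
})

# Step 1: interval index, endpoints inclusive
i = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='both')
# Step 2: mapping series keyed by (id, interval)
s = df2.set_index(['id', i])['variable_1']
# Step 3: each (ID, date_time) pair is looked up against (id, interval);
# 00:46 falls in no interval -> NaN, 02:46 falls in [02:46, 05:46] -> 102
df1['variable_1'] = df1.set_index(['ID', 'date_time']).index.map(s)
```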

Result

       ID           date_time  variable_1
0    1111 2019-04-01 00:46:00         NaN
1    1111 2019-04-01 01:46:00         NaN
2    1111 2019-04-01 02:46:00       102.0
3    1111 2019-04-01 03:47:00       102.0
4    1111 2019-04-01 04:46:00       102.0
5    1111 2019-04-01 05:46:00       102.0
6    1111 2019-04-01 06:46:00         NaN
...
11   1111 2019-04-01 11:47:00         NaN
12   1111 2019-04-01 12:46:00         NaN
13   1111 2019-04-01 13:46:00       103.0
...
32   2222 2019-01-13 00:58:00       104.0
33   2222 2019-01-13 01:58:00         NaN
34   2222 2019-01-13 02:58:00         NaN
35   2222 2019-01-13 03:58:00         NaN
36   2222 2019-01-13 04:58:00         NaN
37   2222 2019-01-13 05:58:00       105.0
...
92   3333 2019-07-08 14:24:00       107.0
...
101  3333 2019-07-08 23:24:00         NaN
Shubham Sharma
  • I have tried it on my real data already. It showed "InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique". I think there are some overlaps in the interval dataset, so I checked and removed some duplicates, but it still does not work. Do you have any suggestions? – Meeana C. Jul 12 '22 at 06:45
  • @MeeanaC. Let me get back to you in some time. – Shubham Sharma Jul 12 '22 at 11:49
  • Hey @MeeanaC. Just wanted to check if you are still having that error? – Shubham Sharma Jul 13 '22 at 16:03
  • Yes, I am still fixing it. I am trying to clean the data, but it still errors. – Meeana C. Jul 17 '22 at 00:25
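
The error in the comments comes from IntervalIndex refusing containment lookups when intervals overlap. A merge-then-filter approach sidesteps it; this is a sketch, not part of the accepted answer, and the tie-break rule (keep the first matching interval per timestamp) is an assumption:

```python
import pandas as pd

# Toy data with two overlapping intervals for the same id
df1 = pd.DataFrame({
    'ID': [1111, 1111, 1111],
    'date_time': pd.to_datetime(['2019-04-01 02:46', '2019-04-01 03:47', '2019-04-01 06:46']),
})
df2 = pd.DataFrame({
    'id': [1111, 1111],
    'start': pd.to_datetime(['2019-04-01 02:46', '2019-04-01 03:00']),
    'end': pd.to_datetime(['2019-04-01 05:46', '2019-04-01 04:00']),
    'variable_1': [102, 108],
})

# Pair every timestamp with every interval of the same ID, then keep
# only the rows where the timestamp falls inside the interval.
m = df1.merge(df2, left_on='ID', right_on='id')
m = m[(m['date_time'] >= m['start']) & (m['date_time'] <= m['end'])]
# Overlaps yield several candidate rows per timestamp; keep the first.
first = m.drop_duplicates(subset=['ID', 'date_time'])[['ID', 'date_time', 'variable_1']]
# Left-merge back so unmatched timestamps stay as NaN
out = df1.merge(first, on=['ID', 'date_time'], how='left')
```

Note that the intermediate merge is a cartesian product within each ID, so with millions of rows it is worth processing in chunks (e.g. per ID group) rather than in one pass.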