0

I am struggling to join data from an interval dataframe to a time series dataframe. For each row of my time series, I want to look in which interval it is included and return a specific value from the interval dataframe.

I got inspired by this solution : How to join two dataframes for which column values are within a certain range?

But it is not working for a too complex reason from my knowledge.

Here is my error message :

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <module>
      1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
----> 2 data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwargs)
   4355         dtype: float64
   4356         """
-> 4357         return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
   4358 
   4359     def _reduce(

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply(self)
   1041             return self.apply_str()
   1042 
-> 1043         return self.apply_standard()
   1044 
   1045     def agg(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
   1097                 # List[Union[Callable[..., Any], str]]]]]"; expected
   1098                 # "Callable[[Any], Any]"
-> 1099                 mapped = lib.map_infer(
   1100                     values,
   1101                     f,  # type: ignore[arg-type]

~\Anaconda3\lib\site-packages\pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <lambda>(x)
      1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Heure début réelle'],df_test['Hre fin réelle'],closed='both')
----> 2 data_test['Designation'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Désignation article'])

~\Anaconda3\lib\site-packages\pandas\core\indexes\interval.py in get_loc(self, key, method, tolerance)
    631         matches = mask.sum()
    632         if matches == 0:
--> 633             raise KeyError(key)
    634         elif matches == 1:
    635             return mask.argmax()

KeyError: Timestamp('2021-10-23 23:59:29')

The function I want to succeed .

df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])

Sample value for data_test

{'Ordre': {92: 3149484,
  93: 3149484,
  94: 3149484,
  95: 3149610,
  96: 3149610,
  97: 3149610,
  98: 3149610,
  99: 3149610,
  100: 3149610,
  101: 3149610,
  102: 3149611},
 'Start': {92: Timestamp('2021-10-26 06:55:00'),
  93: Timestamp('2021-10-26 06:55:00'),
  94: Timestamp('2021-10-26 06:55:00'),
  95: Timestamp('2021-10-26 07:25:00'),
  96: Timestamp('2021-10-26 07:25:00'),
  97: Timestamp('2021-10-26 07:25:00'),
  98: Timestamp('2021-10-26 08:30:00'),
  99: Timestamp('2021-10-26 08:30:00'),
  100: Timestamp('2021-10-26 08:30:00'),
  101: Timestamp('2021-10-26 08:30:00'),
  102: Timestamp('2021-10-26 11:37:00')},
 'End': {92: Timestamp('2021-10-26 07:25:00'),
  93: Timestamp('2021-10-26 07:25:00'),
  94: Timestamp('2021-10-26 07:25:00'),
  95: Timestamp('2021-10-26 08:30:00'),
  96: Timestamp('2021-10-26 08:30:00'),
  97: Timestamp('2021-10-26 08:30:00'),
  98: Timestamp('2021-10-26 11:37:00'),
  99: Timestamp('2021-10-26 11:37:00'),
  100: Timestamp('2021-10-26 11:37:00'),
  101: Timestamp('2021-10-26 11:37:00'),
  102: Timestamp('2021-10-26 12:30:00')},
 'Product': {92: 'Product_1',
  93: 'Product_1',
  94: 'Product_1',
  95: 'Product_2',
  96: 'Product_2',
  97: 'Product_2',
  98: 'Product_2',
  99: 'Product_2',
  100: 'Product_2',
  101: 'Product_2',
  102: 'Product_2'}}

Sample value for df_test

{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
  Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
 'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
  Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
  Timestamp('2021-10-26 08:29:29'): 66.0404281616211}}

The output would be a new column that indicate which product is concerned regarding timestamp included or not in the interval :

{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
  Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
 'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
  Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
  Timestamp('2021-10-26 08:29:29'): 66.0404281616211},
'Product': {Timestamp('2021-10-26 06:55:29'): 'Product_1',
  Timestamp('2021-10-26 06:56:29'): 'Product_1',
  Timestamp('2021-10-26 06:57:29'): 'Product_1',
  Timestamp('2021-10-26 06:58:29'): 'Product_1',
  Timestamp('2021-10-26 06:59:29'): 'Product_1',
  Timestamp('2021-10-26 08:25:29'): 'Product_2',
  Timestamp('2021-10-26 08:26:29'): 'Product_2',
  Timestamp('2021-10-26 08:27:29'): 'Product_2',
  Timestamp('2021-10-26 08:28:29'): 'Product_2',
  Timestamp('2021-10-26 08:29:29'): 'Product_2'}}

NEW SET OF DATA

data_test = {'Ordre': {53: 3147783, 54: 3147783, 55: 3147783, 56: 3147783, 57: 3147783},
 'Start': {53: Timestamp('2021-10-24 20:35:00'),
  54: Timestamp('2021-10-24 20:35:00'),
  55: Timestamp('2021-10-25 00:01:00'),
  56: Timestamp('2021-10-25 00:01:00'),
  57: Timestamp('2021-10-25 00:01:00')},
 'End': {53: Timestamp('2021-10-24 23:59:00'),
  54: Timestamp('2021-10-24 23:59:00'),
  55: Timestamp('2021-10-25 04:27:00'),
  56: Timestamp('2021-10-25 04:27:00'),
  57: Timestamp('2021-10-25 04:27:00')},
 'Product': {53: 'Product_1',
  54: 'Product_1',
  55: 'Product_1',
  56: 'Product_1',
  57: 'Product_1'}}

df_test = {'Temperature_1': {Timestamp('2021-10-24 23:55:00'): 48.0,
  Timestamp('2021-10-24 23:56:00'): 48.0,
  Timestamp('2021-10-24 23:57:00'): 48.0,
  Timestamp('2021-10-24 23:58:00'): 48.0,
  Timestamp('2021-10-24 23:59:00'): 48.0,
  Timestamp('2021-10-25 00:00:00'): 48.0,
  Timestamp('2021-10-25 00:01:00'): 48.0,
  Timestamp('2021-10-25 00:02:00'): 48.0},
 'Temperature_2': {Timestamp('2021-10-24 23:55:00'): 60.0,
  Timestamp('2021-10-24 23:56:00'): 60.0,
  Timestamp('2021-10-24 23:57:00'): 60.0,
  Timestamp('2021-10-24 23:58:00'): 60.0,
  Timestamp('2021-10-24 23:59:00'): 60.0,
  Timestamp('2021-10-25 00:00:00'): 59.0,
  Timestamp('2021-10-25 00:01:00'): 59.0,
  Timestamp('2021-10-25 00:02:00'): 59.0}}

Thank you for your help and advice

Gaetan
  • 5
  • 3
  • Thanks for the advice, I added one – Gaetan Dec 01 '21 at 17:58
  • 1
    Here is the original data for data_test, I also added the pd.IntervalIndex.from_arrays function I use to create the dataframes with Interval. Thanks – Gaetan Dec 02 '21 at 07:44

1 Answers1

0

The intervals should be created on the data_test, not df_test. Also, your data_test has duplicates:

data_test = data_test.drop_duplicates()
data_test.index = pd.IntervalIndex.from_arrays(data_test['Start'],
                                               data_test['End'],
                                               closed='both')

product = (df_test
           .index
           .to_series()
           .apply(lambda df: data_test.iloc[data_test.index.get_loc(df), 
                                            data_test.columns.get_loc('Product')])
          )

df_test.assign(Product = product)
 
Temperature_1  Temperature_2    Product
2021-10-26 06:55:29      62.990524      66.829086  Product_1
2021-10-26 06:56:29      62.990524      66.829086  Product_1
2021-10-26 06:57:29      62.990524      66.829086  Product_1
2021-10-26 06:58:29      62.990524      66.829086  Product_1
2021-10-26 06:59:29      62.990524      66.829086  Product_1
2021-10-26 08:25:29      65.061195      67.044952  Product_2
2021-10-26 08:26:29      65.061195      67.044952  Product_2
2021-10-26 08:27:29      65.061195      67.044952  Product_2
2021-10-26 08:28:29      65.061195      66.040428  Product_2
2021-10-26 08:29:29      65.061195      66.040428  Product_2

For the updated data, you are right that if a value is not within the intervals it will fail. There are alternatives to solve this:

One option is to use conditional_join from pyjanitor, which can help to abstract inequality joins:

# pip install pyjanitor
import pandas as pd
import janitor

data_test = pd.DataFrame(data_test)

df_test = pd.DataFrame(df_test)


df_test.index.name = 'Timestamp'

(df_test
  .reset_index()
  .conditional_join(
       data_test, 
       ('Timestamp', 'Start', '>='), 
       ('Timestamp', 'End', '<='), how = 'left')
  .loc[:, ['Timestamp', 'Temperature_1', 'Temperature_2', 'Product']]
  .set_index('Timestamp')
) 
                     Temperature_1  Temperature_2    Product
Timestamp                                                   
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-25 00:00:00           48.0           59.0        NaN
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1

The other option involves IntervalIndex; however, instead of apply, we use a for loop (apply is sort of a for loop):

# start afresh

data_test = pd.DataFrame(data_test)

df_test = pd.DataFrame(df_test)

# build the intervals
intervals = pd.IntervalIndex.from_arrays(data_test['Start'],
                                         data_test['End'],
                                         closed='both')

data_test.index = intervals

values = {}

# create dictionary of values found in the intervals
for val in df_test.index:
    present = intervals.contains(val)
    if present.any(): # we found something!
        values[val] = intervals[present]

values = pd.Series(values).explode()

# reindex and create a temporary column
df_test.loc[values.index, 'intervals'] = values.array

# use the temporary column to merge
(df_test
  .merge(data_test.Product, 
         left_on='intervals', 
         right_index = True, 
         how = 'left')
  .drop(columns='intervals')
)

                     Temperature_1  Temperature_2    Product
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-25 00:00:00           48.0           59.0        NaN
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thank you for your help ! I am confident it works in normal conditions, except I have a similar KeyError for another timestamp. I dig into the data and realised that this specific timestamp is not include in any of the intervals. Being out of range from the intervals can throw an error like this ? – Gaetan Dec 02 '21 at 14:38
  • Kindly update your sample dataframe to include some that are not within the interval – sammywemmy Dec 02 '21 at 19:18
  • New sample of data at the end of my post. Thank you for your support ! – Gaetan Dec 03 '21 at 08:01
  • 1
    Wow ! Both work perfectly ! The second needs more time to process the data, the for loop maybe ? Since it's only a small amount of data, it will process more in the future, I guess I will go for the one with Pyjanitor. Thank you a lot for your help ! – Gaetan Dec 03 '21 at 16:57