5

I have a dataframe, df_A with two columns 'amin' and 'amax', which is a set of time range.

My objective is to find whether a column in df_B lies between any of the rows of range in df_A 'amin' and 'amax' columns.

df_A[['amin','amax'] ]

                  amin                   amax
          0 2016-07-16 19:37:03   2016-07-17 11:16:32
          1 2016-07-04 21:15:54   2016-07-05 10:57:46
          2 2016-07-24 23:30:41   2016-07-25 15:38:02
          3 2016-07-12 03:02:38   2016-07-12 22:11:01

df_B['created_date']

      created_date
   2016-07-17 01:16:32 
   2016-07-05 10:15:54  
   2016-07-12 12:11:01


df_A['amin'] = pd.to_datetime(df_A['amin'], errors='coerce')
df_A['amax'] = pd.to_datetime(df_A['amax'], errors='coerce')
df_B['created_date'] = pd.to_datetime(df_B['created_date'],errors='coerce')

def dt2epoch(value):
   epoch = (value - pd.to_datetime(datetime(2015,12,31).strftime('%Y-%m-%d %H:%M:%S.%f'))).total_seconds()
   return epoch    

df_A['amax_epoch']=df_A['amax'].apply(dt2epoch)
df_A['amin_epoch']=df_A['amin'].apply(dt2epoch)
df_B['created_date_epoch']=df_B['created_date'].apply(dt2epoch)


def make_tuple(row):
     n= len(row)
     row = [(x,row[n - 1]) for x in row]
     return row

minMaxTuple = minMax.apply(make_tuple, axis =1)

Above is part of my code, I've tried below(not sure if it's necessary):

  1. convert them to epoch values
  2. convert df_A into a tuple.

However, df_A and df_B has a different number of rows. Also, I dont have any id column to merge them together.

label = []

for l in df_B['created_date_epoch']:

    if (m[0] for m in minMaxTuple) <= l <= (m[1] for m in minMaxTuple):
        label.append('1')
    else:
        label.append('0')

However, when I run this, the result I get for 'label' is an empty list.

Also, the label should be a column that has the same number of rows as df_A.

Ultimately, I would like to add a new 'label' column in df_A:

                              minMaxTuple                      label
            (2016-07-16 19:37:03, 2016-07-17 11:16:32)            1
            (2016-07-04 21:15:54, 2016-07-05 10:57:46)            1 
            (2016-07-24 23:30:41, 2016-07-25 15:38:02)            0
            (2016-07-12 03:02:38, 2016-07-12 22:11:01)            1
jynn
  • 95
  • 2
  • 9
  • Would you ever have data that would conflict with each other? meaning you have multiple `created_date`s that would fit into another `amin`/`amax` combo? – MattR Dec 08 '17 at 13:37
  • @MattR hmm yes this might be one of the concerns also. if this happened, do you think it will be better to just label 1 and 0 to indicate? rather than storing the original data? – jynn Dec 08 '17 at 13:39
  • what kind of data is this? this may not be a `pandas` thing. it may be a general data thing. Are there any *unique* identifiers that could be used? Before we start delving into code there may be a less-painful solution – MattR Dec 08 '17 at 13:41
  • @MattR do you mean unique key? cause they are from different table hence there isnt any. df_A is the result of clustering a series of time series data and amin amax are the min and max time of that particular cluster. while df_B is the date of an event happens that I want to label df_A data. – jynn Dec 08 '17 at 13:44
  • unique key, unique identifier. Same thing in my book. So what would be the expected result if there were two possible values in `created_date`? My fear is after this question is solved, and you will need to ask another one because the scenario I'm talking about will arise – MattR Dec 08 '17 at 13:46
  • @MattR cause my goal is to label if within that time range, did any event happen in df_B. hence at the end I will use '1' and '0' to label. I might edit my question, if this is confusing. – jynn Dec 08 '17 at 13:48
  • IIUC, If that is your goal, a duplicate should not matter since you only want to signal if an event happened. Instead of needing the date, you could use a binary `[1 or 0]`? is that correct? – MattR Dec 08 '17 at 13:51
  • @MattR yes! correct! whether a duplicate, or there's another data that also falls into that range will not be a matter to me. – jynn Dec 08 '17 at 13:51
  • perfect - I think this is a much more viable option. and thank you for editing the question to accommodate this change! – MattR Dec 08 '17 at 13:53
  • @MattR thanks for raising this as well. What would be your advice to tackle this problem? – jynn Dec 08 '17 at 14:06

4 Answers4

1

Very similar to @dubbbdan 's answer but maybe simpler using any and the and operator:

any_in_range = lambda row, iterable: any(
    [(row[0] < x) & (x < row[1]) for x in iterable])
df_A['label'] = df_A.apply(any_in_range, iterable=df_B['created_date'], axis=1)
print df_A

Prints:

                 amin                amax  label
0 2016-07-16 19:37:03 2016-07-17 11:16:32   True
1 2016-07-04 21:15:54 2016-07-05 10:57:46   True
2 2016-07-24 23:30:41 2016-07-25 15:38:02  False
3 2016-07-12 03:02:38 2016-07-12 22:11:01   True
Delforge
  • 792
  • 7
  • 17
0

One solution would be to see if a created_date in df_b falls between an amin and amax would be to use boolean logic. In a row-wise calculation for each row in df_a you could use the following logic:

if sum((row['amin'] > df_b['created_date']) | (row['amax'] < df_b['created_date'])) == len(df_b)

In this stament I am using the logical operator | to check if amin is less than created_date OR if amax is less than created_date. If the statement is True you could conclude that a created date does not fall between the time period created by amin and amax. If none of the created_dates fall between the period created by amin and amax, you could then assign a value of 0 to df_a['label']: Something like:

import pandas as pd
from StringIO import StringIO

def myfunc(row, df_b):
    if sum((row['amin'] > df_b['created_date']) | (row['amax'] < df_b['created_date'])) == len(df_b):
        return 0
    else:
        return 1

a_str= """
amin,amax
2016-07-16 19:37:03,2016-07-17 11:16:32
2016-07-04 21:15:54,2016-07-05 10:57:46
2016-07-24 23:30:41,2016-07-25 15:38:02
2016-07-12 03:02:38,2016-07-12 22:11:01"""

b_str = """
created_date
2016-07-17 01:16:32 
2016-07-05 10:15:54  
2016-07-12 12:11:01"""
df_a = pd.read_csv(StringIO(a_str), sep=',')
df_b = pd.read_csv(StringIO(b_str), sep=',')

#Convert to datetime
df_a['amin'] = pd.to_datetime(df_a['amin'])
df_a['amax'] = pd.to_datetime(df_a['amax'])
df_b['created_date'] = pd.to_datetime(df_b['created_date'])

df_a['label'] = df_a.apply(lambda x: myfunc(x,df_b), axis=1)

Which returns a column label in df_a with the expected output of:

                 amin                amax  label
0 2016-07-16 19:37:03 2016-07-17 11:16:32      1
1 2016-07-04 21:15:54 2016-07-05 10:57:46      1
2 2016-07-24 23:30:41 2016-07-25 15:38:02      0
3 2016-07-12 03:02:38 2016-07-12 22:11:01      1
dubbbdan
  • 2,650
  • 1
  • 25
  • 43
  • hi! this works perfectly! But can I ask why have t o check if the 'sum' of the statement is '==' to the 'len(df_B)'? @dubbbdan – jynn Dec 09 '17 at 01:58
  • @jayen I chose the OR logic (`|`) since only one condition needs to be false for a `created_date` to be not within the time period created by `amin` and `amax`. A sum treats `True` as the value of 1. So if, all elemets series are `True` (none of the dates are within the time span) it will be equal to the length of the series `created_date`. – dubbbdan Dec 09 '17 at 03:53
  • @jayen if the logical statement returns `True`, you can conclude none of the created dates fall within the time period created by `amin` and `amax`. Hope this helps! – dubbbdan Dec 09 '17 at 04:27
0

I created a list of tuple from the max and min date columns then searched for the date-timestamp inside the this list of tuples.

tuple_to_search = list(zip(df_A.amin,df_A.amax))

df_B['is_true']= df_B['created_date'].map(lambda k: any(filter(lambda x : x [0]<= k <=x[1],tuple_to_search ))).astype(int)
Stephen Kennedy
  • 20,585
  • 22
  • 95
  • 108
0

One option is with conditional_join from pyjanitor, which avoids a cartesian join (helpful with memory, and performance, depending on the data size):

# pip install pyjanitor
import pandas as pd
import janitor

(df_B
.conditional_join(
    df_A, 
    ('created_date', 'amin', '>='), 
    ('created_date', 'amax', '<='), 
    how = 'right')
.assign(label = lambda df: df.created_date.notna().astype(int))
.drop(columns='created_date')
)
                 amin                amax  label
0 2016-07-16 19:37:03 2016-07-17 11:16:32      1
1 2016-07-04 21:15:54 2016-07-05 10:57:46      1
2 2016-07-24 23:30:41 2016-07-25 15:38:02      0
3 2016-07-12 03:02:38 2016-07-12 22:11:01      1
sammywemmy
  • 27,093
  • 4
  • 17
  • 31