
I am new to Python and need some help with a question about working with datetimes in pandas.

I have df_a, which has a column titled time, and I am trying to create a new column id in df_a.

I want the id column to be determined by whether the time falls within one of the ranges defined by the df_b columns date and date_new. For example, the first row of df_b has a date of "2019-01-07 20:52:41" and a date_new of "2019-01-07 21:07:41" (a 15 minute interval); I would like that row's index to appear as the id in df_a whenever the time falls in that range, so for a time of "2019-01-07 20:56:30" the id would be 0, and so on for all the rows in df_a.

This question is similar, but I cannot figure out how to make its approach work with my data, as I keep getting an error:

python assign value to pandas df if falls between range of dates in another df

s = pd.Series(df_b['id'].values, index=pd.IntervalIndex.from_arrays(df_b['date'], df_b['date_new']))
df_a['id'] = df_a['time'].map(s)

ValueError: cannot handle non-unique indices

One caveat is that the ranges in df_b are not always unique, meaning some of the intervals cover the same periods of time. In those cases it is fine to use the id of the first interval in df_b that the time falls in. Additionally, there are over 200 rows in df_b and 2000 in df_a, so it would take too long to define each time period in a for-loop type format, unless there is an easier way to do it than defining each one. Thank you in advance for all of your help! If this could use any clarification, please let me know.

df_a

time                    id
2019-01-07 22:02:56     NaN
2019-01-07 21:57:12     NaN
2019-01-08 09:35:30     NaN


df_b

date                    date_new               id
2019-01-07 21:50:56    2019-01-07 22:05:56     0
2019-01-08 09:30:30    2019-01-08 09:45:30     1

Expected Result

df_a     
time                    id
2019-01-07 22:02:56     0
2019-01-07 21:57:12     0
2019-01-08 09:35:30     1
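
To make the error reproducible, here is a stripped-down sketch of what I am running; the df_b below is made up just to mimic the repeated intervals in my real data:

import pandas as pd

# made-up df_b with a repeated interval, mimicking my real data
df_b = pd.DataFrame({
    'date':     pd.to_datetime(['2019-01-07 20:52:41', '2019-01-07 20:52:41']),
    'date_new': pd.to_datetime(['2019-01-07 21:07:41', '2019-01-07 21:07:41']),
    'id': [0, 1],
})
df_a = pd.DataFrame({'time': pd.to_datetime(['2019-01-07 20:56:30'])})

s = pd.Series(df_b['id'].values,
              index=pd.IntervalIndex.from_arrays(df_b['date'], df_b['date_new']))
df_a['id'] = df_a['time'].map(s)   # this is the line that raises the ValueError for me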

2 Answers


Let me rephrase your problem: for each row in dataframe df_a you want to check whether its value in df_a['time'] falls in the interval given by the values in columns df_b['date'] and df_b['date_new'], and if so, set the value in df_a["id"] to the corresponding df_b["id"].

If this is your question, this is a (very rough) solution:

# brute-force check: for each row in df_a, scan df_b for the first interval
# that contains the time, copy that interval's id over, and stop looking
for ia, ra in df_a.iterrows():
    for ib, rb in df_b.iterrows():
        if rb['date'] <= ra['time'] <= rb['date_new']:
            df_a.loc[ia, "id"] = rb["id"]
            break
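
If the double loop is too slow at roughly 2000 × 200 rows, the same rule can also be written as a single broadcast comparison. A rough sketch, assuming the column names above and that all three columns are already datetimes (the first matching interval wins, and rows with no match get NaN):

import numpy as np

# compare every time in df_a against every [date, date_new] interval in df_b at once
t = df_a['time'].values[:, None]                                  # shape (len(df_a), 1)
in_interval = (t >= df_b['date'].values) & (t <= df_b['date_new'].values)

first_match = in_interval.argmax(axis=1)                          # position of first matching interval
has_match = in_interval.any(axis=1)                               # rows that fall in no interval at all
df_a['id'] = np.where(has_match, df_b['id'].values[first_match], np.nan)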

pandas doesn't have great support for non-equi joins, which is what you are looking for, but it does have a function merge_asof which you might want to check out: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.merge_asof.html

This should significantly speed up your join.

For example:

import pandas as pd

df_a = pd.DataFrame({'time': ['2019-01-07 22:02:56', '2019-01-07 21:57:12', '2019-01-08 09:35:30']})
df_b = pd.DataFrame({'date': ['2019-01-07 21:50:56', '2019-01-08 09:30:30'],
                     'date_new': ['2019-01-07 22:05:56', '2019-01-08 09:45:30'],
                     'id': [0, 1]})
df_a['time'] = pd.to_datetime(df_a['time'])
df_b['date'] = pd.to_datetime(df_b['date'])
df_b['date_new'] = pd.to_datetime(df_b['date_new'])

# merge_asof needs both frames sorted on their join keys (df_b is already sorted on date)
df_a.sort_values('time', inplace=True)

# for each time, take the last df_b row whose date is <= time
result = pd.merge_asof(df_a, df_b, left_on='time', right_on='date')

# drop rows where the time falls past the end of the matched interval (date_new)
result = result[result.date_new > result.time]
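
If you would rather keep the rows of df_a that match no interval (with a NaN id, as in the question's starting frame) instead of dropping them, one option is to replace that last filter with something along these lines (a rough sketch using the same result frame):

import numpy as np

# keep every row of df_a; only clear the id where the time falls outside the matched interval
no_match = ~(result.date_new > result.time)
result.loc[no_match, 'id'] = np.nan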