
I have two dataframes. One has timestamps and projects in it. The other has date ranges, projects and maturities; each maturity has to be mapped to the corresponding project within its date range.

My problem is similar to this question; however, the answer provided there is very slow and I have an extra condition to fulfill. For starters, my two dataframes look like this:

In:
import pandas as pd
df_a = pd.DataFrame({'time': ('06.05.2015  16:15:16', '22.06.2015  08:52:05', '28.05.2015  18:20:21','28.06.2015  16:19:21'),
                     'project': ('project1', 'project2', 'project2', 'project1')})

df_b = pd.DataFrame({'start-date':  ('02.05.2015 00:00:00', '26.06.2015 00:00:00', '16.05.2015 00:00:00', '30.05.2015 00:00:00'),
                     'end-date':('24.06.2015 00:00:00', '27.07.2015 00:00:00', '27.05.2015 00:00:00', '27.06.2015 00:00:00'),
                     'project': ('project1','project1','project2','project2'),
                     'maturity': ('one','two', 'one','two')})

My code looks like this:


for i in df_a.project.unique():
    for j in df_b.project.unique():
        if i == j:
            for index_df_a, row_df_a in df_a.iterrows():
                for index_df_b, row_df_b in df_b.iterrows():
                    if (row_df_a['time'] >= row_df_b['start-date']) & (row_df_a['time'] <= row_df_b['end-date']):
                        df_a.loc[index_df_a, 'maturity'] = row_df_b.loc['maturity']
                        break


Out:
                   time   project maturity
0  06.05.2015  16:15:16  project1      one
1  22.06.2015  08:52:05  project2      one
2  28.05.2015  18:20:21  project2      NaN
3  28.06.2015  16:19:21  project1      NaN

Expected result:

                   time   project maturity
0  06.05.2015  16:15:16  project1      one
1  22.06.2015  08:52:05  project2      one
2  28.05.2015  18:20:21  project2      two
3  28.06.2015  16:19:21  project1      two

  1. The code I wrote with the help of the similar question is very slow. When I apply it to my dataframes (df_a has ~900k rows and df_b has ~1.7k maturities to map), it takes a really long time to process even a thousand rows. Is there a way to speed up the process?
  2. I think my if i == j: statement is wrong. As seen in line 4 (index 3) of the result: even though the row is mapped to project1 and the timestamp 28.06.2015 16:19:21 is within the range start: 26.06.2015 00:00:00 | end: 27.07.2015 00:00:00, the maturity is NaN instead of two.
  3. Lastly, is it possible to create a condition where, even if the timestamp is not within any of the given date ranges, the next provided maturity for that project is mapped (line 3)? So if 28.05.2015 18:20:21 is not in any date range, the next date range provides the maturity, in this case two.

Excuse me if I am asking too much at once. I know it is best practice to derive the answer by asking simple questions and achieving results stepwise; however, I am not nearly experienced enough to split the problem into smaller parts.


1 Answer

  • This should be significantly faster than the current implementation
  • If you find yourself iterating through a dataframe, you're probably doing it wrong
  • Use pd.date_range with start-date and end-date to add a d_range column to df_b; .isin can then be used to check whether time from df_a falls inside d_range (see the short sketch after this list)
    • d_range will be the list of dates (a DatetimeIndex) between start-date and end-date.
    • If time is not properly formatted, it will not match the dates inside d_range.
  • The implementation will clean up the time columns in each dataframe
    • Set columns to datetime format
    • We only care about the date, not the time
    • As already stated, the datetime format is integral. If the columns are not properly formatted, time will not be found inside d_range.
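
As a quick sketch of the mechanism (not part of the implementation below), a pd.date_range built from the first row of df_b contains every calendar day in that interval, and .isin checks whether a given date is one of them:

import pandas as pd

# illustration only: the first project1 range from df_b (02.05.2015 - 24.06.2015)
rng = pd.date_range('2015-05-02', '2015-06-24')

# 06.05.2015 (row 0 of df_a) is one of the dates in that range
print(rng.isin([pd.Timestamp('2015-05-06')]).any())  # True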

Update the two dataframes:

import pandas as pd

# clean up the datetime columns and keep only the date component
df_a['time'] = (pd.to_datetime(df_a['time'], format='%d.%m.%Y %H:%M:%S')).dt.date
df_b['start-date'] = pd.to_datetime(df_b['start-date'], format='%d.%m.%Y %H:%M:%S').dt.date
df_b['end-date'] = pd.to_datetime(df_b['end-date'], format='%d.%m.%Y %H:%M:%S').dt.date

# df_a view

       time   project
 2015-05-06  project1
 2015-06-22  project2
 2015-05-28  project2
 2015-06-28  project1

# df_b view

start-date   end-date   project maturity
2015-05-02 2015-06-24  project1      one
2015-06-26 2015-07-27  project1      two
2015-05-16 2015-05-27  project2      one
2015-05-30 2015-06-27  project2      two

# add d_range to df_b
df_b['d_range'] = df_b[['start-date', 'end-date']].apply(lambda x: pd.date_range(x['start-date'], x['end-date']), axis=1)
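
Each entry of d_range is a DatetimeIndex holding every day in its interval; continuing from the code above, the first row (02.05.2015 to 24.06.2015) should contain 54 dates:

# illustration only
print(len(df_b['d_range'].iloc[0]))  # 54 dates from 2015-05-02 through 2015-06-24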

Add maturity to df_a

  • mask is the result of searching df_b for the date from df_a
    • mask matches the date regardless of project
  • result then keeps only the maturity for the matching project
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x['time']]).any())].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x['project']].reset_index(drop=True)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)

# df_a updated

       time   project maturity
 2015-05-06  project1      one
 2015-06-22  project2      two
 2015-05-28  project2      NaN
 2015-06-28  project1      two

Item 3:

  • The result from def date_query is a pandas.Series; if there is no matching date range, it will be empty, which can be checked with .empty
  • Update def date_query to check whether result is empty and, if so, call def check_min_timedelta
  • If there are multiple matching minimum values, .idxmin will return the first occurrence (see the short check after the final df_a below)
def check_min_timedelta(x):
    """
    Create a timedelta between time and end-date
    Return the maturity for the row with the minimum timedelta
    """
    end_diff = abs(df_b['end-date'][df_b['project'] == x['project']] - x['time']).idxmin()
    return df_b['maturity'].loc[end_diff]

# update def date_query
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x['time']]).any())].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x['project']].reset_index(drop=True)
    if result.empty:
        result = check_min_timedelta(x)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)

# final df_a:

       time   project maturity
 2015-05-06  project1      one
 2015-06-22  project2      two
 2015-05-28  project2      one
 2015-06-28  project1      two
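
To see why row 2 ends up with one here, a stand-alone check of the timedelta/idxmin step (an illustration with the sample values, not part of the answer's code): 28.05.2015 is one day past the end of project2's first range and about a month before the end of its second, so .idxmin() picks the first range.

import pandas as pd

# illustration only: distances from 2015-05-28 to the two project2 end-dates
end_dates = pd.to_datetime(pd.Series(['2015-05-27', '2015-06-27']))
deltas = abs(end_dates - pd.Timestamp('2015-05-28'))
print(deltas.idxmin())  # 0 -> the closer end-date, i.e. maturity 'one'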

Alternate Approach:

  • This is similar to your original approach of using inequalities to compare time to start-date & end-date
  • This approach does not rely on creating a column of date range lists
  • def date_query(x): has been updated, and .dt.date has been removed so the full timestamps are kept.
df_a['time'] = pd.to_datetime(df_a['time'], format='%d.%m.%Y %H:%M:%S')
df_b['start-date'] = pd.to_datetime(df_b['start-date'], format='%d.%m.%Y %H:%M:%S')
df_b['end-date'] = pd.to_datetime(df_b['end-date'], format='%d.%m.%Y %H:%M:%S')

def check_min_timedelta(x):
    """
    Create a timedelta between time and end-date
    Return the maturity for the row with the minimum timedelta
    """
    end_diff = abs(df_b['end-date'][df_b['project'] == x['project']] - x['time']).idxmin()
    return df_b['maturity'].loc[end_diff]

# update def date_query
def date_query(x):
    mask = df_b[['project', 'maturity']][df_b[['start-date', 'end-date']].apply(lambda y: ((x['time'] >= y['start-date']) & (x['time'] <= y['end-date'])), axis=1)].reset_index(drop=True)
    result = mask['maturity'][mask['project'] == x['project']].reset_index(drop=True)
    if result.empty:
        result = check_min_timedelta(x)
    return result

# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)
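
As a quick hand check of the inequality test for a single row (illustration only, using the sample data): 22.06.2015 08:52:05 for project2 lies between 30.05.2015 and 27.06.2015, so the mask keeps that row and maturity two is returned.

# illustration only
t = pd.Timestamp('2015-06-22 08:52:05')
start, end = pd.Timestamp('2015-05-30'), pd.Timestamp('2015-06-27')
print((t >= start) & (t <= end))  # True -> maturity 'two'

With the sample data, the resulting df_a should match the final df_a shown above.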