I have two dataframes. One that has timestamps and projects in it. The other has date ranges, projects and maturities, which have to be mapped to the corresponding project within a date range.
My problem is similar to this question, however the provided answer is very slow and I have an extra condition to be fulfilled. For starters my two dataframes kind of look like this:
In:
import pandas as pd
df_a = pd.DataFrame({'time': ('06.05.2015 16:15:16', '22.06.2015 08:52:05', '28.05.2015 18:20:21','28.06.2015 16:19:21'),
'project': ('project1', 'project2', 'project2', 'project1')})
df_b = pd.DataFrame({'start-date': ('02.05.2015 00:00:00', '26.06.2015 00:00:00', '16.05.2015 00:00:00', '30.05.2015 00:00:00'),
'end-date':('24.06.2015 00:00:00', '27.07.2015 00:00:00', '27.05.2015 00:00:00', '27.06.2015 00:00:00'),
'project': ('project1','project1','project2','project2'),
'maturity': ('one','two', 'one','two')})
My code looks like this:
for i in df_a.project.unique():
for j in df_b.project.unique():
if i == j:
for index_df_a, row_df_a in df_a.iterrows():
for index_df_b, row_df_b in df_b.iterrows():
if (row_df_a['time'] >= row_df_b['start-date']) & (row_df_a['time'] <= row_df_b['end-date']):
df_a.loc[index_df_a, 'maturity'] = row_df_b.loc['maturity']
break
Out:
time project maturity
0 06.05.2015 16:15:16 project1 one
1 22.06.2015 08:52:05 project2 one
2 28.05.2015 18:20:21 project2 NaN
3 28.06.2015 16:19:21 project1 NaN
Expected result:
time project maturity
0 06.05.2015 16:15:16 project1 one
1 22.06.2015 08:52:05 project2 one
2 28.05.2015 18:20:21 project2 two
3 28.06.2015 16:19:21 project1 two
- The code I wrote with the help of the similar question is very slow. When I am applying it to my dataframe (df_a has ~ 900k rows and df_b has ~1.7k maturity to map), it takes a really long time to process even a thousand rows. Is there a way to speed up the process?
- I think my
if i==j:
statement is wrong. As seen in line 4 of the result: even though the project is mapped toproject1
and the timestamp28.06.2015 16:19:21
is within the range ofstart:26.06.2015 00:00:00 | end: 27.07.2015 00:00:00
the maturity isNaN
instead oftwo
. - Lastly, is it possible to create a condition, where even if the timestamp is not within give date ranges, the next provided maturity for each project is mapped (line 3)? So if
28.05.2015 18:20:21
is not in any date range, the next date range provides the maturity. In this casetwo
.
Excuse if I am asking too much at once. I know it is best practice to derive the answer by asking the simple questions and achieve results stepwise, however I am not nearly experienced enough to be able to split the problem into smaller parts.