
I'm sure this is simple, but I can't wrap my head around it. Essentially I have two dataframes: a large df that contains process data every six hours, and a smaller df that contains a condition number, a start date and an end date. I need to fill the condition column of the large dataframe with the condition number that corresponds to the date range, or else leave it blank if the date does not fall within any date range in the small df. So my two frames would look like this:

Large df
Date            P1  P2
7/1/2019 11:00  102 240
7/1/2019 17:00  102 247
7/1/2019 23:00  100 219
7/2/2019 5:00   107 213
7/2/2019 11:00  100 226
7/2/2019 17:00  104 239
7/2/2019 23:00  110 240
7/3/2019 5:00   110 232
7/3/2019 11:00  102 215
7/3/2019 17:00  103 219
7/3/2019 23:00  107 243
7/4/2019 5:00   107 246
7/4/2019 11:00  103 219
7/4/2019 17:00  105 220
7/4/2019 23:00  107 220
7/5/2019 5:00   107 227
7/5/2019 11:00  108 208
7/5/2019 17:00  110 248
7/5/2019 23:00  107 235


Small df
Condition   Start Time  End Time
A        7/1/2019 11:00 7/2/2019 5:00
B        7/3/2019 5:00  7/3/2019 23:00
C        7/4/2019 23:00 7/5/2019 17:00

And I need the result to look like this:

Date            P1  P2  Cond
7/1/2019 11:00  102 240 A
7/1/2019 17:00  102 247 A
7/1/2019 23:00  100 219 A
7/2/2019 5:00   107 213 A
7/2/2019 11:00  100 226 
7/2/2019 17:00  104 239 
7/2/2019 23:00  110 240 
7/3/2019 5:00   110 232 B
7/3/2019 11:00  102 215 B
7/3/2019 17:00  103 219 B
7/3/2019 23:00  107 243 B
7/4/2019 5:00   107 246 
7/4/2019 11:00  103 219 
7/4/2019 17:00  105 220 
7/4/2019 23:00  107 220 C
7/5/2019 5:00   107 227 C
7/5/2019 11:00  108 208 C
7/5/2019 17:00  110 248 C
7/5/2019 23:00  107 235 

4 Answers


You need:

for i, row in sdf.iterrows():
    df.loc[df['Date'].between(row['Start Time'], row['End Time']), 'Cond'] = row['Condition']

Output:

                Date    P1  P2  Cond
0   2019-07-01 11:00:00 102 240 A
1   2019-07-01 17:00:00 102 247 A
2   2019-07-01 23:00:00 100 219 A
3   2019-07-02 05:00:00 107 213 A
4   2019-07-02 11:00:00 100 226 NaN
5   2019-07-02 17:00:00 104 239 NaN
6   2019-07-02 23:00:00 110 240 NaN
7   2019-07-03 05:00:00 110 232 B
8   2019-07-03 11:00:00 102 215 B
9   2019-07-03 17:00:00 103 219 B
10  2019-07-03 23:00:00 107 243 B
11  2019-07-04 05:00:00 107 246 NaN
12  2019-07-04 11:00:00 103 219 NaN
13  2019-07-04 17:00:00 105 220 NaN
14  2019-07-04 23:00:00 107 220 C
15  2019-07-05 05:00:00 107 227 C
16  2019-07-05 11:00:00 108 208 C
17  2019-07-05 17:00:00 110 248 C
18  2019-07-05 23:00:00 107 235 NaN
  • I have to disagree; `iterrows` is slow and doesn't scale, so it shouldn't be used. – Nathan Furnal Aug 26 '19 at 20:11
  • The dataframe being iterated over is supposed to be *small* and only used as a lookup. That is why [iterrows is justified](https://stackoverflow.com/questions/24870953/does-pandas-iterrows-have-performance-issues). @NaturalFrequency – harvpan Aug 26 '19 at 20:13
  • I understand, but without knowing the OP's setting, and whether they might use it later on larger dataframes, I still don't think it's correct to present it as the answer. – Nathan Furnal Aug 26 '19 at 20:18
  • I do not see a problem with this solution. @NaturalFrequency – harvpan Aug 26 '19 at 20:20
  • Let's agree to disagree – Nathan Furnal Aug 26 '19 at 20:28
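For reference, a runnable, self-contained sketch of this approach (using an abbreviated version of the question's sample data; `between` is inclusive on both ends by default, which matches the desired output):

```python
import pandas as pd

# Abbreviated versions of the question's two frames
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-07-01 11:00", "2019-07-02 05:00",
                            "2019-07-02 11:00", "2019-07-03 05:00"]),
    "P1": [102, 107, 100, 110],
    "P2": [240, 213, 226, 232],
})
sdf = pd.DataFrame({
    "Condition": ["A", "B"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00"]),
})

# For each lookup interval, tag every row of df whose Date falls inside it;
# rows matching no interval are left as NaN
for _, row in sdf.iterrows():
    mask = df["Date"].between(row["Start Time"], row["End Time"])
    df.loc[mask, "Cond"] = row["Condition"]

print(df)
```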

You may try pd.IntervalIndex and map as follows:

inx = pd.IntervalIndex.from_arrays(df2['Start Time'], df2['End Time'], closed='both')
df2.index = inx
df1['cond'] = df1.Date.map(df2.Condition)

Out[423]:
                  Date   P1   P2 cond
0  2019-07-01 11:00:00  102  240    A
1  2019-07-01 17:00:00  102  247    A
2  2019-07-01 23:00:00  100  219    A
3  2019-07-02 05:00:00  107  213    A
4  2019-07-02 11:00:00  100  226  NaN
5  2019-07-02 17:00:00  104  239  NaN
6  2019-07-02 23:00:00  110  240  NaN
7  2019-07-03 05:00:00  110  232    B
8  2019-07-03 11:00:00  102  215    B
9  2019-07-03 17:00:00  103  219    B
10 2019-07-03 23:00:00  107  243    B
11 2019-07-04 05:00:00  107  246  NaN
12 2019-07-04 11:00:00  103  219  NaN
13 2019-07-04 17:00:00  105  220  NaN
14 2019-07-04 23:00:00  107  220    C
15 2019-07-05 05:00:00  107  227    C
16 2019-07-05 11:00:00  108  208    C
17 2019-07-05 17:00:00  110  248    C
18 2019-07-05 23:00:00  107  235  NaN
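A self-contained sketch of the same idea on abbreviated data (this assumes the intervals do not overlap; a date that falls in no interval maps to NaN):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2019-07-01 11:00", "2019-07-02 11:00",
                            "2019-07-03 05:00"]),
})
df2 = pd.DataFrame({
    "Condition": ["A", "B"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00"]),
})

# Index the lookup table by its [start, end] intervals, then map each date
# to the Condition of the interval that contains it
df2.index = pd.IntervalIndex.from_arrays(df2["Start Time"], df2["End Time"],
                                         closed="both")
df1["cond"] = df1["Date"].map(df2["Condition"])

print(df1)
```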

You could do something like the following (here `s1` and `s2` are the question's two tables pasted in as raw strings):

import io
import pandas as pd

df1 = pd.read_csv(io.StringIO(s1), sep=r'\s\s+', engine='python',
                  converters={'Date': pd.to_datetime})

df2 = pd.read_csv(io.StringIO(s2), sep=r'\s\s+', engine='python',
                  converters={'Start Time': pd.to_datetime, 'End Time': pd.to_datetime})


# Reshape df2 to long form: one row per interval boundary
df2 = df2.set_index('Condition').stack().reset_index()
# For each date, take the most recent boundary at or before it
df = pd.merge_asof(df1, df2, left_on='Date', right_on=0, direction='backward')
# If that boundary is an End Time the date has already passed, the date
# falls between intervals, so blank the condition
df.loc[(df['level_1'].eq('End Time')) & (df['Date'] > df[0]), 'Condition'] = ''

print(df.iloc[:, :-2])

                  Date   P1   P2 Condition
0  2019-07-01 11:00:00  102  240         A
1  2019-07-01 17:00:00  102  247         A
2  2019-07-01 23:00:00  100  219         A
3  2019-07-02 05:00:00  107  213         A
4  2019-07-02 11:00:00  100  226          
5  2019-07-02 17:00:00  104  239          
6  2019-07-02 23:00:00  110  240          
7  2019-07-03 05:00:00  110  232         B
8  2019-07-03 11:00:00  102  215         B
9  2019-07-03 17:00:00  103  219         B
10 2019-07-03 23:00:00  107  243         B
11 2019-07-04 05:00:00  107  246          
12 2019-07-04 11:00:00  103  219          
13 2019-07-04 17:00:00  105  220          
14 2019-07-04 23:00:00  107  220         C
15 2019-07-05 05:00:00  107  227         C
16 2019-07-05 11:00:00  108  208         C
17 2019-07-05 17:00:00  110  248         C
18 2019-07-05 23:00:00  107  235        
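A self-contained sketch of this `merge_asof` approach on abbreviated data (after `stack().reset_index()` the boundary column is labeled `0` and the Start/End marker column `level_1`; both frames must be sorted on the merge keys):

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2019-07-01 11:00", "2019-07-02 11:00",
                            "2019-07-03 05:00"]),
})
df2 = pd.DataFrame({
    "Condition": ["A", "B"],
    "Start Time": pd.to_datetime(["2019-07-01 11:00", "2019-07-03 05:00"]),
    "End Time": pd.to_datetime(["2019-07-02 05:00", "2019-07-03 23:00"]),
})

# Long form: one row per interval boundary, sorted chronologically
long = df2.set_index("Condition").stack().reset_index().sort_values(0)

# Match each date with the most recent boundary at or before it
out = pd.merge_asof(df1, long, left_on="Date", right_on=0,
                    direction="backward")

# A date whose nearest boundary is an End Time it has already passed
# lies between intervals, so blank its condition
out.loc[(out["level_1"] == "End Time") & (out["Date"] > out[0]),
        "Condition"] = ""

print(out[["Date", "Condition"]])
```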
# Add an empty Cond column, then fill it one lookup interval at a time
df1.insert(3, "Cond", [None] * len(df1))
for i in range(len(df2)):
    in_range = (df1["Date"] >= df2["Start Time"].loc[i]) & (df1["Date"] <= df2["End Time"].loc[i])
    df1.loc[in_range, "Cond"] = df2["Condition"].loc[i]