I have a problem setup using a pandas IntervalIndex, similar to those posed here and here, but with differences that have thus far prevented me from successfully implementing their solutions.
I have two DataFrames, constructed as such:
df1
month group time distance
0 1 A 10 100
1 1 A 20 120
2 1 A 25 110
3 1 B 5 90
4 1 B 7 99
5 1 B 17 60
6 1 C 30 200
7 1 C 35 170
8 2 A 5 40
9 2 A 10 27
10 2 B 17 33
11 2 B 19 65
df1.set_index(['month', 'group', 'time'], inplace=True)
df2
month group start end period
0 1 A 5 15 1
1 1 A 15 21 2
2 1 A 21 30 3
3 1 B 2 10 1
4 1 B 10 20 2
5 2 A 3 8 1
6 2 A 8 15 2
7 2 B 10 20 1
df2.set_index(['month', 'group'], inplace=True)
Where start
and end
in df2
can be used to define an interval. My real data are much larger, with df1
on the order of 100,000 rows and df2
on the order of 10,000 rows.
What I would like to do is assign the period
variable to df1
, matching on month
, group
, and if df1['time']
is in the interval in df2
.
There are two complications over the problems posed in the questions referenced above:
The
IntervalIndex
needs to be just one part of aMultiIndex
, and the matching needs to happen on all three levels of the index.Sometimes a time exists in
df1
that doesn't match an interval indf2
. I don't have a strong preference for if those rows are populated withNaN
or another missing value, or if they are just removed altogether, but what I can't have is aKeyError
.
My first steps were:
df2Index = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='left')
df2.set_index(df2Index, append=True, inplace=True)
to apply the IntervalIndex to df2
while keeping month
and group
as higher level indices.
I tried a few approaches, such as
period = df2.loc[df2.index.get_indexer(df1.index), 'period']
period = df2.get_loc(df1.index), 'period']
But I have been unable to get the correct indexing behavior. For completeness, my desired result is:
month group time distance period
0 1 A 10 100 1
1 1 A 20 120 2
2 1 A 25 110 3
3 1 B 5 90 1
4 1 B 7 99 1
5 1 B 17 60 2
6 1 C 30 200 NaN
7 1 C 35 170 NaN
8 2 A 5 40 1
9 2 A 10 27 2
10 2 B 17 33 1
11 2 B 19 65 1
(or the same table but without the two rows where period
is NaN
).
My backup plan is to loop over the relevant combinations of month
and group
, subset the two tables appropriately, use the IntervalIndex on these subset tables, then recombine them. But that seems like the wrong solution.