
I have a problem setup using a pandas IntervalIndex, similar to those posed here and here, but with differences that have thus far prevented me from successfully implementing their solutions.

I have two DataFrames, constructed as such:

df1

    month   group   time    distance
0   1       A       10      100
1   1       A       20      120
2   1       A       25      110
3   1       B       5       90
4   1       B       7       99
5   1       B       17      60
6   1       C       30      200
7   1       C       35      170
8   2       A       5       40
9   2       A       10      27
10  2       B       17      33
11  2       B       19      65

df1.set_index(['month', 'group', 'time'], inplace=True)

df2

    month   group   start   end     period
0   1       A       5       15      1
1   1       A       15      21      2
2   1       A       21      30      3
3   1       B       2       10      1
4   1       B       10      20      2
5   2       A       3       8       1
6   2       A       8       15      2
7   2       B       10      20      1

df2.set_index(['month', 'group'], inplace=True)

Here, start and end in df2 define an interval. My real data are much larger: df1 is on the order of 100,000 rows and df2 on the order of 10,000 rows.

What I would like to do is assign the period variable to df1, matching on month, group, and whether df1['time'] falls within an interval in df2.

There are two complications over the problems posed in the questions referenced above:

  1. The IntervalIndex needs to be just one part of a MultiIndex, and the matching needs to happen on all three levels of the index.

  2. Sometimes a time exists in df1 that doesn't match an interval in df2. I don't have a strong preference for whether those rows are populated with NaN or another missing value, or whether they are removed altogether, but what I can't have is a KeyError.

My first steps were:

df2Index = pd.IntervalIndex.from_arrays(df2['start'], df2['end'], closed='left')
df2.set_index(df2Index, append=True, inplace=True)

to apply the IntervalIndex to df2 while keeping month and group as higher level indices.
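
For reference, df2 now carries a three-level MultiIndex, with the unnamed interval level appended last:

print(df2.index.names)   # ['month', 'group', None]
print(df2.index[0])      # (1, 'A', Interval(5, 15, closed='left'))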

I tried a few approaches, such as

period = df2.loc[df2.index.get_indexer(df1.index), 'period']
period = df2.loc[df2.index.get_loc(df1.index), 'period']

But I have been unable to get the correct indexing behavior. For completeness, my desired result is:

    month   group   time    distance    period
0   1       A       10      100         1
1   1       A       20      120         2
2   1       A       25      110         3
3   1       B       5       90          1
4   1       B       7       99          1
5   1       B       17      60          2
6   1       C       30      200         NaN
7   1       C       35      170         NaN
8   2       A       5       40          1
9   2       A       10      27          2
10  2       B       17      33          1
11  2       B       19      65          1

(or the same table but without the two rows where period is NaN).

My backup plan is to loop over the relevant combinations of month and group, subset the two tables appropriately, use the IntervalIndex on these subset tables, then recombine them. But that seems like the wrong solution.

Connor J

3 Answers


I came across this recently, and since I haven't found a solution that uses the MultiIndex directly, I think you need to operate on grouped data and concatenate afterwards.

Using your df1 and df2 (before the set_index calls, so month and group are still columns):

df1_grouped = df1.groupby(['month', 'group'])
results = []
for gid, grp in df1_grouped:
    # Restrict df2 to the (month, group) pair of this df1 group
    df2_grouped = df2[(df2['month'] == gid[0]) & (df2['group'] == gid[1])]

    # Build the interval index for the lookup table, half-open on the right
    int_idx = pd.IntervalIndex.from_arrays(df2_grouped['start'], df2_grouped['end'], closed='left')

    if len(int_idx) > 0:
        # Index the lookup table by interval and pull the period for each time
        period = df2_grouped.set_index(int_idx).loc[grp['time'], 'period'].values
        results.append(grp.assign(period=period))
    else:
        # No intervals for this group (e.g. month 1, group C): leave period unset
        results.append(grp)
pd.concat(results).reset_index(drop=True)

Output:

    month   group   time    distance    period
0   1       A       10      100         1.0
1   1       A       20      120         2.0
2   1       A       25      110         3.0
3   1       B       5       90          1.0
4   1       B       7       99          1.0
5   1       B       17      60          2.0
6   1       C       30      200         NaN
7   1       C       35      170         NaN
8   2       A       5       40          1.0
9   2       A       10      27          2.0
10  2       B       17      33          1.0
11  2       B       19      65          1.0
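
If a time can fall outside every interval of a group that does exist in df2, the .loc lookup above raises a KeyError. A sketch of the same loop using IntervalIndex.get_indexer, which returns -1 for misses (assuming the intervals within each group don't overlap):

import numpy as np
import pandas as pd

results = []
for gid, grp in df1.groupby(['month', 'group']):
    sub = df2[(df2['month'] == gid[0]) & (df2['group'] == gid[1])]
    if len(sub) > 0:
        int_idx = pd.IntervalIndex.from_arrays(sub['start'], sub['end'], closed='left')
        pos = int_idx.get_indexer(grp['time'])   # -1 where no interval contains the time
        period = sub['period'].to_numpy(dtype=float)[pos]
        period[pos == -1] = np.nan               # blank out the misses
    else:
        period = np.nan                          # no intervals at all for this group
    results.append(grp.assign(period=period))
pd.concat(results).reset_index(drop=True)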

I would merge the two frames using a left join, then filter to keep only those rows where time falls within the start and end bounds:

import pandas as pd

# setup the frame
df1 = pd.DataFrame(
    data={
        'month': [1,1,1,1,1,1,1,1,2,2,2,2],
        'group': ['A','A','A','B','B','B','C','C','A','A','B','B'],
        'time': [10,20,25,5,7,17,30,35,5,10,17,19],
        'distance': [100,120,110,90,99,60,200,170,40,27,33,65],
    })

df2 = pd.DataFrame(
    data={
        'month': [1,1,1,1,1,2,2,2],
        'group': ['A','A','A','B','B','A','A','B'],
        'start': [5,15,21,2,10,3,8,10],
        'end': [15,21,30,10,20,8,15,20],
        'period': [1,2,3,1,2,1,2,1],
    })

# merge, and filter
df = df1.merge(df2, how='left', on=['month', 'group'])

# half-open check (time < end) to match the closed='left' intervals
df = df[(df.time >= df.start) & (df.time < df.end)]
df = df[['month', 'group', 'time', 'distance', 'period']].reset_index(drop=True)

print(df)

    month   group   time    distance    period
0   1       A       10      100         1.0
1   1       A       20      120         2.0
2   1       A       25      110         3.0
3   1       B       5       90          1.0
4   1       B       7       99          1.0
5   1       B       17      60          2.0
6   2       A       5       40          1.0
7   2       A       10      27          2.0
8   2       B       17      33          1.0
9   2       B       19      65          1.0

Note that the frame above excludes the NaN rows. If you want to keep them, widen the filter criteria: df[((df.time >= df.start) & (df.time < df.end)) | (df.period.isnull())]
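
Applied end to end, that variant looks like this sketch (df_all and in_interval are just illustrative names):

# keep rows with no df2 match (group C) as NaN instead of dropping them
df_all = df1.merge(df2, how='left', on=['month', 'group'])
in_interval = (df_all.time >= df_all.start) & (df_all.time < df_all.end)
df_all = df_all[in_interval | df_all.period.isnull()]
df_all = df_all[['month', 'group', 'time', 'distance', 'period']].reset_index(drop=True)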

NickBraunagel
  • So this will probably work and I'll give it a shot, but my one concern is scalability. In my real dataset, `df1` has on the order of several hundred thousand rows at least (and about 15 columns), and `df2` has tens of thousands of rows. I'm concerned that this left merge would be less than optimal from a performance perspective. – Connor J Mar 28 '18 at 15:38
  • Yeah, the above solution will likely not work for larger data sets (make sure to post the general size of your data sets so we can provide you with appropriate answers). You might need to look into pandas HDF capabilities. – NickBraunagel Apr 01 '18 at 15:28

I'm trying to do this just now. My approach is to create a concatenated index column using

df['ID'].astype(str) + df['date'].astype(str).str.replace('-', '')
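
Adapted to the columns in this question, that idea is roughly the sketch below; it only collapses the month/group match into a single key, so the interval containment still has to be handled separately:

# hypothetical single-key version of the month/group match
df1['key'] = df1['month'].astype(str) + '_' + df1['group']
df2['key'] = df2['month'].astype(str) + '_' + df2['group']
merged = df1.merge(df2.drop(columns=['month', 'group']), on='key', how='left')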
Anshul Goyal