
I want to expand my data frame based on numeric values in two columns (index_start and index_end). My df looks like this:

item  index_start  index_end    
A          1            3
B          4            7

I want this to expand so that it creates rows for A from 1 to 3 and rows for B from 4 to 7, like so:

item  index_start  index_end    index
A          1            3         1
A          1            3         2
A          1            3         3
B          4            7         4
B          4            7         5
B          4            7         6
B          4            7         7

I'm unsure how to implement this in Python/pandas.
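
For reference, the example frame can be built like so:

import pandas as pd

df = pd.DataFrame({'item': ['A', 'B'],
                   'index_start': [1, 4],
                   'index_end': [3, 7]})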


3 Answers


You could use .explode():

# Build the per-row list of indices, then explode each list into its own row.
df['index'] = df.apply(lambda row: list(range(row['index_start'], row['index_end'] + 1)), axis=1)
df.explode('index')

  item  index_start  index_end index
0    A            1          3     1
0    A            1          3     2
0    A            1          3     3
1    B            4          7     4
1    B            4          7     5
1    B            4          7     6
1    B            4          7     7
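
One small caveat: explode leaves the new column with object dtype and keeps the repeated row index, so a minor cleanup may be wanted (a sketch, assuming you want integer values and a fresh 0-based index):

# 'index' comes out as object dtype and the row index repeats after explode;
# cast to int and reset the index for a tidier result.
out = df.explode('index').reset_index(drop=True)
out['index'] = out['index'].astype(int)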
Andreas

We can use pd.Index.repeat with df.reindex to repeat each row index_end - index_start + 1 times, then generate the values between index_start and index_end using itertools.starmap.

from itertools import starmap

import numpy as np

# Number of expanded rows for each original row.
rep = (df['index_end'] - df['index_start']) + 1
idx = df.index.repeat(rep)

# Build (start, stop) pairs for np.arange; stop is exclusive, hence the +1.
vals = df.to_numpy()[:, 1:]
vals[:, 1] += 1
index = np.hstack([*starmap(np.arange, vals)])

df.reindex(idx).assign(index=index)

  item  index_start  index_end  index
0    A            1          3      1
0    A            1          3      2
0    A            1          3      3
1    B            4          7      4
1    B            4          7      5
1    B            4          7      6
1    B            4          7      7

@Andreas has a nice, readable solution, but it contains two bottleneck operations that increase the runtime: df.apply over axis=1 and df.explode.

Timeit analysis:

Benchmarking setup:

df = pd.concat([df]*1000, ignore_index=True)

# @Ch3steR's solution
In [130]: %%timeit
     ...: rep = (df['index_end'] - df['index_start']) + 1
     ...: idx = df.index.repeat(rep)
     ...: vals = df.to_numpy()[:, 1:]
     ...: vals[:, 1] += 1
     ...: index = np.hstack([*starmap(np.arange, vals)])
     ...: df.reindex(idx).assign(index=index)
4.82 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# @Andreas' solution
In [131]: %%timeit
     ...: df['index'] = df.apply(lambda row: list(range(row['index_start'], row['index_end'] + 1)), axis=1)
     ...: df.explode('index')
20.4 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I ran the tests a few times; my solution is about 4x faster. We could improve the speed further if we could vectorize the index = np.hstack([*starmap(np.arange, vals)]) step.
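
For example, one way to vectorize that step is the repeat/cumsum trick for building all the per-row ranges at once (a sketch; not benchmarked here):

import numpy as np

# Offsets 0, 1, 2, ... that restart at every row boundary, added to the
# repeated start values, give all the ranges without a Python-level loop.
rep = (df['index_end'] - df['index_start'] + 1).to_numpy()
offsets = np.arange(rep.sum()) - np.repeat(np.cumsum(rep) - rep, rep)
index = np.repeat(df['index_start'].to_numpy(), rep) + offsets
df.reindex(df.index.repeat(rep)).assign(index=index)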

Ch3steR
  • repeat / reindex is a neat solution. The performance vs. readability trade-off is also a good point. – Andreas Aug 25 '21 at 18:18
  • @Andreas Thank you. Yeah, readability over performance any day, as long as the readable solution isn't dragging you down. Great solution to you too. – Ch3steR Aug 25 '21 at 18:24

FWIW, a small optimization on Ch3steR's nice answer is to create the index column by doing a groupby, taking a cumulative count of each group, and then offsetting the count by index_start:

rep = (df['index_end'] - df['index_start']) + 1
# .copy() avoids a SettingWithCopyWarning on the assignment below.
res = df.loc[df.index.repeat(rep)].copy()
res['index'] = res.groupby(res.index).cumcount() + res['index_start']

Output (res):

  item  index_start  index_end  index
0    A            1          3      1
0    A            1          3      2
0    A            1          3      3
1    B            4          7      4
1    B            4          7      5
1    B            4          7      6
1    B            4          7      7

And the timing, a la Ch3steR:

df = pd.concat([df]*1000, ignore_index=True)

%%timeit
rep = (df['index_end'] - df['index_start']) + 1
res = df.loc[df.index.repeat(rep)]
res['index'] = res.groupby(res.index).cumcount() + res['index_start']
# 1.66 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You'll have to trust that my timings for the other approaches are similar to what was posted : )


But to me, this feels like overkill for creating an index. If the indices referenced in your original rows are always contiguous and non-overlapping (i.e. if one entry spans i to j, the next starts at j + 1), then index is just going to be a series of integers. So creating it with a range should be correct:

res['index'] = np.arange(1, len(res) + 1)  # len(res), not len(df): one value per expanded row

Or, roughly equivalently, you could just reset the DataFrame's index instead of creating an index column:

res = res.reset_index(drop=True)

It seems from your example input that this would be sufficient. But the counter-case would be if your index_start and index_end don't denote literal row numbers and are just used to compute how many times to repeat things. For example, if your data looked like this:

# now there are gaps in the indices, and it doesn't start at 1
df = pd.DataFrame({'item': ['A', 'B'],
                   'index_start': [3, 19],
                   'index_end': [5, 22]})

If you still wanted the first row repeated 3 times and the second one 4 times, AND you wanted to create that index column (in this case, [3, 4, 5, 19, 20, 21, 22]), you would need the groupby operation or something similar, as in the sketch below.
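
A quick sketch of the groupby approach on that gapped frame (same code as above, only the data differs):

rep = (df['index_end'] - df['index_start']) + 1
res = df.loc[df.index.repeat(rep)].copy()
res['index'] = res.groupby(res.index).cumcount() + res['index_start']
res['index'].tolist()  # [3, 4, 5, 19, 20, 21, 22]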

Tom