FWIW, a small optimization on Ch3steR's nice answer is to create the index
column by doing a groupby and taking a cumulative count of each group, and then offsetting the count by the index_start
:
rep = (df['index_end'] - df['index_start']) + 1
res = df.loc[df.index.repeat(rep)]
res['index'] = res.groupby(res.index).cumcount() + res['index_start']
Output (res
):
item index_start index_end index
0 A 1 3 1
0 A 1 3 2
0 A 1 3 3
1 B 4 7 4
1 B 4 7 5
1 B 4 7 6
1 B 4 7 7
And the timing, a la Ch3steR:
df = pd.concat([df]*1000, ignore_index=True)
%%timeit
rep = (df['index_end'] - df['index_start']) + 1
res = df.loc[df.index.repeat(rep)]
res['index'] = res.groupby(res.index).cumcount() + res['index_start']
# 1.66 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You'll have to trust that my timings for the other approaches are similar to what was posted : )
But to me, this feels like overkill for creating an index. If the indices referenced in your original rows are always contiguous and non-overlapping (i.e. if one entry spans i to j, the next starts at j + 1), then index
is just going to be a series of integers. So creating it with a range
should be correct:
res['index'] = np.arange(1, len(df) + 1)
Or maybe functionally-equivalently, you could just reset the DataFrame's index instead of creating index
:
res = res.reset_index(drop=True)
It seems from your example input that this would be sufficient. But the counter case would be if your index_start
and index_end
don't mean literal row numbers, and are more just used to compute how to repeat things. So for example, if your data looked like so:
# now there are gaps in the indices, and it doesn't start at 1
df = pd.DataFrame({'item':['A','B'],
'index_start':[3, 5],
'index_end':[19, 22]})
But you still wanted the first row to be repeated 3 times and second one 4 times, AND you wanted to create that index
column (in this case, [3,4,5,19,20,21,22]
), you would need to do the groupby operation or something similar.