I am struggling with this one.
Let's assume a dataframe that looks like this:
df = pd.DataFrame({'col0':['string1', 'string2'],
'col1':['some string','another string'],
'start':[100,1],
'end':[107,5]})
col0 col1 start end
0 string1 some string 100 107
1 string2 another string 1 5
The goal is the find the difference between start
and end
and then add that many rows to my dataframe, ffill
the rest of the columns, and add a cumulative count for the range between start
and end
. Expected output below:
df2 = pd.DataFrame({'col0':['string1']*8,
'col1':['some string']*8,
'new_col':[x for x in range(100,108)]})
df3 = pd.DataFrame({'col0':['string2']*5,
'col1':['another string']*5,
'new_col':[x for x in range(1,6)]})
output = pd.concat([df2,df3]).reset_index(drop=True)
col0 col1 new_col
0 string1 some string 100
1 string1 some string 101
2 string1 some string 102
3 string1 some string 103
4 string1 some string 104
5 string1 some string 105
6 string1 some string 106
7 string1 some string 107
8 string2 another string 1
9 string2 another string 2
10 string2 another string 3
11 string2 another string 4
12 string2 another string 5
My first though was to create a new dataframe...something like:
vals = list(zip(df['start'], df['end']+1))
pd.concat([pd.DataFrame([i], columns=['new_col']) for val in vals for i in range(*val)])
but this seems rather inefficient and I am struggling to add the remaining data.