I need to generate a dataframe based on another one. There are two steps based on input df.
The input df has 4 columns. The output should be done this way:
1) Take value from col1
to generate that many rows in output, where col opt
is rewritten, new_col1
equals f"{value_from_col0}_{loop_iterator_with_limit_from_col1}"
, column src
equals 'src1'.
2) Take value from col2
, split with |
as a separator. For each split element, find it in the input df, take value from col0
and generate rows in a similar way as in 1). src
equals 'src2'.
df = pd.DataFrame([
['opt1', 'a', 2, ''],
['opt2', 'b', 1, ''],
['opt9', 'z', 3, 'a|b'],
['opt8', 'y', 3, 'a']],
columns=['opt', 'col0', 'col1', 'col2'])
out = pd.DataFrame()
new_rows = []
for i, row in df.iterrows():
for j in range(row['col1']):
new_row = dict()
new_row['opt'] = row['opt']
new_row['new_col'] = f"{row['col0']}_{j+1}"
new_row['src'] = 'src1'
new_rows.append(new_row)
for s in row['col2'].split('|'):
if s:
col1_value = df.loc[df['col0'] == s]['col1'].values[0]
for k in range(col1_value):
new_row = dict()
new_row['opt'] = row['opt']
new_row['new_col'] = f"{s}_{k + 1}"
new_row['src'] = 'src2'
new_rows.append(new_row)
out = out.append(new_rows, ignore_index=True)
Below you can find the expected output. I used iterrows()
which is pretty slow. I believe there is a more efficient pandas' way to achieve same thing. Of course, it can be sorted in a different way, it doesn't matter.
new_col opt src
0 a_1 opt1 src1
1 a_2 opt1 src1
2 b_1 opt2 src1
3 z_1 opt9 src1
4 z_2 opt9 src1
5 z_3 opt9 src1
6 a_1 opt9 src2
7 a_2 opt9 src2
8 b_1 opt9 src2
9 y_1 opt8 src1
10 y_2 opt8 src1
11 y_3 opt8 src1
12 a_1 opt8 src2
13 a_2 opt8 src2