I'm trying to reshape a dataframe to get it into a more usable structure for graphing and right now what I've been doing is reshaping the df using iterrows or itertuples, following What is the most efficient way to loop through dataframes with pandas?
Below is a overly simplified dataset, but the real dataset will have tens of thousands of more rows.
group subtopic code
fruit grapes 110A
fruit apple 110B
meat pork 220A
meat chicken 220B
meat duck 220C
vegetable lettuce 300A
vegetable tomato 310A
vegetable asparagus 320A
Basically, I want to create a new column ("code2") based on whether column ("code") shares the same value in column "group".
I tried running the following code:
df = pd.read_excel(file1, sheetname = 'Sheet3')
def reshape_iterrows(df):
reshape = []
for i, j, in df.iterrows():
for _, k in df.iterrows():
if (j['code'] == k['code']):
pass
elif j['group'] == 'nan':
reshape.append({'code1':j['code'],
'code2': j['code'],
'group': 'None'})
elif (j['group'] == k['group']):
reshape.append({'code1': j['code'],
'code2': k['code'],
'group': j['group']})
else:
pass
return reshape
reshape_iterrows(df)
or using itertuples:
def reshape_iterrows(df):
reshape = []
for row1 df.itertuples():
for row2 in df.itertuples():
if (row1[3] == row2[3]):
pass
elif row1[1] == 'nan':
reshape.append({'code1':row1[3],
'code2': row1[3],
'group': 'None'})
elif (row1[1] == row2[1]):
reshape.append({'code1': row1[3],
'code2': row2[3],
'group': row1[1]})
else:
pass
return reshape
I pass reshape to pd.DataFrame() and the expected output is below, which I then use the code1 and code2 columns as the source and target parameters within nx.from_pandas_edgelist to generate a graph.
code1 code2 group
0 110A 110B fruit
1 110B 110A fruit
2 220A 220B meat
3 220A 220C meat
4 220B 220A meat
5 220B 220C meat
6 220C 220A meat
7 220C 220B meat
8 300A 300B vegetable
9 300A 300C vegetable
10 300B 300A vegetable
11 300B 300C vegetable
12 300C 300A vegetable
13 300C 300B vegetable
Like others I'm interested in finding a more efficient way to iterate perhaps using Numpy's boolean operations? Looking for guidance on how I would approach getting the same result using vectorized/array operations.
Thanks!