I am looking for efficient way to compute difference of all rows in dask dataframe(preferred) Or any efficient way in pandas df.I have huge file with millions of rows, it is taking like forever to compute this.. Below is example:
d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])
print(df)
tmp = pd.DataFrame(columns=['A', 'B', 'C'])
row_pointer = 1
for index, row in df.iterrows():
for i, r in df.iloc[row_pointer:].iterrows():
tmp.loc['(' + index + ' - ' + i + ')'] = df.loc[index] - df.loc[i]
row_pointer += 1
print(tmp)
Output:
A B C
i1 5 7 8
i2 9 6 11
i3 15 3 2
i4 0 1 5
i5 2 4 3
A B C
(i1 - i2) -4 1 -3
(i1 - i3) -10 4 6
(i1 - i4) 5 6 3
(i1 - i5) 3 3 5
(i2 - i3) -6 3 9
(i2 - i4) 9 5 6
(i2 - i5) 7 2 8
(i3 - i4) 15 2 -3
(i3 - i5) 13 -1 -1
(i4 - i5) -2 -3 2