
I am looking for an efficient way to compute the difference of every pair of rows in a Dask DataFrame (preferred), or any efficient way in a pandas DataFrame. I have a huge file with millions of rows, and it is taking forever to compute this. Below is an example:

import pandas as pd

d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])
print(df)

tmp = pd.DataFrame(columns=['A', 'B', 'C'])
row_pointer = 1
for index, row in df.iterrows():
    # pair each row only with the rows that come after it
    for i, r in df.iloc[row_pointer:].iterrows():
        tmp.loc['(' + index + ' - ' + i + ')'] = df.loc[index] - df.loc[i]
    row_pointer += 1
print(tmp)

Output:

     A  B   C
i1   5  7   8
i2   9  6  11
i3  15  3   2
i4   0  1   5
i5   2  4   3
         A   B   C
(i1 - i2)   -4   1  -3
(i1 - i3)  -10   4   6
(i1 - i4)    5   6   3
(i1 - i5)    3   3   5
(i2 - i3)   -6   3   9
(i2 - i4)    9   5   6
(i2 - i5)    7   2   8
(i3 - i4)   15   2  -3
(i3 - i5)   13  -1  -1
(i4 - i5)   -2  -3   2
  • You can easily get rid of one loop by using broadcasting: `df - df.iloc[i,:]` subtracts row i from the whole df. Just loop over all rows and drop the all-zeros row per frame. – Quickbeam2k1 Sep 10 '19 at 18:39
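
A minimal sketch of the single-loop version that comment describes, reusing the question's example df (the label construction mirrors the question's code):

import pandas as pd

d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])

parts = []
for pos, idx in enumerate(df.index[:-1]):
    # one vectorized subtraction per row: row idx minus every later row
    block = df.iloc[pos] - df.iloc[pos + 1:]
    block.index = ['(' + idx + ' - ' + j + ')' for j in df.index[pos + 1:]]
    parts.append(block)

print(pd.concat(parts))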

2 Answers


Use broadcasting to do all of the subtractions. Then subset with `<` on the two index levels to keep only the unique combinations.

import pandas as pd
import numpy as np

arr = df.to_numpy()
# broadcast arr against itself: (n, 1, k) - (n, k) -> (n, n, k), then stack to 2D
res = pd.DataFrame(np.vstack(arr[:, None] - arr),
                   index=pd.MultiIndex.from_product([df.index, df.index]))
# keep each unordered pair once (level 0 label < level 1 label)
res = res[res.index.get_level_values(0) < res.index.get_level_values(1)]

print(res)

        0  1  2
i1 i2  -4  1 -3
   i3 -10  4  6
   i4   5  6  3
   i5   3  3  5
i2 i3  -6  3  9
   i4   9  5  6
   i5   7  2  8
i3 i4  15  2 -3
   i5  13 -1 -1
i4 i5  -2 -3  2

Likely not feasible with millions of rows. Alternatively, remove one level of the loop:

from itertools import chain

arr = df.to_numpy()
# for each row i, subtract the whole array and keep only the rows after i
data = [(arr[i, None] - arr)[i + 1:] for i in range(np.shape(arr)[0])]

idx = pd.MultiIndex.from_product([df.index, df.index])
idx = idx[idx.get_level_values(0) < idx.get_level_values(1)]

res = pd.DataFrame(chain.from_iterable(data), index=idx)
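
If even the assembled list triggers the MemoryError mentioned in the comments below, a variant of the same idea (a sketch; 'diffs.csv' is an illustrative output path) streams each block to disk instead of collecting it:

import numpy as np

with open('diffs.csv', 'w') as f:
    for i in range(arr.shape[0]):
        block = (arr[i, None] - arr)[i + 1:]  # row i minus every later row
        # fmt='%d' suits the integer example; use '%g' for float data
        np.savetxt(f, block, fmt='%d', delimiter=',')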
  • Upvoted. This is definitely really fast. However, I guess it will take a huge amount of memory on a dataset of millions of rows :). +1 – Andy L. Sep 10 '19 at 18:51
  • This is almost identical to my answer, but 2 minutes earlier. – Quang Hoang Sep 10 '19 at 18:52
  • Yup, it is working for a small dataset, but for a huge file I am getting 'MemoryError'. – Manvi Sep 10 '19 at 19:02
  • Though the index creation will be slow. You can look into https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas – ALollz Sep 10 '19 at 19:11
  • @ALollz memory error again on the chain.from_iterable(data) line, therefore looking into Dask and whether distributed computing can be used for such tasks. – Manvi Sep 10 '19 at 19:23
  • @Manvi honestly I don't think that's really going to solve your problem. With even just 3M rows you have 3M choose 2 (4.5 trillion) rows in your output. This is an enormous calculation. Memory constraints will add serious time and complexity, given that if you chunk the data you then need to ensure all pairwise combinations across chunks. – ALollz Sep 10 '19 at 19:30
  • @ALollz yup, agreed. My understanding is that dask (https://docs.dask.org/en/latest/dataframe.html) handles chunking and its memory management. – Manvi Sep 10 '19 at 19:43
  • It does, but I think it gets complicated when things need to be done across chunks, which is the case here. – ALollz Sep 10 '19 at 19:48
  • 1
    @ALollz I used Dask array and used lazy computation to get output. It looks like its doing the job. – Manvi Sep 24 '19 at 20:44

I used a Dask array to get the output:

import dask.array as da

arr = df.to_numpy()
x = da.from_array(arr, chunks=100)
# lazy array of all pairwise absolute differences; nothing is computed yet
diff = abs(x[:, None] - x)
# materialize one slice of row-pairs at a time
diff[0:1000].compute()
diff[1000:2000].compute()
diff[2000:3000].compute()

I used Dask's lazy computation to generate the result for larger-than-memory arrays.
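
The same pattern can be written with the slicing in a loop; a sketch, where block_size and the chunk shape are illustrative choices rather than part of the answer:

import dask.array as da

def pairwise_abs_diff_blocks(arr, block_size=1000):
    # chunk by rows only, keeping all columns together in each chunk
    x = da.from_array(arr, chunks=(block_size, arr.shape[1]))
    # lazy (n, n, k) task graph; nothing is materialized here
    diff = abs(x[:, None, :] - x[None, :, :])
    for start in range(0, arr.shape[0], block_size):
        # each .compute() materializes only a block_size x n x k slab
        yield diff[start:start + block_size].compute()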
