
I'm trying to use numpy/pandas to construct a sliding-window style comparator. I have a list of lists, each of which is a different length. I want to compare each list to another list as depicted below:

lists = [[10,15,5],[5,10],[5]]

window_diff(lists[1], lists[0]) = 25

The window diff for lists[0] and lists[1] would give 25 using the window-sliding technique written out below. Because lists[1] is the shorter list, we shift it once to the right, resulting in 2 windows of comparison. Summing the differences from both windows gives the total difference between the two lists; in this case a total of 25. Note that we are taking the absolute difference.
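Written out, the two comparison windows are:

window 1: |10 - 5| + |15 - 10| = 5 + 5  = 10   (align [5, 10] with the first two elements of [10, 15, 5])
window 2: |15 - 5| + |5 - 10|  = 10 + 5 = 15   (shift [5, 10] one step to the right)
total:    10 + 15 = 25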

The function should aggregate the total window_diff between each list and the other lists, so in this case

tot = total_diffs(lists)
tot
>> [40, 30, 20]

# where tot[0] represents the sum of lists[0]'s window_diff with all the other lists.
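For the example above, the pairwise window diffs are window_diff(lists[0], lists[1]) = 25, window_diff(lists[0], lists[2]) = 15 and window_diff(lists[1], lists[2]) = 5, so:

tot[0] = 25 + 15 = 40
tot[1] = 25 + 5  = 30
tot[2] = 15 + 5  = 20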

I wanted to know if there was a quick route to doing this in pandas or numpy. Currently I am using a very long-winded process of for-looping through each of the lists and then comparing element by element, shifting the shorter list along the longer one.

My approach works fine for short lists, but my dataset is 10,000 lists long and some of these lists contain 60 or so data points, so speed is a criterion here. I was wondering if numpy or pandas has something suited to this? Thanks

Sample problem data

import random
lists = [[random.randint(0, 1000) for r in range(random.randint(0, 60))] for x in range(100000)]
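If you want timing runs to be comparable across attempts, seeding the generator first (just a suggestion, not part of the original setup) keeps the sample data reproducible:

random.seed(0)  # fixed seed purely so benchmark runs see the same data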
kPow989
  • The problem is not completely specified and it is not very obvious. It seems that you take the L1 (abs) distance between windows, as that is the only thing that explains `5 - 10 -> 5`. But it is not clear to me what your expected behaviour is for the last window with just one element. Would you repeat it 3 times, or just 2, padding the beginning and end of it with zeros? What would happen when you compare the 2nd and 3rd rows? Would you slide 2 x 3 times, or just slide the 3rd window over the 2nd one 2 times? – Imanol Luengo Jun 06 '17 at 08:01
  • Hi Imanol, I have updated the question to address that we are taking abs difference – kPow989 Jun 06 '17 at 08:49

2 Answers


Steps:

  • For each pair of lists from the input list of lists, create sliding windows over the longer array and take the absolute difference against the shorter one in that pair. We can use NumPy strides to get those sliding windows.

  • Sum those absolute differences and store that sum as the pair-wise difference for the pair.

  • Finally, sum along each row and each column of the 2D array from the previous step; their sum is the final output.

Thus, the implementation would look something like this -

import numpy as np
import itertools

def strided_app(a, L, S=1):  # Window len = L, Stride len/stepsize = S
    a = np.asarray(a)
    nrows = ((a.size-L)//S)+1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows,L), strides=(S*n,n))

N = len(lists)
pair_diff_sums = np.zeros((N,N), dtype=type(lists[0][0]))
# one pass over each unordered pair (i, j); slide the shorter list over the longer one
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A)>len(B):
        pair_diff_sums[i,j] = np.abs(strided_app(A,L=len(B)) - B).sum()
    else:
        pair_diff_sums[i,j] = np.abs(strided_app(B,L=len(A)) - A).sum()

out = pair_diff_sums.sum(1) + pair_diff_sums.sum(0)
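To see what strided_app contributes, here is the pair (lists[0], lists[1]) from the toy data worked through (a small illustration added for clarity, not part of the original post):

strided_app([10, 15, 5], L=2)
# array([[10, 15],
#        [15,  5]])

np.abs(strided_app([10, 15, 5], L=2) - [5, 10]).sum()
# (|10 - 5| + |15 - 10|) + (|15 - 5| + |5 - 10|) = 10 + 15 = 25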

For really heavy datasets, here's one method using one more level of looping -

N = len(lists)
out = np.zeros((N), dtype=type(lists[0][0]))
for k, A in enumerate(lists):
    for B in lists:  # visits every ordered pair, so each pair's diff is computed twice
        if len(A) > len(B):
            out[k] += np.abs(strided_app(A, L=len(B)) - B).sum()
        else:
            out[k] += np.abs(strided_app(B, L=len(A)) - A).sum()

strided_app is inspired by this post.

Sample input, output -

In [77]: lists
Out[77]: [[10, 15, 5], [5, 10], [5]]

In [78]: pair_diff_sums
Out[78]: 
array([[ 0, 25, 15],
       [25,  0,  5],
       [15,  5,  0]])

In [79]: out
Out[79]: array([40, 30, 20])
Divakar
    You a wizard :) – Imanol Luengo Jun 06 '17 at 08:19
  • @Divakar, thanks! Your code works pretty well, but I ran into some memory issues with the triu_indices when working with the example data which I've added to the question. Thanks for the help! – kPow989 Jun 06 '17 at 08:33
  • @user3063482 @Divakar you could replace the `r, c = np.triu_indices(N)` and the for loop by `for i, j in itertools.combinations(range(N), 2)`. [itertools.combinations](https://docs.python.org/3/library/itertools.html#itertools.combinations) does not precalculate all the indices into the memory and yields them lazily (it is essentially a python generator). You could then shift the line `pair_diff_sums[c,r] = pair_diff_sums[r,c]` inside the for loop as `pair_diff_sums[j, i] = pair_diff_sums[i,j]` – Imanol Luengo Jun 06 '17 at 08:38
  • Also, `pair_diff_sums` which is `N x N` could be omitted for large `N` and have just `diff_sums = np.zeros(N) # out` which would then be updated as `diff_sums[i] += pair_diff; diff_sums[j] += pair_diff` where `pair_diff` is the difference between row `i` and row `j`. – Imanol Luengo Jun 06 '17 at 08:43
  • @ImanolLuengo Thanks for helping out! Made some edits based on your suggestions! – Divakar Jun 06 '17 at 08:48
  • Thanks @Divakar @ImanolLuengo both for your help. Interestingly the earlier approach that Divakar posted was marginally faster.. both of them seem to run into memory errors when using the sample problem data above. – kPow989 Jun 06 '17 at 09:01
  • @user3063482 At which step are you getting memory error? – Divakar Jun 06 '17 at 09:03
  • Hi Divakar, I get the memory error at the `pair_diff_sums = np.zeros((N,N),dtype=type(lists[0][0]))` step, which I think is because it creates a 100000x100000 array – kPow989 Jun 06 '17 at 09:18
  • @user3063482 Check out the just added : `For really heavy datasets..` section. – Divakar Jun 06 '17 at 09:29
  • Hi Divakar. I really appreciate the help. I do feel that the additional layer of looping has however doubled the execution time. And when running with the heavy sample data above, my code hasn't executed yet. Thanks for all the help though! – kPow989 Jun 06 '17 at 09:38
  • @user3063482 the code takes double time because it does iterate over the whole `N x N` matrix. I did just add an answer extending @Divakar's method by iterating only over the upper triangular matrix (as Divakar suggested) but generated from itertools. – Imanol Luengo Jun 06 '17 at 10:57

Just to complete @Divakar's great answer and to apply it to very large datasets:

import itertools
import numpy as np

# strided_app is the helper defined in the answer above
N = len(lists)
out = np.zeros(N, dtype=type(lists[0][0]))

for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]

    if len(A)>len(B):
        diff = np.abs(strided_app(A,L=len(B)) - B).sum()
    else:
        diff = np.abs(strided_app(B,L=len(A)) - A).sum()

    out[i] += diff
    out[j] += diff
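With the toy lists from the question this again produces the expected totals:

out
# array([40, 30, 20])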

It does not create an unnecessarily large intermediate matrix; it updates a single output vector while iterating only over the upper triangle of the pair combinations.

It will still take a while to compute, as there is a tradeoff between computational complexity and larger-than-RAM datasets. Solutions for larger-than-RAM datasets often rely on iteration, and Python is not great at that: iterating over a large dataset in pure Python is slow, very slow.

Translating the code above to Cython could speed things up a bit.
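Short of Cython, one small tweak that stays in NumPy (an assumption on my part, not part of either answer) is to convert every list to an array once, up front, so that the np.asarray call inside strided_app becomes a cheap no-op instead of copying the same lists over and over in the pair loop:

# hypothetical pre-conversion step, done once before the combinations loop
arrays = [np.asarray(l) for l in lists]

# then index the pre-built arrays inside the loop above:
#     A, B = arrays[i], arrays[j]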

Imanol Luengo