3

I'm trying to find the differences between an element of a list and all elements of a list for each element of that list. I think that's how I'd describe it? Let's use an example instead:

idx = [1, 4, 8, 10, 22]

Based on this question, it would be simple enough to just zip the elements together, but that would result in only one pairwise comparison per element, whereas I need to make multiple pairwise comparisons for each element.

My approach was to do the following:

diffs = [abs(i-j) for i in idx for j in idx]

But this is incredibly inefficient. What is the most efficient way to calculate the differences for all of the following tuples?

comparisons = [(1,4), (1,8), (1,10), (1, 22), 
               (4,8), (4,10), (4,22), 
               (8,10, 8,22)]

NOTE:

I would prefer answers using base Python 2.7, but would also like to know how to do this with Numpy, as I'm sure that package makes it easier.

Community
  • 1
  • 1
tblznbits
  • 6,602
  • 6
  • 36
  • 66
  • Why is it inefficient? – eyllanesc Dec 31 '16 at 18:23
  • 1
    If you have a list of comparisons -- just iterate over that list. Something like `diffs = [abs(x[i]-x[j]) for i,j in comparisons]` – John Coleman Dec 31 '16 at 18:24
  • I have a list that 25,039 elements long and I need to compare all of them, which is quite time consuming. – tblznbits Dec 31 '16 at 18:25
  • @JohnColeman I don't have a list of comparisons. My first iteration at solving my problem involved generating the list of comparisons, but it felt like it was bogging the program down too much. The `comparison` variable above was just to show what pairwise comparisons I would be looking to make. – tblznbits Dec 31 '16 at 18:26
  • 1
    The set of all comparisons is quadratic in the length of the list. No way to generate them without using a quadratic algorithm. – John Coleman Dec 31 '16 at 18:27
  • @eyllanesc To give some context, calculating the diffs only just now finished, and I ran the line of code immediately after reading your comment. So it took a little over two minutes. – tblznbits Dec 31 '16 at 18:28
  • Is the list named `x` or `idx`? – John Coleman Dec 31 '16 at 18:30
  • You can cut your time by a little more than half by starting with the frequent difference zero then subtracting each number from the ones that *come after it* in the list. That would require a little use of indices and/or slices which would slightly slow the code back down, but you should try this approach. I assume you have the original list and not the list of 2-tuples. – Rory Daulton Dec 31 '16 at 18:31
  • @brittenb You have tried with x = [1, 4, 8, 10, 22] or with another date. – eyllanesc Dec 31 '16 at 18:32
  • @JohnColeman My apologies, question has been updated. – tblznbits Dec 31 '16 at 18:33
  • Do you really need all comparisons in memory at once? A generator approach is natural. – John Coleman Dec 31 '16 at 18:35
  • @JohnColeman It's unlikely that I need all comparisons in memory at once. But I'm relatively new to Python, so I'm not sure of the best way to go about this, which is why I'm posting here. I'm hoping someone can highlight a more Pythonic way of approaching the problem. – tblznbits Dec 31 '16 at 18:39
  • Perhaps you should question why you want to calculate all those differences. – Karoly Horvath Dec 31 '16 at 18:41
  • @KarolyHorvath I've updated the question to include all of the details of the problem I am trying to solve. Please let me know if there is more information I can provide. – tblznbits Dec 31 '16 at 18:51
  • @JohnColeman I don't believe the question changes from what was previously stated to only needing to compare each element with its neighbor. Unless I've grossly overestimated the solution to this problem. I believe I still have to compare each element to every other element after it, which you've highlighted in your answer below. I'm currently looking into `yield` to determine if it is what I need, although it looks as though it is. – tblznbits Dec 31 '16 at 18:56
  • Certainly, you've already tried [`itertools.combinations`](https://docs.python.org/3/library/itertools.html#itertools.combinations)? – tobias_k Dec 31 '16 at 18:58
  • You have brought a `K` into the problem, which means that it is `O(n+K)` rather than `O(n^2)` -- that does change the problem. By the way -- in your edit you said that you were comparing `id[i]` with `idx[i+1]` which is comparing each element with its neighbor. – John Coleman Dec 31 '16 at 18:59
  • Also -- computing the differences isn't the way to go. Keep track of the number of 1's in a sliding window of width K. How many new solutions do you pick up when you slide the window one place? An `O(n)` solution is possible. – John Coleman Dec 31 '16 at 19:03
  • 1
    So it's an [XY-problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378)? – MSeifert Dec 31 '16 at 19:11
  • @MSeifert Oh.. hmm.. yes, it appears so. I had never heard that term before and after all of the comments, and my updates, it definitely appears that's what it is. The question is obviously poorly worded, and I apologize. – tblznbits Dec 31 '16 at 19:14
  • 2
    @brittenb no need to apologize, I think everyone had one XY problem at some point. But to actually resolve your problem you could do either (a) rollback to your original question and _optionally_ accept an answer and then create a new question about your actual problem (b) delete the question altogether and post a new one (c) completly rephrase this question so it contains only parts of your _real_ problem. I wouldn't recommend the last one because it invalidates the already given answers. But it's ultimatly your choice. – MSeifert Dec 31 '16 at 19:21
  • @MSeifert Going with option a) – tblznbits Dec 31 '16 at 19:25

2 Answers2

4

You can write a generator:

idx = [1, 4, 8, 10, 22]

def differences(nums):
    n = len(nums)
    for i in xrange(n-1):
        for j in xrange(i+1,n):
            yield abs(nums[i]-nums[j])

for d in differences(idx): print d

Output:

3
7
9
21
4
6
18
2
14
12

This produces the differences one by one, with very little memory overhead. Also, using the idea of @RoryDaulton , this doesn't bother computing differences that are redundant or known to be zero.

John Coleman
  • 51,337
  • 7
  • 54
  • 119
  • `doesn't bother computing differences that are known to be zero` - Are you sure? You are making assumptions about the input list. – helloV Dec 31 '16 at 19:31
  • @helloV What I meant by that is to not compute differences of the form `abs(dx[i]-dx[i])`. As long as these are finite numbers (a reasonable assumption since the original sample data consisted of integers) such numbers are always 0. `d(x,y) = abs(x,y)` is a metric on the set of integers and the set of real numbers, hence it satisfies `d(x,x) = 0` and `d(x,y) = d(y,x)`. You are correct that I am making *some* assumptions on the data, but in this case they seemed to be implicit in the statement of the problem. – John Coleman Dec 31 '16 at 19:36
  • I was just nitpicking - if the list has duplicate elements. – helloV Dec 31 '16 at 19:40
  • @helloV Fair enough -- I should have said something like I wasn't included differences which are *automatically* zero. Of course if you want the *set* of differences it would make sense to first filter any redundancy out of the original list. – John Coleman Dec 31 '16 at 19:43
  • Thanks for your explanation. Use `xrange` instead of `range` :-) – helloV Dec 31 '16 at 19:52
  • @helloV Good point -- I mostly use Python 3 and sometimes forget those little Python 2 details. I'll change it. Thanks. – John Coleman Dec 31 '16 at 19:56
3

Note that your result will actually hold n*n elements. So if your length is 25,039 (as stated in the comments) you have to take care not to hit a MemoryError with these 630 million elements.

One way would be to go and choose a small enough dtype:

import numpy as np
arr = np.array(your_list, dtype=np.int16)
diffs = arr[:, None] - arr   # this will be a 25039 x 25039 array containing your desired differences.

Even using np.int16 you'll need a lot of memory (I used an array size of 25000 here):

>>> diffs.nbytes
1_250_000_000

If you're operating on int32 you'll roughly need 2.5GB to hold the result - with longs (pythons default if you're not in the unlimited precision integer range) this goes up to 5GB ignoring any python object overhead. Note that the range of values for int16 is only −32768 to 32767 so you may need to choose a bigger dtype if required.

Besides the memory-issues there is no way to beat numpy in terms of efficiency for such operations. It took roughly 1.2s on my computer to calculate the 25k x 25k array of differences.


If you need the number of occurences where the difference is smaller than some value, then numpy again provides an easy solution:

num = np.sum(np.abs(diffs) < some_value) 
MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • I crashed my computer trying to build such a list with a naïve Python list comprehension (I'm not sure why the interpreter didn't raise a memory error -- it just became nonresponsive). – John Coleman Dec 31 '16 at 19:19
  • Seems like you use swap memory. Non-responsive is usually the result of (the extremly slow process of) transferring data from RAM to swap memory (and vise-versa). – MSeifert Dec 31 '16 at 19:23