4

I don't know how to compute the intersection of these two arrays:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

result = [[288, 24], [193, 12]]

So the intersection is by the first element, and the second elements are summed. Any ideas on how to do this efficiently?

OK, I made a mistake by not explaining what I mean by efficient, sorry. Consider the following naive implementation:

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
result = {}
for i, j in a:
    for k, l in b:
        if i == k:
            result[i] = j + l
print result

So I was trying to find a more efficient, more Pythonic solution to my problem. That's why I needed your help.

Try these test cases (my code is also in it):

Test Case

Running time: 28.6980509758 seconds
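The paste behind that link isn't reproduced here, but a rough sketch of how such a timing can be taken (the list sizes and value ranges below are illustrative assumptions, not the actual test case) looks like this:

# Hypothetical benchmark sketch: list sizes and value ranges are assumptions,
# not the contents of the linked test case.
import random
import time

n = 10000
a = [[random.randint(0, 50000), random.randint(1, 25)] for _ in range(n)]
b = [[random.randint(0, 50000), random.randint(1, 25)] for _ in range(n)]

start = time.time()
result = {}
for i, j in a:            # the naive nested loop from above: O(len(a) * len(b))
    for k, l in b:
        if i == k:
            result[i] = j + l
print(time.time() - start)  # elapsed time in seconds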

badc0re
  • What should the behavior be if there are two matching elements in a? Should they be summed too? For example, `a = [[100, 1], [100, 2]]`, `b = [[50, 1]]` – Brionius Aug 21 '13 at 15:37
  • The elements within a are distinct, and so are those within b. Elements of a might occur in b. – badc0re Aug 21 '13 at 15:39
  • Are you saying there will never be duplicates within a or within b? – Brionius Aug 21 '13 at 15:39
  • I don't know that I would keep this data as a list of lists. Perhaps a dictionary, or better yet a Counter, would make more sense. – cmd Aug 21 '13 at 16:11
  • If you are using the word "efficiently" seriously, what alternatives have you tried, and why was their performance inadequate? Also, and very importantly, what is the size of the dataset? – Mario Rossi Aug 21 '13 at 16:14
  • @badc0re There seems to be an argument about using expressive (and easy to write, test, and understand) language features vs. efficiency (very important too). Why don't you help us settle this in this particular case? Can you run and time a test of the different answers and publish it? – Mario Rossi Aug 21 '13 at 16:25
  • I have added a test case that is good enough for testing. – badc0re Aug 21 '13 at 16:51
  • @badc0re And what are your expectations? Why is 29 sec not good? This is very important. Most experienced developers will incrementally improve performance (and decrease readability) to the point where performance is acceptable and **not a single bit** more. They value their time and the time of other developers who could come across their code. – Mario Rossi Aug 21 '13 at 17:06
  • Well, what do you mean by my expectations? As I said, I am trying to solve my problem in an efficient way (better performance). Sure, there are trade-offs in readability; that's why I wanted different solutions, to see which one fits my case best (and certainly not the naive solution I coded). Also, my code takes about 28.6980509758 seconds, and that's too much time. – badc0re Aug 21 '13 at 17:12

10 Answers

4

This data seems like it would be better stored as a dictionary

da = dict(a)
db = dict(b)

Once you have it like that, you can just do:

result = [[k, da[k] + db[k]] for k in set(da.keys()).intersection(db.keys())]

Or, in Python 2.7, you can just use viewkeys instead of a set:

result = [[k, da[k] + db[k]] for k in da.viewkeys() & db]
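For the sample a and b above, either form yields [[193, 12], [288, 24]] (the order may vary, since the keys come from a set-like view).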
cmd
3
result = []
ms, mb = (dict(a), dict(b)) if len(a) < len(b) else (dict(b), dict(a))
for k in ms:
    if k in mb:
        result.append([k, ms[k] + mb[k]])
FUD
  • Sorry, I realized I missed the [] on the last statement. – FUD Aug 21 '13 at 17:18
  • Also added a nice optimization in case you have data where one array is really small; just my 2 cents. – FUD Aug 21 '13 at 17:34
  • That's pretty quick at 0.0067 seconds on my laptop, with the results verified to match the original. I think it's nicely readable, too. Well done! – Kirk Strauser Aug 21 '13 at 17:37
2

Use a Counter:

from collections import Counter

c_sum = Counter()
c_len = Counter()
for elt in a:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

for elt in b:
    c_sum[elt[0]] += elt[1]
    c_len[elt[0]] += 1

print [[k, c_sum[k]] for k, v in c_len.iteritems() if v > 1]
guyrt
2

Here you go

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]
inter = []
for e in a:
    for e2 in b:
        if e[0] == e2[0]:
            inter.append([e[0], e[1]+e2[1]])
print inter

Outputs

[[193, 12], [288, 24]]
mavili
  • Maybe you missed the word "efficiently" in the question? I don't think an O(n^2) algorithm qualifies. – Mark Ransom Aug 21 '13 at 15:41
  • I think the OP used the word "efficiently" just out of habit. Or ignorance. If he really had meant it, he would have specified the size of the lists, the general time constraints, what other solutions he had tried with timing information, and even why he's using Python instead of other languages. – Mario Rossi Aug 21 '13 at 16:10
  • It's quite possible that the OP isn't that concerned about the time complexity of the solution, as they haven't shown any attempt to improve on a basic solution and instead just ask a general question about their problem. – mavili Aug 21 '13 at 16:53
2

This solution works if you also want duplicate items within the lists to be counted.

from collections import defaultdict

a = [[125, 1], [193, 1], [288, 23]]
b = [[108, 1], [288, 1], [193, 11]]

d = defaultdict(int)     # summed second elements per key
seen = defaultdict(int)  # how many entries carried each key

for value, num in a+b:
    d[value] += num
    seen[value] += 1
result = filter(lambda x: seen[x[0]] > 1, d.items())
result = map(list, result)  # If it's important that the result be a list of lists rather than a list of tuples
print result
# [[288, 24], [193, 12]]
Brionius
2

In the first place, Python does not have arrays; it has lists. It's just a matter of naming, but it can be confusing. The one-liner for this is:

[[ae[0], ae[1] + be[1]] for be in b for ae in a if be[0] == ae[0]]
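For the sample lists this evaluates to [[288, 24], [193, 12]].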

PS: As you say "intersection", I assume the lists are set-like ("bags", really) and that, as bags, they are properly normalized (i.e. they don't have repeated elements/keys).

Mario Rossi
  • One-liners are overrated; this is particularly inefficient. – FUD Aug 21 '13 at 15:50
  • When you understand them, they have an important place. And if you are so worried about performance, what are you doing developing in Python? Its main advantage over other languages is expressiveness, not performance. Please don't go around discrediting an important part of the language and preventing people from learning about it. Let them draw their own conclusions. In this particular case, I don't think the performance difference is much. Did you time them? – Mario Rossi Aug 21 '13 at 16:02
  • This fails to answer the question posed by the OP. This is not efficient. Making arguments against the language does not mitigate the fact that your solution is an inefficient algorithm. – cmd Aug 21 '13 at 16:07
  • Efficiency is relative to the problem. If for you "expressive" and "not as efficient as other languages" are arguments against Python, then you are right. For me, the first is in favor of Python and the second is just a statement of fact. – Mario Rossi Aug 21 '13 at 16:29
  • I would agree that for small data sets your solution works fine and is expressive. However, the OP wrote several times that an efficient algorithm was requested, leading me to believe that these lists can be very large. The big-O complexity of your algorithm is O(n*m); better can be done while still remaining expressive. I also disagree with the statement that just because someone is coding in Python they don't care about performance. – cmd Aug 21 '13 at 16:38
  • My answer ran 786 times faster than OP's, while being more readable. I'd say that performs better by most metrics. – Kirk Strauser Aug 21 '13 at 17:13
  • @Kirk We still don't know what the OP's expectations are. However, I've always liked your answer, Kirk. It's well constructed, and it uses `defaultdict` and `sum`. Good usage of high-level constructs that could give the OP some material to read. – Mario Rossi Aug 21 '13 at 17:57
  • @MarioRossi Thanks, and I'm really a huge fan of list comprehensions like yours where appropriate. My work code is sprinkled with them. – Kirk Strauser Aug 21 '13 at 18:08
2

Here is how I would approach it, assuming uniqueness on a and b:

k = {} # store totals
its = {} # store intersections
for i in a + b:
    if i[0] in k:
        its[i[0]] = True
        k[i[0]] += i[1]
    else:
        k[i[0]] = i[1]
# then loop through intersections for results
result = [[i, k[i]] for i in its]
Faisal
  • Also great performance (0.0174601078033 seconds), but I don't get the its[i[0]] = True part. – badc0re Aug 21 '13 at 17:05
  • When it hits the key/id the second time, it means it's intersected, so we store it instead of doing another loop just to find out whether it's an intersection. – Faisal Aug 21 '13 at 17:06
2

I got:

from collections import defaultdict
d = defaultdict(list)
for series in a, b:
    for key, value in series:
        d[key].append(value)
result2 = [(key, sum(values)) for key, values in d.iteritems() if len(values) > 1]

which runs in O(len(a)+len(b)) time, or about 0.02 seconds on my laptop vs. 18.79 seconds for yours. I also confirmed that it returned the same results as result.items() from your algorithm.

Kirk Strauser
1

This solution might not be the fastest, but it's probably the simplest implementation, so I decided to post it, for completeness.

from collections import Counter

aa = Counter(dict(a))
bb = Counter(dict(b))
cc = aa + bb
cc
=> Counter({288: 24, 193: 12, 108: 1, 125: 1})

list(cc.items())
=> [(288, 24), (193, 12), (108, 1), (125, 1)]

If you must only include the common keys:

[(k, cc[k]) for k in set(aa).intersection(bb)]
=> [(288, 24), (193, 12)]
shx2
1

numpy searchsorted(), argsort(), and intersect1d() are possible alternatives and can be quite fast. This example should also take care of the non-unique first element issue.

>>> import numpy as np
>>> a=np.array([[125, 1], [193, 1], [288, 23]])
>>> b=np.array([[108, 1], [288, 1], [193, 11]])
>>> aT=a.T
>>> bT=b.T
>>> aS=aT[0][np.argsort(aT[0])]
>>> bS=bT[0][np.argsort(bT[0])]
>>> i=np.intersect1d(aT[0], bT[0])
>>> cT=np.hstack((aT[:,np.searchsorted(aS, i)], bT[:,np.searchsorted(bS, i)]))
>>> [[item,np.sum(cT[1,np.argwhere(cT[0]==item).flatten()])] for item in i]
[[193, 12], [288, 24]]  # not quite happy about this; can someone come up with a vectorized way of doing it?
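One possible fully vectorized variant (a sketch built on np.unique, np.bincount and np.in1d, not the original answer's code; it assumes the first elements are unique within each array):

# Vectorized sketch: sum the second column per key, then keep only keys
# present in both arrays. Assumes keys are unique within each of a and b.
import numpy as np

a = np.array([[125, 1], [193, 1], [288, 23]])
b = np.array([[108, 1], [288, 1], [193, 11]])

ab = np.concatenate((a, b))                            # stack both (key, value) arrays
keys, inv = np.unique(ab[:, 0], return_inverse=True)   # distinct keys and each row's key index
sums = np.bincount(inv, weights=ab[:, 1]).astype(int)  # summed second column per distinct key
common = np.in1d(keys, np.intersect1d(a[:, 0], b[:, 0]))  # mask of keys present in both arrays
print(np.column_stack((keys[common], sums[common])))
# prints the two common keys with their summed values: 193 -> 12, 288 -> 24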
CT Zhu
  • Performance will depend on the dimensions of the `a` and `b` arrays; see this post: http://stackoverflow.com/questions/3650194/are-numpys-math-functions-faster-than-pythons – CT Zhu Aug 21 '13 at 17:25