2

In the following problem I have a nested list wp. The N lists in wp contain integer numbers. I want to compute as fast as possible the average number of pairwise different elements between the lists, e.g.

N = 3
wp[0] = [0,1,2]
wp[1] = [0,1]
wp[2] = [3]
--> Different elements (wp[0],wp[1])=1    
--> Different elements (wp[0],wp[2])=4    
--> Different elements (wp[1],wp[2])=3
---> Avg. pairwise different elements = (1+4+3)/3=2.666    

Currently my code is the following:

avg_distance = 0.0
for x1 in range(N):
    for x2 in range(x1 + 1, N):
        distance_temp = 1.0 * len(list(set(wp[x1]) ^ set(wp[x2])))
        avg_distance += distance_temp
avg_distance = 1.0 * avg_distance / (1.0 * N * (N - 1.0) / 2.0)

This is by far the most significant bottleneck in my code. I'm wondering if it could be done faster? I actually use this part within a function that has a numba decorator. Thanks!

HighwayJohn
  • 881
  • 1
  • 9
  • 22

1 Answers1

1

Following 2x faster on shorter lists, 3X on longer lists.

Revised Code

from itertools import combinations as comb

def calc_revised(wp):
  N = len(wp)
  return sum(len(x ^ y) for x, y in comb([set(w) for w in wp], 2))*2/(N*(N-1))

Testing

Posted code as a function

def calc_posted(wp):
  avg_distance = 0.0
  N = len(wp)
  for x1 in range(N):
      for x2 in range(x1 + 1, N):
          distance_temp = 1.0 * len(list(set(wp[x1]) ^ set(wp[x2])))
          avg_distance += distance_temp
  avg_distance = 1.0 * avg_distance / (1.0 * N * (N - 1.0) / 2.0)
  return avg_distance

Timing test original wp (using timeit)

# Data
wp = [[0,1,2], [0,1], [3]]

# Timing
from timeit import timeit
count = 100000
print(timeit(lambda:calc_posted(wp), number=count))
print(timeit(lambda:calc_revised(wp), number=count))
# Output
# 2.2190305839994835 for calc_posted
# 1.1981851930004268 for calc_revised
#
# Revised code ~2X faster

Timing test wp with 1K random sublist, random lengths from 1 to 10

# Data Generation
from random import sample
from random import randint
T = 1000
wp = [sample(range(1, 100), randint(1, 10)) for _ in range(T)]

# timing
from timeit import timeit
count = 1
print(timeit(lambda:calc_posted(wp), number=count))
print(timeit(lambda:calc_revised(wp), number=count))
# Output:
# 3.8808193380000375 for calc_posted
# 1.2002467920001436 for calc_revised
#
# Revised code ~3X faster

List Comprehension

Version based upon list comprehension.

Timing was comparable to calc_revised

def calc_list_comp(wp):
  N = len(wp)
  set_w = [set(w) for w in wp]
  return sum(len(set_w[i] ^ set_w[j]) for i in range(len(set_w)) for j in range(i+1, N))*2/(N*(N-1))
DarrylG
  • 16,732
  • 2
  • 17
  • 23
  • Thanks, this is really fast. Unfortunately it seems that numba does not support itertools yet. Maybe I can still use your suggestions but I have to think about it. Maybe you know what to do about this? Can the code be modified, such that it works in numba as well? – HighwayJohn Jul 02 '20 at 21:48
  • @HighwayJohn--in my past attempts I have found it difficult to get numba to run on code for one reason or another. In this case, [based upon this query](https://stackoverflow.com/questions/61262188/numba-safe-version-of-itertools-combinations) it looks like numba is much slower on a version of combinations that is compatible with numba. – DarrylG Jul 02 '20 at 23:35
  • @HighwayJohn--I added a version based upon list comprehension. It has comparable timing to calc_revised, but doesn't depend upon itertools.combinations. – DarrylG Jul 02 '20 at 23:59
  • @HighwayJohn--does that mean you got it to work with Numba? That's amazing considering my previous Numba attempts with other problems. – DarrylG Jul 03 '20 at 10:10
  • Yes, within numba I can use a slightly modified version of calc_list_comp(wp) :) – HighwayJohn Jul 03 '20 at 10:45