I'm currently working with large data, and an important step of the work is to generate all sorted combinations of thousands of strings.

In Python 3.4, I tried three ways to perform this operation.

First, I fake the data as a list of strings, with or without a GO term:

# -*- coding: utf-8 -*-
import random as rd

# With GO term
# data = ["GO:" + str(i) for i in range(0, 10000)]
# Or, without the GO term (use one or the other)
data = [str(i) for i in range(0, 10000)]

rd.shuffle(data)
print("shuffle done")

The GO term is just a constant string prepended to every data string. Results with and without the GO term follow below.

Now, run the benchmark:

from itertools import combinations, product

def test1(): # use condition
    g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
    print(len([(i, j) for i, j in g]))


def test2(): # use filter
    g = ((i, j) for i,j in product(data, data) if (i < j))
    print(len([(i, j) for i, j in g]))


def test3(): # sort before combine
    g = ((i,j) for i, j in combinations(sorted(data), 2))
    print(len([(i, j) for i, j in g]))


import timeit

print(timeit.timeit(stmt=test1, number=3))
print(timeit.timeit(stmt=test2, number=3))
print(timeit.timeit(stmt=test3, number=3))

Output with the GO term:

49995000
49995000
49995000
23.490827083587646
49995000
49995000
49995000
31.04393219947815
49995000
49995000
49995000
16.878661155700684

Output without the GO term:

49995000
49995000
49995000
22.99237084388733
49995000
49995000
49995000
29.025460958480835
49995000
49995000
49995000
16.596422910690308

The printed 49995000 for each call shows that all three methods produce the same number of pairs.
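
As a side check (my sketch, not part of the benchmark above), one can also verify on a small sample that the three methods yield the same pairs, not just the same count:

from itertools import combinations, product

sample = data[:200]  # small slice so the check stays fast
r1 = set((i, j) if i < j else (j, i) for i, j in combinations(sample, 2))
r2 = set((i, j) for i, j in product(sample, sample) if i < j)
r3 = set(combinations(sorted(sample), 2))
assert r1 == r2 == r3

Now the questions: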

  1. Why is the first method so quick compared to the second? Filtering with a condition inside a generator is the main way I filter data; until now, I assumed that form was heavily optimized.
  2. Why does the third method seem to be the best? Sorting is O(N*log(N)), which sounds costly on large lists.
  3. Why does the GO term impact the first and second methods more than the third?

My first guess was that sorting + combinations produces far fewer string comparisons, while the other two methods perform a comparison for each pair of strings; but since sorting sounds like a heavy operation, I'm not totally sure about that.
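
A rough way to test that guess (my sketch, not part of the benchmark above; the Counted class and the n_comparisons counter are illustrative names) is to count how many __lt__ calls each strategy actually makes:

from itertools import combinations

n_comparisons = 0

class Counted(str):
    def __lt__(self, other):
        global n_comparisons
        n_comparisons += 1
        return str.__lt__(self, other)

sample = [Counted(s) for s in data[:1000]]  # subset to keep it fast

n_comparisons = 0
sorted(sample)
print("comparisons to sort:", n_comparisons)  # roughly N*log2(N)

n_comparisons = 0
for i, j in combinations(sample, 2):
    i < j
print("comparisons for pairwise checks:", n_comparisons)  # exactly N*(N-1)/2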

aluriak

2 Answers

  1. Really not a big surprise, seeing how combinations creates only the combinations where i < j holds, while product creates all pairs, including those where i < j is not true; the if (i < j) else (j, i) in test1 is redundant. Omitting this check in test1 greatly reduces the execution time, as seen below.
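
For reference, the check-free variant is presumably just the following (a sketch; the exact code run for the second set of timings is not shown):

from itertools import combinations  # data as defined in the question

def test1():  # without the i < j check; pairs keep the order combinations yields
    g = ((i, j) for i, j in combinations(data, 2))
    print(len([(i, j) for i, j in g]))

Note that on shuffled data this no longer guarantees i < j within each pair, which is what the comment below points out.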

With the i<j check in test1:

shuffle done
49995000
49995000
49995000
31.66194307899991
49995000
49995000
49995000
37.66488860800018
49995000
49995000
49995000
22.706632076000005

Without the i<j check in test1:

shuffle done
49995000
49995000
49995000
25.07709688900013
49995000
49995000
49995000
39.405620851000094
49995000
49995000
49995000
23.54182383899979
EvenLisle
  • combinations does not create only combinations where `i < j` unless the iterable provided is sorted: `list(combinations([3,2,1], 2))` will give you `[(3, 2), (3, 1), (2, 1)]` – dting May 25 '15 at 13:01

I'm pretty sure the significant contributor to the performance difference you are observing is checking if (i < j) 49995000 times, versus sorting the list of 10000 elements, rather than the postulated sorted-vs-unsorted iterable.
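
Back-of-the-envelope numbers (mine, not from the benchmark) make the asymmetry clear:

import math

n = 10000
print(n * (n - 1) // 2)       # 49995000 conditional checks over all pairs
print(int(n * math.log2(n)))  # ~132877 comparisons to sort, up to constant factors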

combinations should be doing the same amount of work in both cases: it produces the same number of elements, and it doesn't sort its input; it yields tuples in input order (which is lexicographic only when the input is already sorted).

To properly test if sorting makes a difference either:

  1. Perform the same conditional check with the same set of data but sorted and not sorted:

    sorted_data = sorted(data)
    
    def test1():
        g = ((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
        return len([(i, j) for i, j in g])
    
    def test2():
        g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
        return len([(i, j) for i, j in g])
    
    %timeit test1()
    1 loops, best of 3: 23.5 s per loop
    
    %timeit test2()
    1 loops, best of 3: 24.6 s per loop
    
  2. Perform the test without the conditional:

    def test3():
        g = ((i,j) for i, j in combinations(sorted_data, 2))
        return len([(i, j) for i, j in g])
    
    def test4():
        g = ((i,j) for i, j in combinations(data, 2))
        return len([(i, j) for i, j in g])
    
    %timeit test3()
    1 loops, best of 3: 20.7 s per loop
    
    %timeit test4()
    1 loops, best of 3: 21.3 s per loop
    

Why is the first method so quick compared to the second? Filtering with a condition inside a generator is the main way I filter data; until now, I assumed that form was heavily optimized.

Using combinations yields fewer elements against which the conditional is checked: C(10000, 2) = 49995000 pairs for combinations versus 10000**2 = 100000000 for product.
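
This is easy to confirm for a small n (a quick sketch, not from the original post):

from itertools import combinations, product

n = 100
pairs = n * (n - 1) // 2  # C(n, 2)
assert len(list(combinations(range(n), 2))) == pairs
assert sum(1 for i, j in product(range(n), repeat=2) if i < j) == pairs
print(pairs, "pairs survive out of", n * n, "candidates from product")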

Why does the GO term impact the first and second methods more than the third one?

The first and second methods pay for the additional characters on every comparison, 49995000 and 100000000 times respectively. The third is only impacted in the comparisons needed to sort the 10000 items.
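
The per-comparison cost of a shared prefix is easy to see in isolation (a hypothetical micro-benchmark, not from the answer; string comparison has to scan past the common "GO:" prefix before the strings can differ):

import timeit

print(timeit.timeit("a < b", setup="a, b = '1234', '5678'", number=10**7))
print(timeit.timeit("a < b", setup="a, b = 'GO:1234', 'GO:5678'", number=10**7))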


After some fiddling, it seems that sorting might make some difference, but not as large as the conditional. I have no clue what causes that.

from itertools import combinations
import random as rd

data = ["{0:04d}".format(i) for i in range(0, 10000)] # Normalize str length
rd.shuffle(data)

sorted_data = sorted(data)
reversed_sorted_data = sorted_data[::-1]

def test1():
    g = list((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
    print('unsorted with conditional: ', len(g))

%timeit test1()
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# unsorted with conditional:  49995000
# 1 loops, best of 3: 20.7 s per loop

def test2():
    g = list((i,j) if (i < j) else (j,i) for i, j in combinations(sorted_data, 2))
    print('sorted with conditional: ', len(g))

%timeit test2()
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# sorted with conditional:  49995000
# 1 loops, best of 3: 19.6 s per loop

def test3():
    g = list((i,j) for i, j in combinations(data, 2))
    print('unsorted without conditional: ', len(g))
    
%timeit test3()
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# unsorted without conditional:  49995000
# 1 loops, best of 3: 15.7 s per loop

def test4():
    g = list((i,j) for i, j in combinations(sorted_data, 2))
    print('sorted without conditional: ', len(g))

%timeit test4()
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# sorted without conditional:  49995000
# 1 loops, best of 3: 15.3 s per loop

def test5():
    g = list((i,j) for i, j in combinations(reversed_sorted_data, 2))
    print('reverse sorted without conditional: ', len(g))
    
%timeit test5()
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# reverse sorted without conditional:  49995000
# 1 loops, best of 3: 15 s per loop
dting
  • As said by jonrsharpe, the difference when sorted can come from branch prediction (stackoverflow.com/q/11227809/3001761) – aluriak May 25 '15 at 12:39