I'm currently working on large data, and an important step of the work is to get all sorted combinations of thousand of strings.
In python 3.4, i try three ways for perform this operation.
First, faking the data as a list of string, with or without GO term :
# -*- coding: utf-8 -*-
import random as rd
# With GO term
# data = ["GO:" + str(i) for i in range(0, 10000)]
# Xor, without GO term
data = [str(i) for i in range(0, 10000)]
rd.shuffle(data)
print("shuffle done")
The GO term is just a constant string added to the beginning of all data strings. Results with and without GO term will follows.
Now, perform the benchmark:
from itertools import combinations, product
def test1(): # use condition
g = ((i,j) if (i < j) else (j,i) for i, j in combinations(data, 2))
print(len([(i, j) for i, j in g]))
def test2(): # use filter
g = ((i, j) for i,j in product(data, data) if (i < j))
print(len([(i, j) for i, j in g]))
def test3(): # sort before combine
g = ((i,j) for i, j in combinations(sorted(data), 2))
print(len([(i, j) for i, j in g]))
import timeit
print(timeit.timeit(stmt=test1, number=3))
print(timeit.timeit(stmt=test2, number=3))
print(timeit.timeit(stmt=test3, number=3))
Output with the GO term:
49995000
49995000
49995000
23.490827083587646
49995000
49995000
49995000
31.04393219947815
49995000
49995000
49995000
16.878661155700684
Output without the GO term:
49995000
49995000
49995000
22.99237084388733
49995000
49995000
49995000
29.025460958480835
49995000
49995000
49995000
16.596422910690308
The printed 49995000 for each call proves us that all methods produce the same amount of data. Now the questions :
- why is the first method so quick, compared to the second ? Use of filter is the main way i use for filtering data. Until now, i assume that the filter form was heavily optimized.
- Why the third method seems to be the better ? Sorting is O(N*log(N)), its seems to be something like costly on large lists ?
- Why the GO term impact more the first and the second method than the third one ?
My first guess was that sorting + combinations produce far less comparisons of data, while the two other methods perform a comparison for each couple of string, but, as sorting sounds like an heavy operation, i'm not totally sure about that.