
I'm running code that consists of finding average values. In a ~6-million-line CSV (ssm_resnik.txt), the first column is one reference, the second column is another reference, and the third column is the 'distance' between the two references. The distances are arbitrarily defined by biological criteria that are not important for this issue. Most references are paired against most of the other references, hence the huge CSV with more than 6 million lines. In another CSV (all_spot_uniprot.txt) I have ~3600 spots (first column), each of them with 1 or more references (third column). The references are the same ones that appear in the huge CSV. I need to compare each of the ~3600 spots of the second file with all of the other 3600-1 spots in the same file. All of the possible reference combinations, if they exist, are in the first, huge CSV file (ssm_resnik.txt). For all_spot_uniprot.txt, each pair of spots serves as an iterator over the corresponding references (in the third column), and for each pair of references I iterate over the huge CSV file, which, if the pair exists, gives the value for the two references 'versus' each other.

What is the problem with my code? Well... at about 10 seconds per iteration, 3600 * 3600 * 10 = 129,600,000 seconds = 1,500 days (over 4 years). This happens on my Core i3, and also on a Mac. Below are my code and a portion of each file. Please advise me: is there any code design flaw? Is there some way to reduce the computation time? Thanks in advance...

import csv

spot_to_uniprot=open('all_spot_uniprot.txt', 'rbU')
STU=csv.reader(spot_to_uniprot, delimiter='\t')

uniprot_vs_uniprot=open('ssm_resnik.txt', 'rbU')
allVSall= csv.reader(uniprot_vs_uniprot, delimiter='\t')

recorder=open('tabela_final.csv', 'wb')
fout=csv.writer(recorder, delimiter='\t', quotechar='"')

dict_STU={} #dictionary 'spot to uniprot'
dict_corresp={} #for each pair of uniprot refs as key, the value is
#a list of two lists: the first list for one spot and the second for the next spot
dict_corresp_media={} ##average of one spot versus another
total_correspondencias_x_y=[]
_lista_spot=[]
lista_spot=[]
lista_temp=[]
lista_CSV=[]

for a in STU:
    _lista_spot.append(int(a[0]))
    if a[0] not in dict_STU.keys():
        dict_STU[a[0]]=[]
        dict_STU[a[0]].append(a[2])
    else:        
        dict_STU[a[0]].append(a[2])
n_spot=max(_lista_spot)

spot_to_uniprot.close()

##for aa in _lista_spot:
##    lista_spot.append(int(aa))
##lista_spot.sort()

for i in allVSall:
    lista_CSV.append(i)
tuple_CSV=tuple(lista_CSV)

uniprot_vs_uniprot.close()

for h in range(1, n_spot):
    for _h in range(h+1, n_spot+1):
        #print h, 'h da lista_spot'
        del total_correspondencias_x_y[:]
        total_correspondencias_x_y.append(dict_STU[str(h)])
        #print h, 'h'
        #print _h, '_h'
        #print __h, '__h'
        total_correspondencias_x_y.append(dict_STU[str(_h)])
        print total_correspondencias_x_y, 'total_corresp_x_y'

        for c1 in total_correspondencias_x_y[0]:
            if c1=='No Data':
                pass
            else:
                for c2 in total_correspondencias_x_y[1]:
                    if c2=='No Data':
                        pass
                    else:
                        #print c1, c2, 'c1 e c2'
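                        # note: this inner loop rescans the entire ~6-million-row list for every (c1, c2) pair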
                        for c3 in lista_CSV:
                            if c1 in c3[0]:
                                if c2 in c3[1]:
                                    lista_temp.append(c3[2])
                                    print lista_temp, 'lista_temp'

        elements=len(lista_temp)
        if len(lista_temp)==0:
            dict_corresp_media[str(h)+'_'+str(_h)]=0
        else:
            temp_d=0
            for d in lista_temp:
                temp_d +=float(d)
            media_spots=temp_d/elements
            dict_corresp_media[str(h)+'_'+str(_h)]=media_spots

        print dict_corresp_media[str(h)+'_'+str(_h)]
        lista_temp=[]

recorder.close()

This is a portion of my files:

all_spot_uniprot.txt

1   spr0001 Q8DRQ4
1   SP0001  O08397
1   SPN01072    B5E568
2   spr0002 P59651
2   SP0002  O06672
2   SPN01074    B5E569
3   spr0005 Q8DRQ2
3   SP0005  Q97TD1
3   SPN01078    B5E572
4   spr0006 Q8DRQ1
4   SP0006  Q97TD0
4   SPN01079    B5E573
5   spr0009 Q8DRQ0
5   SP0009  Q97TC7
6   spr0010 Q8DRP9
6   SP0011  Q97TC5
6   SPN01085    B5E578
7   spr0012 P59652
7   SP0013  O69076
7   SPN01087    B5E580
8   spr0017 Q8DRP6
8   SP0017  No Data
8   SPN01090    B5E5G4
9   spr0020 Q8CZD0
9   SP0018  Q97TC2
9   SPN01093    B5E5G7
10  spr0021 P65888
10  SP0019  P65887
..  ......  ......  ......
..  ......  ......  ......
3617    spr2016 Q8DMY7
3617    spr0324 Q8DR62
3617    SP2211  No Data
3617    SP1311  No Data
3617    SP1441  No Data
3617    SPN11022    No Data
3617    SPN01038    No Data
3617    SPN08246    No Data
3618    spr2018 Q8DMY5
3618    SP0812  No Data
3618    SP2213  No Data
3618    SPN04196    B5E3J0
3618    SPN01040    B5E3V9
3619    spr2040 Q8DMW6
3619    SP2234  Q97N38
3619    SPN01065    B5E462
3620    spr2043 P60243

ssm_resnik.txt

Q8DRQ4  O08397  1.0
Q8DRQ4  B5E568  1.0
Q8DRQ4  P59651  0.12077157944440875
Q8DRQ4  O06672  0.12077157944440875
Q8DRQ4  B5E569  0.12077157944440875
Q8DRQ4  Q8DRQ1  0.12077157944440875
Q8DRQ4  Q97TD0  0.12077157944440875
Q8DRQ4  B5E573  0.12077157944440875
Q8DRQ4  Q8DRP9  0.07139907404780385
Q8DRQ4  Q97TC5  0.07139907404780385
Q8DRQ4  B5E578  0.07139907404780385
Q8DRQ4  P59652  0.04789965413510797
Q8DRQ4  O69076  0.04789965413510797
Q8DRQ4  B5E580  0.04698170092888175
Q8DRQ4  Q8DRP6  0.12077157944440875
Q8DRQ4  P65888  0.05619465373456477
Q8DRQ4  P65887  0.05619465373456477
Q8DRQ4  B5E5G8  0.05619465373456477
Q8DRQ4  Q8DRP3  0.0115283466875553
Q8DRQ4  Q97TC0  0.0115283466875553
Q8DRQ4  B5E5G9  0.0115283466875553
Q8DRQ4  Q8DRP2  0.05619465373456477
Q8DRQ4  Q97TB9  0.05619465373456477
Q8DRQ4  B5E5H1  0.05619465373456477
Q8DRQ4  Q8DRP0  0.12077157944440875
Q8DRQ4  B5E5H3  0.12077157944440875
Q8DRQ4  Q8DNI4  0.12077157944440875
Q8DRQ4  Q8CWP0  0.12077157944440875
Q8DRQ4  Q97CV3  0.12077157944440875
Q8DRQ4  Q97P52  0.12077157944440875
O08397  Q97PH8  0.12077157944440875
O08397  P59200  0.10979991157849654
O08397  P59199  0.10979991157849654
O08397  B5E5I1  0.12077157944440875
O08397  Q8DRN5  0.047725405544042546
O08397  Q97TA8  0.047725405544042546
O08397  B5E5I4  0.047725405544042546
O08397  Q8DRN4  0.1555714706579846
O08397  Q97TA7  0.1555714706579846
O08397  B5E5I5  0.1555714706579846
O08397  Q97TA6  0.02938784938305615
O08397  Q8DRN2  0.02938784938305615
O08397  Q9F7T4  0.02938784938305615
O08397  P59653  0.04191624792292713
O08397  Q03727  0.04191624792292713
O08397  B5E5J1  0.045754049904644475
O08397  P59654  0.01167129073292015
O08397  P36498  0.01167129073292015
O08397  B5E5J2  0.0
O08397  Q8DRM7  0.05619465373456477
O08397  Q07296  0.05619465373456477
O08397  B5E5J3  0.05619465373456477
O08397  Q97TA3  0.05619465373456477
O08397  B5E5J5  0.05619465373456477
O08397  Q97T99  0.05619465373456477
O08397  Q8DRL9  0.05619465373456477
O08397  Q97T96  0.05619465373456477
O08397  B5E5K1  0.05619465373456477
O08397  Q97T95  0.05619465373456477
O08397  Q8DRL7  0.05619465373456477
BioInfoPT
  • See if you can move this to a database with some well-picked indices, then recast your problem as a few SQL queries. – Martijn Pieters Apr 19 '13 at 20:50
  • Python is an interpreted language, you've got boatloads of data, and this is an `O(N^2)` problem. While there might be something wrong with your design, the overall processing time is going to be constrained by those facts. – Robert Harvey Apr 19 '13 at 20:50
  • From a quick glance, all those nested `for`s scream of a computational complexity `O(scary)`, which, with a nontrivial amount of data and Python's speed, can definitely produce the long computation time you see. From the description I'm not getting exactly what you are trying to do, but probably there are better algorithms... – Matteo Italia Apr 19 '13 at 20:50
  • We seem to get a lot of these questions lately from bioinformatics folks who abuse flat files as really badly implemented DBs. The problem description isn't that clear to me on the first two glances, so maybe an algorithmic improvement is also possible, but the obvious solution really seems to be to use a DB. – Voo Apr 19 '13 at 20:51
  • @MatteoItalia how about `O(my_god)`? ;) – Bitwise Apr 19 '13 at 20:51
  • Alternatively, using a dictionary for the 6 million numbers won't take much memory; the key is a tuple of the first two columns. Once in memory, looking up any given key is going to be O(1). Creating an `itertools.product` loop over the rows of the other file will be a doozy. – Martijn Pieters Apr 19 '13 at 20:51
  • Six million records would fit quite easily into a SQLite file. – Robert Harvey Apr 19 '13 at 20:52
  • @Bitwise: that's it, that's my new favorite computational complexity. :) – Matteo Italia Apr 19 '13 at 20:53
  • You might also try running it in pypy, which could give you an order of magnitude speed up on its own. But you should still certainly follow the other suggestions as well. – mattg Apr 19 '13 at 21:07
  • are c3[0] and c3[1] lists? if so, try changing them to dicts before the triple nested loop. – EHuhtala Apr 19 '13 at 21:30
  • can you share a larger data set? – EHuhtala Apr 19 '13 at 21:55
  • @EHuhtala: The character limit is 30,000 chars. I have posted some data for the first file, namely the last spots. The last file is artificially modified: the first column really has ~3500 lines for each reference, so I put some values of the next references in place of the first one, just to give you an idea of the data pattern, because the data for even the first reference alone exceeds the maximum character limit granted by Stack Overflow. Thanks. – BioInfoPT Apr 20 '13 at 00:57
  • well, I advise you go with Martijn's answer, as that's the direction I was headed anyway. – EHuhtala Apr 20 '13 at 01:24

4 Answers


Six million rows can be held either in memory or in a SQLite database. Put them there and make use of the lookup optimizations that either option offers:

with open('ssm_resnik.txt', 'rbU') as uniprot_vs_uniprot:
    reader = csv.reader(uniprot_vs_uniprot, delimiter='\t')
    allVSall = { tuple(r[:2]): float(r[2]) for r in reader }

Now allVSall is a mapping offering O(1) lookups; this saves you having to loop over the whole 6 million rows for each and every combination you generate. That is a lot of loops saved.
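For example, with the sample rows above loaded, a single pair lookup is just a dictionary access (the second key is made up to show a miss):

allVSall.get(('Q8DRQ4', 'O08397'))   # 1.0 in the sample data
allVSall.get(('Q8DRQ4', 'NOSUCH'))   # None: pair not present, and no scan over 6 million rows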

Using a collections.defaultdict is easier when reading the all_spot_uniprot list:

from collections import defaultdict

dict_STU = defaultdict(list)
with open('all_spot_uniprot.txt', 'rbU') as spot_to_uniprot:
    reader = csv.reader(spot_to_uniprot, delimiter='\t')
    for row in reader:
        dict_STU[int(row[0])].append(row[2])

There is no need to find a max value here; simply list the keys and pass these to the itertools.permutations and itertools.product functions to generate the combinations.

The following code replicates what yours does in more compact form, with fewer lists, and with O(1) dictionary lookups so fewer loops:

from itertools import permutations, product, ifilter

no_no_data = lambda v: v != 'No Data'
dict_corresp_media = {}

for a, b in permutations(dict_STU.iterkeys(), r=2):
    # retrieve the a and b lists of possible keys, for which we need to loop over their products
    # we filter each for the `No Data` keys
    aval, bval = ifilter(no_no_data, dict_STU[a]), ifilter(no_no_data, dict_STU[b])

    matches = [allVSall[c1, c2] for c1, c2 in product(aval, bval) if (c1, c2) in allVSall]
    dict_corresp_media['{}_{}'.format(a, b)] = sum(matches) / len(matches) if matches else 0

For your input samples, it spits out, in a fraction of a second:

>>> pprint.pprint(dict_corresp_media)
{'10_1': 0,
 '10_2': 0,
 '10_3': 0,
 '10_4': 0,
 '10_5': 0,
 '10_6': 0,
 '10_7': 0,
 '10_8': 0,
 '10_9': 0,
 '1_10': 0.05619465373456477,
 '1_2': 0.12077157944440875,
 '1_3': 0,
 '1_4': 0.12077157944440875,
 '1_5': 0,
 '1_6': 0.07139907404780385,
 '1_7': 0.04759366973303256,
 '1_8': 0.12077157944440875,
 '1_9': 0,
 '2_1': 0,
 '2_10': 0,
 '2_3': 0,
 '2_4': 0,
 '2_5': 0,
 '2_6': 0,
 '2_7': 0,
 '2_8': 0,
 '2_9': 0,
 '3_1': 0,
 '3_10': 0,
 '3_2': 0,
 '3_4': 0,
 '3_5': 0,
 '3_6': 0,
 '3_7': 0,
 '3_8': 0,
 '3_9': 0,
 '4_1': 0,
 '4_10': 0,
 '4_2': 0,
 '4_3': 0,
 '4_5': 0,
 '4_6': 0,
 '4_7': 0,
 '4_8': 0,
 '4_9': 0,
 '5_1': 0,
 '5_10': 0,
 '5_2': 0,
 '5_3': 0,
 '5_4': 0,
 '5_6': 0,
 '5_7': 0,
 '5_8': 0,
 '5_9': 0,
 '6_1': 0,
 '6_10': 0,
 '6_2': 0,
 '6_3': 0,
 '6_4': 0,
 '6_5': 0,
 '6_7': 0,
 '6_8': 0,
 '6_9': 0,
 '7_1': 0,
 '7_10': 0,
 '7_2': 0,
 '7_3': 0,
 '7_4': 0,
 '7_5': 0,
 '7_6': 0,
 '7_8': 0,
 '7_9': 0,
 '8_1': 0,
 '8_10': 0,
 '8_2': 0,
 '8_3': 0,
 '8_4': 0,
 '8_5': 0,
 '8_6': 0,
 '8_7': 0,
 '8_9': 0,
 '9_1': 0,
 '9_10': 0,
 '9_2': 0,
 '9_3': 0,
 '9_4': 0,
 '9_5': 0,
 '9_6': 0,
 '9_7': 0,
 '9_8': 0}
Martijn Pieters

When a script takes extremely long, I do:

python -m cProfile myscript.py

I stop the script after a certain amount of time, and the output sums up the time spent in each function and module. Wherever the figures are biggest, there the problem lies.

Look here for other profiling methods: How can you profile a Python script?
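For instance, a minimal sketch (profile.out and myscript.py are placeholder names): first save the statistics to a file,

python -m cProfile -o profile.out myscript.py

then inspect the most expensive calls with the standard pstats module:

import pstats

# load the saved profile and print the 20 entries with the largest cumulative time
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)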

Besides, as the comments suggest, you could also consider using a dedicated database.
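A minimal sketch of that route, assuming the tab-separated layout shown in the question and the standard sqlite3 module (ssm.db is a placeholder file name):

import csv
import sqlite3

conn = sqlite3.connect('ssm.db')
conn.execute('CREATE TABLE IF NOT EXISTS resnik (ref1 TEXT, ref2 TEXT, dist REAL)')

# load the big file once; the index makes each pair lookup a cheap B-tree search
with open('ssm_resnik.txt', 'rbU') as f:
    conn.executemany('INSERT INTO resnik VALUES (?, ?, ?)',
                     csv.reader(f, delimiter='\t'))
conn.execute('CREATE INDEX IF NOT EXISTS pair_idx ON resnik (ref1, ref2)')
conn.commit()

# later: fetch the distance for one pair of references
row = conn.execute('SELECT dist FROM resnik WHERE ref1=? AND ref2=?',
                   ('Q8DRQ4', 'O08397')).fetchone()
print(row[0] if row else 'no value for this pair')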

Stephane Rolland

It seems like your program's complexity is much higher than your problem's complexity. Using something like cProfile or PyPy (or even reimplementing in C, for that matter) will help you speed up portions of your program, but won't fix the fundamental problem: the nested for loops.

See if you can rewrite the algorithm in a way that is flatter. Remember, this is Python and flat is better than nested.
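As a toy sketch of what "flatter" means here (the rows are made up; the point is to build a lookup table in one pass instead of nesting another scan inside the loops):

# made-up rows standing in for the big pairwise file: (ref1, ref2, distance)
rows = [('A', 'B', '0.5'), ('A', 'C', '0.1')]

# one flat pass builds the lookup table
pair_to_dist = {(r1, r2): float(d) for r1, r2, d in rows}

# afterwards each lookup is a single O(1) step, with no inner loop left
print(pair_to_dist.get(('A', 'B')))   # 0.5
print(pair_to_dist.get(('A', 'Z')))   # None: pair not in the data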

Thane Brimhall
  • I concur. It would be a great help if the OP could post a simple example - two rows of one file, two rows of the other, and *what is the program going to do with those values*. – LSerni Apr 19 '13 at 21:11

If your code has bad asymptotic complexity, no optimisation will help. Think about whether you can simplify your problem. For example, do you really need to check all the N*N pairs? Maybe you can discard some sets of pairs in advance.

Jakub M.