
This code runs quickly on the sample data, but when iterating over a large file, it seems to run slowly, perhaps because of the nested for loops. Is there any other reason why iterating over items in a defaultdict is slow?

import itertools

sample_genes1={"0002":["GENE1", "GENE2", "GENE3", "GENE4"],
               "0003":["GENE1", "GENE2", "GENE3", "GENE6"],
               "0202":["GENE4", "GENE2", "GENE1", "GENE7"]}

def get_common_gene_pairs(genelist):
    genedict={}
    for k,v in sample_genes1.items():
        listofpairs=[]
        for i in itertools.combinations(v,2):
            listofpairs.append(i)
            genedict[k]=listofpairs
    return genedict

from collections import namedtuple,defaultdict
def get_gene_pair_pids(genelist):
    i=defaultdict(list)
    d=get_common_gene_pairs(sample_genes1)
    Pub_genes=namedtuple("pair", ["gene1", "gene2"])
    for p_id,genepairs in d.iteritems():
        for p in genepairs:
            thispair=Pub_genes(p[0], p[1])
            if thispair in i.keys():
                i[thispair].append(p_id)
            else:
                i[thispair]=[p_id,]
    return i

if __name__=="__main__":
    get_gene_pair_pids(sample_genes1)
  • One of the first things you should do to solve problems like this is profile your code. See [_How can you profile a Python script?_](http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) to find out how easy it is. – martineau Jan 10 '17 at 23:10

2 Answers


One big problem: this line:

if thispair in i.keys():

doesn't take advantage of the dictionary's hash-based lookup: in Python 2, keys() builds a whole list, so in scans it linearly. Drop the keys() call and let the dictionary do its fast lookup:

if thispair in i:
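As a side note, testing membership with in on a defaultdict never inserts a key; only subscripting triggers the default factory. A quick sketch:

```python
from collections import defaultdict

d = defaultdict(list)
print("x" in d)   # False: membership tests never create entries
d["x"].append(1)  # subscripting triggers the list factory
print("x" in d)   # True
print(d["y"])     # []  -- this subscript *does* insert "y"
print(len(d))     # 2
```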

BUT since i is a defaultdict, which creates an empty list whenever a missing key is subscripted, you can just replace:

if thispair in i.keys():
    i[thispair].append(p_id)  # i is defaultdict: even if thispair isn't in the dict, it will create a list and append p_id.
else:
    i[thispair]=[p_id,]

by this simple statement:

i[thispair].append(p_id)

(it's even faster because thispair is hashed only once)

To sum it up:

  • don't do thispair in i.keys(): in Python 2 it builds a whole list just to scan it; in Python 3 it adds needless overhead, defaultdict or not
  • you defined a defaultdict, but your code treats it like a plain dict, which works but is slower.
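Putting both fixes together, the whole function can be sketched like this (same names as in the question; the intermediate get_common_gene_pairs step is folded into one loop):

```python
import itertools
from collections import defaultdict, namedtuple

Pub_genes = namedtuple("pair", ["gene1", "gene2"])

def get_gene_pair_pids(genelist):
    index = defaultdict(list)
    for p_id, genes in genelist.items():
        for g1, g2 in itertools.combinations(genes, 2):
            # no membership test needed: defaultdict creates the list on first access
            index[Pub_genes(g1, g2)].append(p_id)
    return index

sample = {"0002": ["GENE1", "GENE2"],
          "0003": ["GENE1", "GENE2"]}
pairs = get_gene_pair_pids(sample)
print(pairs[Pub_genes("GENE1", "GENE2")])  # ['0002', '0003']
```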

Note: without a defaultdict you could have just removed the .keys(), or done the same thing with a plain dict and setdefault:

i.setdefault(thispair, []).append(p_id)

(setdefault inserts the empty list only when the key is missing, then returns the stored value, so the append always works)
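For instance, with a plain dict (toy data, just to show the pattern):

```python
index = {}
for key, value in [("a", 1), ("a", 2), ("b", 3)]:
    # setdefault returns the existing list, or inserts and returns a new empty one
    index.setdefault(key, []).append(value)
print(index)  # {'a': [1, 2], 'b': [3]}
```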

Aside:

def get_common_gene_pairs(genelist):
    genedict={}
    for k,v in sample_genes1.items():  # should be genelist, not the global variable, you're not using the genelist parameter at all

More generally, both functions ignore their genelist parameter and read the global sample_genes1 instead.

Jean-François Fabre

Adding to Jean's answer, your get_common_gene_pairs function could be optimized with a dict comprehension:

def get_common_gene_pairs(genelist):
    return {k : list(itertools.combinations(v,2)) for k,v in genelist.items()}

Repeated list.append() calls are more time-consuming than the equivalent list comprehension. Also, you don't have to iterate over itertools.combinations(v, 2) yourself to build a list from it: passing it to list() does that for you.
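As a rough illustration, timeit can compare the two styles (a sketch only; absolute numbers are machine-dependent):

```python
import timeit

def with_append(n):
    out = []
    for i in range(n):
        out.append(i * i)  # method lookup + call on every iteration
    return out

def with_comprehension(n):
    return [i * i for i in range(n)]  # built in one pass, no append calls

# Both produce the same list; the comprehension avoids the per-item
# attribute lookup and method-call overhead of out.append.
assert with_append(1000) == with_comprehension(1000)
t_append = timeit.timeit(lambda: with_append(1000), number=2000)
t_comp = timeit.timeit(lambda: with_comprehension(1000), number=2000)
print(f"append: {t_append:.3f}s  comprehension: {t_comp:.3f}s")
```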

Here is the comparison I made between list comprehension and list.append() in answer to Comparing list comprehensions and explicit loops, in case you are interested in taking a look.

Moinuddin Quadri
  • great! I love learning more about the theoretical basis of why I should be doing things a certain way, thank you very much. I am going to work on adding list comprehensions to my repertoire. – user3089348 Jan 11 '17 at 13:04