Optimizing Python code for performance

Question

Suppose that song i has been played f_i times but that Zipf’s Law predicts that it would have been played z_i times. Then you define the quality of song i to be q_i = f_i/z_i. Your software should select the songs with the highest values of q_i.

The first line of input contains two integers n and m (1 <= n < 50 000, 1 <= m <= n), the number of songs on the album, and the number of songs to select. Then follow n lines. The i'th of these lines contains an integer f_i and string s_i, where 0 <= f_i < 10^12 is the number of times the i'th song was listened to, and s_i is the name of the song. Each song name is at most 30 characters long and consists only of the characters a-z, 0-9, and underscore (_).

Output a list of the m songs with the highest quality q_i, in decreasing order of quality. If two songs have the same quality, give precedence to the one appearing first on the album (presumably there was a reason for the producers to put that song before the other).

sample input 
4 2
30 one
30 two
15 three
25 four


sample output
four
two

I am pretty new to Python and am trying to solve this puzzle. I think I am getting the right answer, but I need it to run faster. Any recommendations?

from __future__ import division

def main():
    import sys
    from operator import itemgetter

    data = sys.stdin.readlines()
    line1 = data[0].split(" ")
    numberofselect = line1[1]

    qualitydict = {};
    songdict = {};
    a = 0

    for x in range(1, len(data)):
        item = data[x].split(" ");
        item1 = item[1].split("\n");
        f = float(item[0])
        z = float(1/x)
        qualitydict[item1[0]] = (f/z)
        if ((f/z) in songdict.keys()):
            songdict[(f/z)].append(item1[0])
        else:
            songdict[(f/z)] = [item1[0]]

    items = songdict.items()
    items.sort(key = itemgetter(0), reverse=True)

    for key, value in items:
            for element in value:
                if (a < int(numberofselect)):
                    print element
                    a = a + 1

main();

Side note: Taken from a programming challenge: https://www.scrool.se/static/documents/spotify-job-site.pdf — miku, Dec 28 '12 at 01:50
I would encourage you to [*try this method*](http://stackoverflow.com/a/4299378/23771) to find out what part of the code is costing the most time. — Mike Dunlavey, Dec 28 '12 at 02:23
With `from __future__ import division` imported you don't need to cast by `float`. Do `fz = f/z` and replace all `f/z`. — Developer, Dec 28 '12 at 02:40
I think I need to keep the float because I lose precision after some point and therefore get the wrong answer — user1932446, Dec 28 '12 at 06:36

Bakuriu · Accepted Answer · 2012-12-28T08:52:54.310

You can make several improvements, both in readability and in performance (not tested):

from __future__ import division
import sys
from operator import itemgetter
from collections import defaultdict

def main():

    line1 = sys.stdin.readline().split(" ")
    numberofselect = int(line1[1])

    qualitydict = {}
    songdict = defaultdict(list)

    for x, line in enumerate(sys.stdin, start=1):
        tokens = line.split()
        val = float(tokens[0]) * x
        qualitydict[tokens[1]] = val
        songdict[val].append(tokens[1])

    items = songdict.items()
    items.sort(key=itemgetter(0), reverse=True)
    a = 0
    for key, value in items:
        for element in value:
            if a < numberofselect:
                print element
                a += 1

main()

In particular:

Use a defaultdict for songdict. It automatically creates a new list value when the key is not present. Also, do not use key in your_dict.keys() to check whether a key is in a dictionary: in Python 2, keys() builds a list, so that check is O(n). Use key in your_dict, which takes O(1) time on average. With a defaultdict you do not need the check at all; it is done for you.
You define z as 1/x and then compute f/z, but this is the same as f * x, with the only difference that the latter is more precise (x is an integer, so 1/x loses some precision).
I wonder whether it is necessary to sort the items with itemgetter(0). The elements are tuples, so a plain sort() compares them by the first element and falls back to the second only on ties. Since the qualities are dictionary keys and therefore unique, no ties occur here and a plain sort() gives the same order. Note that, even though you might expect sorting with itemgetter(0) to be faster, I don't think that's necessarily true, since you add a function call for each element and Python has to use some extra space to hold the key values.

In fact if we check the timings:

>>> timeit.timeit('L.sort()', 'import random;L = [(random.randint(0, 100), i) for i in range(3000)]', number=10000)
1.3252038955688477
>>> timeit.timeit('L.sort(key=operator.itemgetter(0))', 'import random;import operator;L = [(random.randint(0, 100), i) for i in range(3000)]', number=10000)
2.926893949508667

As the list grows, the relative performance of the itemgetter version improves, but you'll have to check carefully at which point it actually becomes the better choice, because even with 50000 elements it is still slower:

>>> timeit.timeit('L.sort()', 'import random;L = [(random.randint(0, 1000), i) for i in range(50000)]', number=1000)
13.771193027496338
>>> timeit.timeit('L.sort(key=operator.itemgetter(0))', 'import random;import operator;L = [(random.randint(0, 1000), i) for i in range(50000)]', number=1000)
21.419496059417725
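The two calls timed above are not always interchangeable, which matters for this puzzle because ties must keep album order. The following is a minimal sketch with made-up sample pairs (`pairs` is hypothetical data, not taken from the puzzle) showing how a plain tuple sort and a key-based sort differ when first elements are equal:

```python
from operator import itemgetter

# Hypothetical sample data: (quality, name) pairs with repeated qualities.
pairs = [(2, 'b'), (1, 'z'), (2, 'a'), (1, 'c')]

# Plain sort compares whole tuples, so ties on the first element
# fall back to comparing the second element.
plain = sorted(pairs)

# Key sort compares only the first element; Python's sort is stable,
# so tied elements keep their original relative order.
keyed = sorted(pairs, key=itemgetter(0))

print(plain)  # [(1, 'c'), (1, 'z'), (2, 'a'), (2, 'b')]
print(keyed)  # [(1, 'z'), (1, 'c'), (2, 'b'), (2, 'a')]
```

So if the list ever holds one tuple per song rather than one per distinct quality, the key-based sort is the one that preserves input order among equal qualities.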

line.split() without arguments splits on any run of whitespace and discards leading and trailing whitespace.

For example:

>>> 'A string with   some    space,\ttabs and \n\n newlines'.split()
['A', 'string', 'with', 'some', 'space,', 'tabs', 'and', 'newlines']

Which is quite different from:

>>> 'A string with   some    space,\ttabs and \n\n newlines'.split(' ')
['A', 'string', 'with', '', '', 'some', '', '', '', 'space,\ttabs', 'and', '\n\n', 'newlines']
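Regarding the precision concern raised in the comments: with f_i < 10^12 and fewer than 50 000 songs, f_i * i can reach roughly 5 × 10^16, which is above 2^53, so a float may no longer represent it exactly. One possible way around this, sketched below, is to keep everything as integers. The function name `select_songs` and the tuple layout are my own assumptions for illustration, not part of the code above:

```python
def select_songs(lines, m):
    """Return the names of the m best songs.

    `lines` holds 'f_i name' strings in album order (hypothetical helper).
    """
    scored = []
    for position, line in enumerate(lines, start=1):
        plays, name = line.split()
        # q_i = f_i / (1/i) = f_i * i, computed exactly with integers.
        # Negate it so that an ascending sort puts the best song first.
        scored.append((-int(plays) * position, position, name))
    # Ties on quality are broken by album position, the second element.
    scored.sort()
    return [name for _, _, name in scored[:m]]


sample = ["30 one", "30 two", "15 three", "25 four"]
print(select_songs(sample, 2))  # ['four', 'two']
```

Storing the album position in each tuple makes the tie-breaking rule explicit, so a plain sort() is enough and no key function is needed.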
