3

This is my first optimization of the code and I am excited about it. Read some articles but I still have some questions.

1) First of all, what does take so much time in my code below? I think it is array over here: array.append(len(set(line.split()))). I read online that lists in python work faster, but I don't see using lists here. Does anybody know how this can be improved?

2) Are there any other improvements I am missing?

3) Also, online it says that for loop slows down code a lot. Can it be improved here? (I guess write code in C would be best, but :D )

4) Why people suggest to always look at "ncalls" and "tottime"? For me "percall" makes more sense. It tells you how fast is your function or call.

5) Over here in the correct answer Class B he applied lists. Does he? For me I still see an array and a FOR loop that is suppose to slow things down. Fastest way to grow a numpy numeric array

Thank you.

NEW cProfile result:

 618384 function calls in 9.966 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    19686    3.927    0.000    4.897    0.000 <ipython-input-120-d8351bb3dd17>:14(f)
    78744    3.797    0.000    3.797    0.000 {numpy.core.multiarray.array}
    19686    0.948    0.000    0.948    0.000 {range}
    19686    0.252    0.000    0.252    0.000 {method 'partition' of 'numpy.ndarray' objects}
    19686    0.134    0.000    0.930    0.000 function_base.py:2896(_median)
        1    0.126    0.126    9.965    9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
    19686    0.125    0.000    0.351    0.000 _methods.py:53(_mean)
    19686    0.120    0.000    0.120    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    19686    0.094    0.000    4.793    0.000 function_base.py:2747(_ureduce)
    19686    0.071    0.000    0.071    0.000 {method 'flatten' of 'numpy.ndarray' objects}
    19686    0.065    0.000    0.065    0.000 {method 'format' of 'str' objects}
    78744    0.055    0.000    3.852    0.000 numeric.py:464(asanyarray)

NEW CODE:

import numpy
import cProfile

pr = cProfile.Profile()
pr.enable()

#paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"


def f(a):  
    for i in range(0, len(array)):
        if a <= array[i]:
            array.insert(i, a)
            break
    if 0 == len(array):
        array.append(a)

try:
    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line (tweet) in the file
        for line in inf:                                            ###Loop is bad. Builtin function is good
            #append amount of unique words to the array
            wordCount = len(set(line.split()))
            #print wordCount, array
            f(wordCount)
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
except IOError as e:
    print 'Operation failed: %s' % e.strerror


###Service
pr.disable()
pr.print_stats(sort = 'time')

OLD cProfile result: 551211 function calls in 13.195 seconds Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function) 78744 10.193 0.000 10.193 0.000 {numpy.core.multiarray.array}

OLD CODE:

    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line in the file
        for line in inf:                            
            #append amount of unique words to the array
            array.append(len(set(line.split())))
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
Community
  • 1
  • 1
Sergey
  • 67
  • 5

1 Answers1

1

Numpy uses a meadian finding algorithm that is O(n log n). You call numpy.meadian once per line, so your algorithm ends up being O(n^2 log n).

There are several ways to improve on this. One is to keep the array sorted (i.e. insert each element in the position that maintains the sorted order). Each insert takes O(n) (inserting into an array is a linear time operation), and getting the median of a sorted array is O(1), so this ends up being O(n^2).

For profiling, the main thing you want to look at is tottime, since this tells you how much time your program spent in the function in total. As in your example, percall sometimes is not very useful, because sometimes, if you have a slow function (high percall) but it is only called a few times (low numcalls), the tottime ends up being insignificant compared to the other functions.

James
  • 1,198
  • 8
  • 14
  • Did you make this conclusion based on cProfile output? If yes, how do you see it? This line is the only hint why it takes so long "numpy.core.multiarray.array". But it does not say anything about median function. – Sergey Jul 11 '15 at 23:36
  • The median function is part of the numpy library. `{numpy.core.multiarray.array}` in cProfile is the time spent on *everything* in the numpy library. I made the conclusion since you only have one call to a `numpy` library function in your loop- the call to `numpy.median`. You can check this by wrapping the call to `median` in a new function, and see the `tottime` spent in that function in cProfile. – James Jul 11 '15 at 23:57
  • Hm!! Very interesting. I implemented sorted insertion into the array. Now the total time to run the code went down to 9.966 from 11.873. I will post my new solution on the top. Looks like I code spends 3.797 vs 9.309 seconds on {numpy.core.multiarray.array}, but then the function I created takes another 3.927 seconds. Therefore 3.797 + 3.927 = 7.724seconds. Which is kind of the same. Am I doing something wrong? – Sergey Jul 12 '15 at 00:50
  • 1
    First of all, you should try using the built-in [insort_left](https://docs.python.org/2/library/bisect.html), as python's implementation should be more optimized. Second, it looks like `numpy` is still doing a bunch of work for finding the median of a sorted array. `numpy` doesn't know that the array is sorted, and thus will call its internal sort algorithm (I believe it is quicksort) on the sorted array, which happens to be faster since your array is already sorted. However, you can skip `numpy` altogether, since finding the median of a sorted array can be done by just taking the middle – James Jul 12 '15 at 00:55
  • 1
    (continued) index. You can take `len` of the array (which is constant-time), and if `len` is odd, just return `array[len/2]`. If `len` is even, you will need to take the average of the two middle elements. – James Jul 12 '15 at 00:59
  • Great!! Thank you so much for help. I brought it to 0.225 seconds, can you imagine. And it was 11 seconds before. insort() and implementation of your own median worked out – Sergey Jul 12 '15 at 22:43