This is my first optimization of the code and I am excited about it. Read some articles but I still have some questions.
1) First of all, what does take so much time in my code below? I think it is array over here: array.append(len(set(line.split()))). I read online that lists in python work faster, but I don't see using lists here. Does anybody know how this can be improved?
2) Are there any other improvements I am missing?
3) Also, online it says that for loop slows down code a lot. Can it be improved here? (I guess write code in C would be best, but :D )
4) Why people suggest to always look at "ncalls" and "tottime"? For me "percall" makes more sense. It tells you how fast is your function or call.
5) Over here in the correct answer Class B he applied lists. Does he? For me I still see an array and a FOR loop that is suppose to slow things down. Fastest way to grow a numpy numeric array
Thank you.
NEW cProfile result:
618384 function calls in 9.966 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
19686 3.927 0.000 4.897 0.000 <ipython-input-120-d8351bb3dd17>:14(f)
78744 3.797 0.000 3.797 0.000 {numpy.core.multiarray.array}
19686 0.948 0.000 0.948 0.000 {range}
19686 0.252 0.000 0.252 0.000 {method 'partition' of 'numpy.ndarray' objects}
19686 0.134 0.000 0.930 0.000 function_base.py:2896(_median)
1 0.126 0.126 9.965 9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
19686 0.125 0.000 0.351 0.000 _methods.py:53(_mean)
19686 0.120 0.000 0.120 0.000 {method 'reduce' of 'numpy.ufunc' objects}
19686 0.094 0.000 4.793 0.000 function_base.py:2747(_ureduce)
19686 0.071 0.000 0.071 0.000 {method 'flatten' of 'numpy.ndarray' objects}
19686 0.065 0.000 0.065 0.000 {method 'format' of 'str' objects}
78744 0.055 0.000 3.852 0.000 numeric.py:464(asanyarray)
NEW CODE:
import numpy
import cProfile
pr = cProfile.Profile()
pr.enable()
#paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"
def f(a):
for i in range(0, len(array)):
if a <= array[i]:
array.insert(i, a)
break
if 0 == len(array):
array.append(a)
try:
with open(read_path) as inf, open(write_path, 'a') as outf:
array = []
#for every line (tweet) in the file
for line in inf: ###Loop is bad. Builtin function is good
#append amount of unique words to the array
wordCount = len(set(line.split()))
#print wordCount, array
f(wordCount)
#write current median of the array to the file
result = "{:.2f}\n".format(numpy.median(array))
outf.write(result)
except IOError as e:
print 'Operation failed: %s' % e.strerror
###Service
pr.disable()
pr.print_stats(sort = 'time')
OLD cProfile result:
551211 function calls in 13.195 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
78744 10.193 0.000 10.193 0.000 {numpy.core.multiarray.array}
OLD CODE:
with open(read_path) as inf, open(write_path, 'a') as outf:
array = []
#for every line in the file
for line in inf:
#append amount of unique words to the array
array.append(len(set(line.split())))
#write current median of the array to the file
result = "{:.2f}\n".format(numpy.median(array))
outf.write(result)