Ahoy. I was tasked with improving the performance of Bitly's data_hacks sample.py as a practice exercise.
I have Cythonized part of the code and included a PCG random generator, which has so far cut the runtime by about 20 seconds (down from 72s). I also optimized the printed output by using a basic C function instead of Python's `write()`.
This has all worked well, but aside from these fix-ups, I'd now like to optimize the loop itself.
The basic function, as seen in Bitly's sample.py:

```python
def run(sample_rate):
    input_stream = sys.stdin
    for line in input_stream:
        if random.randint(1, 100) <= sample_rate:
            sys.stdout.write(line)
```
My implementation:

```cython
cdef int take_sample(float sample_rate):
    cdef unsigned int floor = 1
    cdef unsigned int top = 100
    if pcg32_random() % 100 <= sample_rate:
        return 1
    else:
        return 0

def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            if take_sample(sample_rate):
                out(line)
```
What I would like to improve on now is specifically skipping the next line (and preferably doing so repeatedly) if my `take_sample()` doesn't return True.
My current implementation is this:

```cython
def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            out(line)
            while not take_sample(sample_rate):
                next(f)
```
This appears to do nothing to improve performance, leading me to suspect I've merely replaced a `continue` after an `if` condition at the top of the loop with my `next(f)` call.
So the question is this:
Is there a more efficient way to loop over a file (in Cython)? I'd like to omit lines entirely, meaning a line should only truly be accessed if I call my `out()` on it. Is this already the case in Python's `for` loop? Is `line` a pointer (or something comparable) to the line of the file, or does the loop actually load each line?
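For what it's worth, a quick plain-Python experiment (using `io.BytesIO` as a stand-in for a file opened in `'rb'` mode) suggests the loop does load each line: every iteration yields a complete, independent `bytes` object rather than any kind of view into the file:

```python
import io

# Stand-in for a real file opened in 'rb' mode; iteration behaves the same.
f = io.BytesIO(b"first line\nsecond line\nthird line\n")

lines = [line for line in f]

# Each item is a fully materialized bytes object, newline included.
assert all(isinstance(line, bytes) for line in lines)
assert lines == [b"first line\n", b"second line\n", b"third line\n"]
```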
I realize that I could improve on it by writing it entirely in C, but I'd like to know how far I can push this while staying with Python/Cython.
Update: I've tested a C variant of my code, using the same test case, and it clocks in at under 2s (surprising no one). So while it is true that the random generator and file I/O are the two major bottlenecks generally speaking, it should be pointed out that Python's file handling is itself already quite slow.
So, is there a way to make use of C's file reading, other than implementing the loop itself in Cython? The overhead is still slowing the Python code down significantly, which makes me wonder whether I've simply hit the sound barrier of performance when it comes to file handling with Cython.