Ahoy. I was tasked with improving the performance of Bit.ly's Data_Hacks sample.py, as a practice exercise.

I have Cythonized part of the code and included a PCG random generator, which has thus far improved performance by about 20 seconds (down from 72s). I have also optimized the print output, by using a basic C function instead of Python's write().
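
For reference, out() is little more than a thin wrapper around C's fputs() - a minimal sketch of the idea rather than the exact code:

from libc.stdio cimport fputs, stdout

cdef void out(char* line):
    # hand the raw bytes straight to C's stdout, bypassing Python's I/O stack
    fputs(line, stdout)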

This has all worked well, but aside from these fix-ups, I'd now like to optimize the loop itself.

The basic function, as seen in Bit.ly's sample.py (imports added so the snippet is self-contained):

import random
import sys

def run(sample_rate):
    input_stream = sys.stdin
    for line in input_stream:
        # keep roughly sample_rate percent of all lines
        if random.randint(1, 100) <= sample_rate:
            sys.stdout.write(line)

My implementation:

cdef int take_sample(float sample_rate):
    cdef unsigned int floor = 1
    cdef unsigned int top = 100
    if pcg32_random() % 100 <= sample_rate:
        return 1
    else:
        return 0


def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            if take_sample(sample_rate):
                out(line)

What I would like to improve on now is specifically skipping the next line (and preferably doing so repeatedly) if my take_sample() doesn't return True.

My current implementation is this:

def run(float sample_rate, file):
    cdef char* line

    with open(file, 'rb') as f:

        for line in f:
            out(line)
            while not take_sample(sample_rate):
                next(f)

This appears to do nothing to improve performance, leading me to suspect I've merely replaced a continue after an if condition at the top of the loop with my next(f).

So the question is this:

Is there a more efficient way to loop over a file (in Cython)?

I'd like to omit lines entirely, meaning they should only be truly accessed if I call my out(). Is this already the case in Python's for loop? Is line a pointer (or something comparable) to the line of the file, or does the loop actually load it?

I realize that I could improve on it by writing it entirely in C, but I'd like to know how far I can push this while staying with Python/Cython.

Update: I've tested a C variant of my code - using the same test case - and it clocks in at under 2s (surprising no one). So, while it is true that the random generator and file I/O are the two major bottlenecks generally speaking, it should be pointed out that Python's file handling is in itself already darn slow.

So, is there a way to make use of C's file reading, other than implementing the loop itself in Cython? The overhead is still slowing the Python code down significantly, which makes me wonder whether I've simply hit the sound barrier of performance when it comes to file handling with Cython.
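
To make the question concrete, this is roughly the direction I have in mind - an untested sketch (getline() is POSIX, so it assumes a POSIX libc; take_sample() and out() are the functions from above):

from libc.stdio cimport FILE, fopen, fclose
from libc.stdlib cimport free

cdef extern from "stdio.h":
    # getline() is POSIX, not ISO C
    ssize_t getline(char **lineptr, size_t *n, FILE *stream)

def run_c(float sample_rate, file):
    cdef FILE* f
    cdef char* line = NULL
    cdef size_t buf = 0
    fname = file.encode()  # keep a reference so the char* stays valid
    f = fopen(fname, "rb")
    if f == NULL:
        raise IOError("could not open %r" % file)
    try:
        # getline() reuses (and grows) a single buffer across all lines
        while getline(&line, &buf, f) != -1:
            if take_sample(sample_rate):
                out(line)
    finally:
        free(line)
        fclose(f)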

deepbrook
  • This writeup might help: http://rabexc.org/posts/io-performance-in-python – Akshat Mahajan May 25 '16 at 06:32
  • Bottlenecks are file IO and the random function. So the original script has no substantial potential for optimization. – Daniel May 25 '16 at 06:34
  • Each iteration on `f` reads a line. A line is defined as the string of bytes up to the next '\n'. The only way the reader finds that is by reading the bytes one by one. For a text file there's no index of line beginnings that it could use to `seek`. So it can't skip to the next line without reading the current one. – hpaulj May 25 '16 at 07:00
  • `pcg32_random() % 100 <= sample_rate` does not look equivalent to `random.randint(1,100) <= sample_rate`. It should be either `pcg32_random() % 100 < sample_rate` or `pcg32_random() % 101 <= sample_rate` depending on whether `pcg32_random()` may return 0 or not. – abukaj May 25 '16 at 07:42
  • It's also a bad idea to use the `%` with random numbers in general, I just found out. Thanks for the pointer though! – deepbrook May 25 '16 at 09:06

1 Answer

If the file is small, you may read it whole with .readlines() at once (possibly reducing IO traffic) and iterate over the resulting list of lines. If the sample rate is small enough, you may also consider sampling from a geometric distribution, which may be more efficient (a sketch follows below).
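
For illustration, a plain Python sketch of the geometric idea (assuming 0 < sample_rate < 100 and keeping the percentage interface of the original script):

import math
import random
import sys

def run_geometric(sample_rate):
    p = sample_rate / 100.0
    stream = sys.stdin
    while True:
        # draw how many lines to discard before the next kept one;
        # floor(log(u) / log(1 - p)) with u in (0, 1] is Geometric(p) on {0, 1, ...}
        u = 1.0 - random.random()
        gap = int(math.log(u) / math.log(1.0 - p))
        for _ in range(gap):
            if not stream.readline():
                return
        line = stream.readline()
        if not line:
            return
        sys.stdout.write(line)

Note that the skipped lines still have to be read (see the comment of @hpaulj), so this saves random-number calls and branches rather than IO.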

I do not know Cython, but I would also consider:

  • simplifying take_sample() by removing the unnecessary variables and returning the boolean result of the comparison instead of an integer,
  • changing the signature of take_sample() to take_sample(int) to avoid an int-to-float conversion on every test (both changes are sketched right below this list).
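
Untested, since as said I do not know Cython, but something like this (bint is Cython's C-level boolean type, and I use < rather than <= as per my comment under the question):

cdef bint take_sample(int sample_rate):
    # no temporaries, no float conversion; the comparison itself is the result
    return pcg32_random() % 100 < sample_rate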

[EDIT]

According to the comment of @hpaulj, it may be better to use .read().split('\n') instead of the .readlines() I suggested.
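
With the file opened in binary mode as in the question, that would be roughly (b'\n' assuming Python 3; on Python 2 a plain '\n' does the same):

with open(file, 'rb') as f:
    # one big read, then an in-memory split; note the trailing '' element
    # if the file ends with a newline
    lines = f.read().split(b'\n')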

abukaj
  • The files are, unfortunately, quite big; the test case has about 90 million lines of text. – deepbrook May 25 '16 at 08:01
  • @j4ck Seems you may use file buffering for the same purpose: http://stackoverflow.com/questions/29712445/what-is-the-use-of-buffering-in-pythons-built-in-open-function – abukaj May 25 '16 at 08:35