
Originally asked on Are there alternative and portable algorithm implementations for reading lines from a file on Windows (Visual Studio Compiler) and Linux?, but closed as too broad. Here, I am trying to reduce its scope to a more concise use case.

My goal is to implement my own file reading module for Python with Python C Extensions, using a line caching policy. The pure Python implementation, without any line caching policy, is this:

# This takes 1 second to parse 100MB of log data
with open('myfile', 'r', errors='replace') as myfile:
    for line in myfile:
        if 'word' in line: 
            pass

Summarizing the Python C Extensions implementation (see here the full code, with the line caching policy):

// other code to open the file on the std::ifstream object and create the iterator
...

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    PyErr_SetNone( PyExc_StopIteration );
    return NULL;
}

static PyTypeObject PyFastFileType =
{
    PyVarObject_HEAD_INIT( NULL, 0 )
    "fastfilepackage.FastFile" /* tp_name */
};

// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
    PyFastFileType.tp_iternext = (iternextfunc) PyFastFile_iternext;
    Py_INCREF( &PyFastFileType );

    PyObject* thismodule;
    // other module code creating the iterator and context manager
    ...

    PyModule_AddObject( thismodule, "FastFile", (PyObject *) &PyFastFileType );
    return thismodule;
}

And this is the Python code which uses the Python C Extensions code to open a file and read its lines one by one:

from fastfilepackage import FastFile

# This takes 3 seconds to parse 100MB of log data
iterable = FastFile( 'myfile' )
for item in iterable:
    if 'word' in iterable():
        pass

Right now, the Python C Extensions code fastfilepackage.FastFile with C++11 std::ifstream takes 3 seconds to parse 100MB of log data, while the pure Python implementation presented takes 1 second.

The content of the file myfile is just log lines with around 100~300 characters each. The characters are plain ASCII (modulo 256), but due to bugs in the logger engine it can emit invalid ASCII or Unicode characters. Hence, this is why I used the errors='replace' policy while opening the file.
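
For reference, this is what the `errors='replace'` policy does when decoding bytes that are not valid UTF-8 (a minimal sketch; the byte values are made up to mimic logger garbage):

```python
# 0xFF and 0xFE can never appear in valid UTF-8, mimicking the kind of
# garbage a buggy logger engine might emit mid-line.
raw = b'valid log text \xff\xfe more text'

# errors='replace' substitutes U+FFFD (the replacement character) for each
# undecodable byte instead of raising UnicodeDecodeError.
decoded = raw.decode('utf-8', errors='replace')
print(decoded)
```

This mirrors what `PyUnicode_DecodeUTF8( ..., "replace" )` does on the C extension side.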

I just wonder if I can replace or improve this Python C Extension implementation, reducing the 3 seconds time to run the Python program.

I used this to do the benchmark:

import time
import datetime
import fastfilepackage

# usually a file with 100MB
testfile = './myfile.log'

timenow = time.time()
with open( testfile, 'r', errors='replace' ) as myfile:
    for item in myfile:
        if None:
            var = item

python_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=python_time )
print( 'Python   timedifference', timedifference, flush=True )
# prints about 1 second

timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
for item in iterable:
    if None:
        var = iterable()

fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
# prints about 3 seconds

print( 'fastfile_time %.2f%%, python_time %.2f%%' % ( 
        fastfile_time/python_time, python_time/fastfile_time ), flush=True )

Related questions:

  1. Reading file Line By Line in C
  2. Improving C++'s reading file line by line?
Evandro Coan
    Is your test code supposed to be `if 'word' in iterable():`, or did you mean `if 'word' in item:` as before? As written, `if 'word' in iterable():` is nonsensical. – ShadowRanger May 22 '19 at 15:45
    Never mind; I checked your full code and it makes sense (sort of) now. It's costing you performance to implement it that way though; I've updated [my answer](https://stackoverflow.com/a/56260851/364696) to address that. – ShadowRanger May 22 '19 at 16:12
    FYI, [this is a memory leak waiting to happen](https://github.com/evandrocoan/fastfilepackage/blob/c919b6084f4a46ad453b9353bcdb052d502fc278/source/fastfile.cpp#L85). If `next` and `call` aren't perfectly paired, then `next` fails to decref the line it discards; you avoid the leak while they're paired only because `call` is eventually returning an owned reference to the string (meaning your cache is now a borrowed reference), so the `pop_front` doesn't leak. And if you call `call` more than once, you return the same pointer twice, with only one owning reference, which will lead to double free. – ShadowRanger May 22 '19 at 16:29
  • A 100MB log file at 100-300 characters per line results in about 500K lines. This means that about 500K times a zero-terminated C string is converted into a Python Unicode string and then processed in Python. If performance is the primary objective, it may be better to either do the processing in C and pass only the result to Python, or to perform the processing entirely in Python. Presumably both variants should be faster than this approach with 500K calls into native C. – Stephan Schlecht May 22 '19 at 18:58
  • **If `next` and `call` aren't perfectly paired, then next fails to decref the line it discards**, Can I fix this just calling `Py_XDECREF(linecache[0])` before `linecache.pop_front()`? I did a [new commit](https://github.com/evandrocoan/fastfilepackage/commit/d0c823bdbb056847143d752d8c455cae90debb96) with this change. – Evandro Coan May 23 '19 at 01:39
  • **And if you call `call` more than once, you return the same pointer twice, with only one owning reference, which will lead to double free**, Currently, I am calling `Py_XINCREF` on every value returned by call on `fastfilewrapper.cpp`. Would not this fix it? – Evandro Coan May 23 '19 at 01:39
  • @user: "Currently, I am calling Py_XINCREF on every value returned by call on fastfilewrapper.cpp". Ah, I missed that part. Usually, the person returning the value in the first place increfs it, because borrowed vs. owned is part of the closest level API contract. I'll note that `Py_XDECREF` should probably be `Py_CLEAR` if you want to remove the risk of some other bug causing a double-free (`Py_CLEAR` is `Py_XDECREF` followed by setting the value to `NULL`). – ShadowRanger May 23 '19 at 15:03
  • @StephanSchlecht I just benchmarked `std::getline` against Posix `getline`: `std::getline` is 2.5 times slower than the Python builtin readline, while Posix `getline` (not available on msvc) is only 1.5 times slower. In these results, I am excluding the conversion from C string to Python Unicode. When enabling the conversion from C string to Python Unicode, it only added a slowdown factor of 0.4; then `std::getline` was 2.9 times slower, while Posix `getline` was 1.9 times slower. Indeed, the bottleneck is more the getline implementation used than the Python-C interface. – Evandro Coan May 23 '19 at 16:14
  • That's interesting. On *nix-like operating systems, when scanning larger text files, I would expect a simple C-loop with 'readline' and 'strstr' to be almost twice as fast as the Python equivalent shown, where C doesn't work with Unicode, so Python actually doesn't do so badly. The functionality on which something like this is based (such as 'readline' and 'strstr') is usually highly optimized and is therefore not easy to beat with custom implementations (unless you have special cases). I would therefore assume that additional operations like string conversion etc. can simply slow them down. – Stephan Schlecht May 23 '19 at 21:15
  • @StephanSchlecht, I just noticed I forgot to compile the Posix C `getline` with `-O2`. I was compiling it with `-O0`. Now that both Python and Posix C use `-O2`, they matched 1:1, i.e., neither was faster than the other. When I also do the conversion from the `C String` to a `Python Object`, the program runs `1.3` times slower. That is what I am losing in performance for not building a Python Unicode string from the start, as the builtin Python `for line in file` is doing. – Evandro Coan May 23 '19 at 23:05
  • @StephanSchlecht, I just managed to optimize my Posix C `getline` usage (by caching the total buffer size instead of always passing 0) and now the Posix C `getline` beats the Python builtin `for line in file` by `5%`. I guess if I remove all Python and C++ code around the Posix C `getline`, it should gain some more performance. – Evandro Coan May 23 '19 at 23:48
  • @StephanSchlecht, I was also compiling the benchmarks I did with `std::getline` using `-O0` instead of `-O2`. Sadly, its performance is still bad, but now it is only `35%` slower than the Python builtin `for line in file`. However, I managed to improve the C++ performance to only `9%` slower by replacing the usage of `std::getline` with `std::ifstream.getline()`. – Evandro Coan May 24 '19 at 00:17
  • @StephanSchlecht, I was not caching the `std::getline` input string, so the performance was bad. Now I have posted my code as [an answer](https://stackoverflow.com/a/56284547/4934640) for proper understanding. – Evandro Coan May 24 '19 at 01:02
  • Quick test of 10M lines with approx. 200 chars per line, counting bytes and lines containing the substring 'word': simple C with a readline/strstr loop, a pure Python solution, and a C extension (readline in C, substring search in Python): simple C: 4.31s (3.46s user / 0.84s sys) vs. pure Python: 10.81s (9.99s user / 0.80s sys) vs. C extension: 11.28s (10.28s user / 0.96s sys). I expected the difference between the pure C and Python solutions; interesting for me (as you said before) is that the C extension solution is almost equivalent to the pure Python solution. – Stephan Schlecht May 25 '19 at 08:13
  • For me this means: if performance is the most important criterion, I would still recommend doing the computation purely on the C side and only returning the result. Otherwise you can work purely in Python, because I/O is done by highly optimized routines anyway, so you only get (slightly) slower if you merely move I/O to C, but end up with much more complex code. – Stephan Schlecht May 25 '19 at 08:20

2 Answers


Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text oriented read-only file objects are actually three layers:

  1. io.FileIO - Raw, unbuffered access to the file
  2. io.BufferedReader - Buffers the underlying FileIO
  3. io.TextIOWrapper - Wraps the BufferedReader to implement buffered decode to str

While iostream does perform buffering, it's only doing the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering, reading 8 KB chunks out of the BufferedReader and decoding them in bulk to str (when a chunk ends in an incomplete character, it saves off the remaining bytes to prepend to the next chunk), then yielding individual lines from the decoded chunk on request until it's exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).
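
The three layers can be inspected from Python itself (a quick sketch using a throwaway file; the filename is arbitrary):

```python
import io
import tempfile

# create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmpfile:
    tmpfile.write('some log line\n')
    path = tmpfile.name

textfile = open(path, 'r', errors='replace')
print(type(textfile))             # io.TextIOWrapper: buffered decode to str
print(type(textfile.buffer))      # io.BufferedReader: buffers the raw file
print(type(textfile.buffer.raw))  # io.FileIO: raw, unbuffered access
textfile.close()
```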

By contrast, you're consuming a line at a time with std::getline, then decoding a line at a time with PyUnicode_DecodeUTF8, then yielding back to the caller; by the time the caller requests the next line, odds are at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least, left the fastest parts of the cache). A tight loop decoding 8 KB of UTF-8 text is going to go extremely fast; repeatedly leaving the loop and decoding only 100-300 bytes at a time is going to be slower.

The solution is to do roughly what io.TextIOWrapper does: Read in chunks, not lines, and decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish out substrings from the decoded buffer until it's exhausted (don't trim the buffer each time, just track indices). When no more complete lines remain in the decoded buffer, trim the stuff you've already yielded, and read, decode, and append a new chunk.
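
That chunked scheme can be sketched in pure Python with an incremental decoder (`iter_lines` and `chunk_size` are illustrative names, not part of the question's code; a C implementation would follow the same structure):

```python
import codecs

def iter_lines(path, chunk_size=8192):
    """Yield decoded lines by reading fixed-size chunks and decoding in bulk."""
    # the incremental decoder holds back an incomplete trailing UTF-8
    # sequence and prepends it to the next chunk, like io.TextIOWrapper does
    decoder = codecs.getincrementaldecoder('utf-8')('replace')
    buffer = ''  # decoded text not yet yielded
    with open(path, 'rb') as binaryfile:
        while True:
            chunk = binaryfile.read(chunk_size)
            # final=True on EOF flushes any buffered incomplete character
            buffer += decoder.decode(chunk, final=not chunk)
            # fish complete lines out by tracking indices, trimming the
            # buffer once per chunk instead of once per line
            start = 0
            newline = buffer.find('\n', start)
            while newline != -1:
                yield buffer[start:newline + 1]
                start = newline + 1
                newline = buffer.find('\n', start)
            buffer = buffer[start:]
            if not chunk:
                if buffer:  # last line without a trailing newline
                    yield buffer
                return
```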

There is some room for improvement on Python's underlying implementation of io.TextIOWrapper.readline (e.g. they have to construct a Python level int each time they read a chunk and call indirectly since they can't guarantee they're wrapping a BufferedReader), but it's a solid basis for reimplementing your own scheme.

Update: On checking your full code (which is wildly different from what you've posted), you've got other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That's... unfortunate. That's more than doubling the Python interpreter overhead per item (tp_iternext is cheap to call, being quite specialized; tp_call is not nearly so cheap, going through convoluted general purpose code paths, requiring the interpreter to pass an empty tuple of args you never use, etc.; side-note, PyFastFile_tp_call should be accepting a third argument for the kwds, which you ignore, but must still be accepted; casting to ternaryfunc is silencing the error, but this will break on some platforms).
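
The extra per-item cost of the yield-placeholder-then-call design can be mimicked in pure Python (the class names below are illustrative, not from the question's code; the real C-level tp_call overhead is larger still because of its generic argument handling):

```python
import timeit

class YieldsLine:
    """Normal iterator protocol: __next__ returns the line directly."""
    def __init__(self, lines):
        self._lines = iter(lines)
    def __iter__(self):
        return self
    def __next__(self):
        return next(self._lines)

class YieldsNone:
    """Mimics the question's design: __next__ caches the line and returns a
    placeholder, so the caller must also call the object to fetch the line."""
    def __init__(self, lines):
        self._lines = iter(lines)
        self._current = None
    def __iter__(self):
        return self
    def __next__(self):
        self._current = next(self._lines)  # raises StopIteration at the end
        return None
    def __call__(self):
        return self._current

lines = ['line %d\n' % index for index in range(1000)]

def via_next():
    return [item for item in YieldsLine(lines)]

def via_call():
    fastfile = YieldsNone(lines)
    return [fastfile() for _ in fastfile]

# via_call pays two protocol dispatches per item instead of one
print('next only  :', timeit.timeit(via_next, number=200))
print('next + call:', timeit.timeit(via_call, number=200))
```

Both loops produce the same lines; only the per-item dispatch count differs.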

Final note (not really relevant to performance for all but the smallest files): The contract for tp_iternext does not require you to set an exception when the iterator is exhausted, just that you return NULL. As long as no other exception is set, return NULL; alone indicates end of iteration, so you can save some work by removing your call to PyErr_SetNone( PyExc_StopIteration ); entirely.

ShadowRanger
    Extremely minor performance tip: As a rule, C++ templates can't specialize to actual functions, only to function prototypes, which means the function ends up being called through a pointer, which is slightly slower than a direct call. Functors (and lambdas, which are syntactic sugar for functors) are unique types though, and the template fully specializes for them. You may gain a little by changing code like `(self->cppobjectpointer)->call( PyUnicode_DecodeUTF8 );` to `(self->cppobjectpointer)->call( [](const char *s, Py_ssize_t sz, const char *e) { return PyUnicode_DecodeUTF8(s, sz, e); });` – ShadowRanger May 22 '19 at 16:22
  • **tp_call is not nearly so cheap, going through convoluted general purpose code paths**, I use `call()` because it is much more optimized than a simple `object.function()` call. But now, I stopped returning None from `tp_next` and I am actually returning the first object. Thanks! – Evandro Coan May 23 '19 at 01:33
0

These results apply only to compilers on Linux or Cygwin. If you are using the Visual Studio Compiler, the results for std::getline and std::ifstream.getline are 100% or more slower than the Python builtin for line in file iterator.

You will see linecache.push_back( emtpycacheobject ) used around the code because this way I am benchmarking only the time spent reading the lines, excluding the time Python would spend converting the input string into a Python Unicode object. Therefore, I commented out all the lines which call PyUnicode_DecodeUTF8.

These are the global definitions used on the examples:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject;
emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );

I managed to optimize my Posix C getline usage (by caching the total buffer size instead of always passing 0) and now the Posix C getline beats the Python builtin for line in file by 5%. I guess if I remove all Python and C++ code around the Posix C getline, it should gain some more performance:

char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

if( cfilestream == NULL ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) != -1 ) {
        // PyObject* pythonobject = PyUnicode_DecodeUTF8( readline, charsread, "replace" );
        // linecache.push_back( pythonobject );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( cfilestream != NULL) {
    fclose( cfilestream );
    cfilestream = NULL;
}

I also managed to improve the C++ performance to only 20% slower than the builtin Python C for line in file by using std::ifstream.getline():

char* readline = (char*) malloc( linecachesize );
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    // check the stream state returned by getline() instead of eof(),
    // so a failed final read is not treated as a valid line
    if( fileobj.getline( readline, linecachesize ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( readline, fileobj.gcount(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

Finally, I also managed to get only 10% slower performance than the builtin Python C for line in file with std::getline by caching the std::string it uses as input:

std::string line;
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

try {
    line.reserve( linecachesize );
}
catch( const std::exception& error ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    if( std::getline( fileobj, line ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( line.c_str(), line.size(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

After removing all the boilerplate from C++, the performance of the Posix C getline was only 10% slower than the Python builtin for line in file:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );
char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

static PyObject* PyFastFile_tp_call(PyFastFile* self, PyObject* args, PyObject *kwargs) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_iternext(PyFastFile* self, PyObject* args) {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) == -1 ) {
        return NULL;
    }
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_getlines(PyFastFile* self, PyObject* args) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_resetlines(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

static PyObject* PyFastFile_close(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

Values from the last test run where Posix C getline was 10% inferior to Python:

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.695292
FastFile timedifference 0:00:00.796305

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.13%, python_time 0.88%
Python   timedifference 0:00:00.708298
FastFile timedifference 0:00:00.803594

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.14%, python_time 0.88%
Python   timedifference 0:00:00.699614
FastFile timedifference 0:00:00.795259

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.699585
FastFile timedifference 0:00:00.802173

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.703085
FastFile timedifference 0:00:00.807528

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.17%, python_time 0.85%
Python   timedifference 0:00:00.677507
FastFile timedifference 0:00:00.794591

$ /bin/python3.6 fastfileperformance.py fastfile_time 1.20%, python_time 0.83%
Python   timedifference 0:00:00.670492
FastFile timedifference 0:00:00.804689
Evandro Coan