The problem is that I currently have to call the POSIX C getline function to read a line from the file, then convert it to a Python Unicode object using PyUnicode_DecodeUTF8, and only then cache it using my caching policy algorithm. This process is 23% slower than the Python builtin for line in file C implementation.
If I remove the PyUnicode_DecodeUTF8 call from my code, my implementation using the POSIX C getline becomes 5% faster than the Python builtin for line in file C implementation. So, if I could make Python give me a Python Unicode string object directly, instead of calling the POSIX C getline function first and only then converting its result to a Python Unicode object, my code's performance would improve by almost 20% (out of a maximum of 23%). It would not fully match the performance of for line in file, because I do a little extra work caching lines, but that overhead is minimal.
For example, I would like to take the _textiowrapper_readline() function and use it in my code like this:
#include <Python.h>
#include <textio.c.h> // CPython file defining:
                      // _textiowrapper_readline(),
                      // CHECK_ATTACHED(),
                      // PyUnicode_READY(), etc

typedef struct
{
    PyObject_HEAD
}
PyMymoduleExtendingPython;

static PyObject*
PyMymoduleExtendingPython_iternext(PyMymoduleExtendingPython* self, PyObject* args)
{
    PyObject *line;
    CHECK_ATTACHED(self);
    line = _textiowrapper_readline(self, -1); // <- function from `textio.c`
    if (line == NULL || PyUnicode_READY(line) == -1)
        return NULL;
    if (PyUnicode_GET_LENGTH(line) == 0) {
        /* Reached EOF or would have blocked */
        Py_DECREF(line);
        Py_CLEAR(self->snapshot);
        self->telling = self->seekable;
        return NULL;
    }
    return line;
}

// create my module
PyMODINIT_FUNC PyInit_mymodule_extending_python_api(void)
{
    PyObject* mymodule;
    PyMymoduleExtendingPython.tp_iternext =
        (iternextfunc) PyMymoduleExtendingPython_iternext;
    Py_INCREF( &PyMymoduleExtendingPython );
    PyModule_AddObject( mymodule, "FastFile", (PyObject*) &PyMymoduleExtendingPython );
    return mymodule;
}
How could I include the textio implementation from CPython and reuse its code in my own Python C Extension/API?
As presented in my last question, How to improve Python C Extensions file line reading?, the Python builtin methods for reading lines are faster than writing my own with C or C++ standard methods to get lines from a file.
In an answer to that question, it was suggested that I reimplement the Python algorithm by reading chunks of 8KB and only then calling PyUnicode_DecodeUTF8 to decode them, instead of calling PyUnicode_DecodeUTF8 on every line I read.
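To make the suggestion concrete, here is a rough sketch of only the chunked-read half of that idea: reading the file in 8 KB blocks with fread and stitching together lines that span a chunk boundary. It is plain C with no Python calls. The names read_lines_chunked, line_cb and add_length are hypothetical and used only for illustration, and the decoding step is left out entirely (a real version would also have to deal with UTF-8 sequences split across chunks).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 8192 /* 8 KB, the chunk size mentioned in the answer */

/* Hypothetical callback type: receives one line, including its trailing
 * '\n' when the line has one. The buffer is only valid during the call. */
typedef void (*line_cb)(const char *line, size_t len, void *ctx);

/* Example callback: adds each line's length to the size_t pointed to by ctx. */
static void
add_length(const char *line, size_t len, void *ctx)
{
    (void)line;
    *(size_t *)ctx += len;
}

/* Read fp in CHUNK_SIZE blocks and call cb once per line.
 * Returns the number of lines seen, or -1 if memory allocation fails. */
static long
read_lines_chunked(FILE *fp, line_cb cb, void *ctx)
{
    char chunk[CHUNK_SIZE];
    char *pending = NULL;   /* partial line carried over between chunks */
    size_t pending_len = 0;
    long count = 0;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        size_t start = 0;
        while (start < n) {
            char *nl = memchr(chunk + start, '\n', n - start);
            size_t piece = nl ? (size_t)(nl - (chunk + start)) + 1 : n - start;

            if (pending_len > 0 || nl == NULL) {
                /* The line started in an earlier chunk, or does not end
                 * in this one: accumulate it in the pending buffer. */
                char *grown = realloc(pending, pending_len + piece);
                if (grown == NULL) {
                    free(pending);
                    return -1;
                }
                pending = grown;
                memcpy(pending + pending_len, chunk + start, piece);
                pending_len += piece;
                if (nl != NULL) {
                    cb(pending, pending_len, ctx);
                    count++;
                    pending_len = 0;
                }
            } else {
                /* The whole line lies inside this chunk: no copy needed. */
                cb(chunk + start, piece, ctx);
                count++;
            }
            start += piece;
        }
    }
    if (pending_len > 0) {  /* final line without a trailing newline */
        cb(pending, pending_len, ctx);
        count++;
    }
    free(pending);
    return count;
}
```

In the extension, the callback would be the place where a line is handed over to the caching code. Whether decoding per chunk rather than per line actually recovers the lost performance is something I would still have to measure.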
However, instead of rewriting all the CPython code that already exists for reading lines, I could just call its "getline" function, _textiowrapper_readline(), to get the line directly as a Python Unicode object, then cache and use it as I already do with the lines I get from the POSIX C getline function (which I pass to PyUnicode_DecodeUTF8() to decode them into Python Unicode objects).