1

I have a C-library which reads binary data from a file, converts it and stores everything in a large char* to return the data to anything calling it. This works fine within C but using python/Cython I run into problems allocating the memory.

The library prototype is: int readWrapper(struct options opt, char *lineOut);

My pyx file I have:

from libc.string cimport strcpy, memset
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 

    options opt
    int readWrapper(options opt, char *lineOut);

def pyreader(file, date, debug=0):
    import logging
    cdef options options
    
    # Get the filename
    options.filename = <char *>malloc(len(file) * sizeof(char)) 
    options.debug = debug
    # Size of array
    outSize = 50000

    cdef char *line_output = <char *> malloc(outSize * sizeof(char))
    memset(line_output, 1, outSize)
    line_output[outSize] = 0

   # Call reader
   return_val = readWrapper(options, line_output)

   # Create dataframe
   from io import StringIO
   data = StringIO(line_output.decode('UTF-8', 'strict'))
   df = pd.read_csv(data, delim_whitespace=True, header=None)
   # Free memory
   free(line_output)
   return df 

This works fine as long as line_output remains within the size of outSize. But some files are larger so how do I do this dynamically?

EDIT after DavidW's suggestions

The reader wrapper is like:

int readWrapper(struct options opt, char **lineOut)
{
    // Open file for reading
    fp = fopen(opt.filename, "r");

    // Check for valid fp
    if (fp == NULL)
    {
        printf("file pointer is null, aborting\n");
        return (EXIT_FAILURE);
    }
    
    // Allocate memory
    int ARRAY_SIZE = 5000;
    *lineOut = NULL;
    char *outLine = malloc(ARRAY_SIZE * sizeof (char));
    if (outLine == NULL) 
    {
        fprintf(stderr, "Memory allocation failed!");
        return(EXIT_FAILURE);
    } 

    // Create line and multi lines object
    char line[255];
    int numWritten = 0;
    int memIncrease = 10000;

    while (fp != feof)
    {
        // Read part of file
        reader(fp, opt, line);
        size_t num2Write = strlen(line);
        if (ARRAY_SIZE < (numWritten + num2Write + 1))
        {   // Won't fit so enlarge outLine
            ARRAY_SIZE += memIncrease;
            outLine = realloc(outLine, (sizeof *outLine * ARRAY_SIZE));
            if (outLine == NULL)
            {
                fprintf(stderr, "Memory re-allocation failed!");
                return(EXIT_FAILURE);
            }
            sprintf(outLine + numWritten, "%s", line);
            numWritten += num2Write;
        }
    } // data block loop
    *lineOut = outLine;

    if (fp != NULL)
    {
        fclose(fp);
    }
    return (EXIT_SUCCESS);
}

The new pyx file:

from libc.string cimport strcpy, memset
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 

    options opt
    int readWrapper(options opt, char *lineOut);

def pyreader(file, date, debug=0):
    import logging
    cdef options options
    
    # Get the filename
    options.filename = <char *>malloc(len(file) * sizeof(char)) 
    options.debug = debug

    cdef char *line_output = NULL

    # Call reader
    return_val = readWrapper(options, &line_output)

    # Create dataframe
    from io import StringIO
    data = StringIO(line_output.decode('UTF-8', 'strict'))
    df = pd.read_csv(data, delim_whitespace=True, header=None)
    
    # Free memory
    free(line_output)
    free(options.filename)
    return df 

This now works great, but using any printf or fprintf(stdout,...) statements in both the wrapper (C) and the python (pyx) part results in

Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

when using python3 test.py | head. Without the head no error is shown.

Finally, the suggestions regarding filename and its allocation also don't work for me. Using options.filename = file results in TypeError: expected bytes, str found at runtime but compiles. Interestingly, this only happens when I run the python code that calls the wrapper as such: python3 test.py | head. Without the pipe and head the BrokenPipeError is not present. Hence, it's not a big deal but would like to understand what is causing it.

EDIT after some searching on the BrokenPipeError

This BrokenPipeError issue happens with head and not with tail. An explanation of this "error" can be found here: https://stackoverflow.com/a/30091579/2885280

Solution pyx file:

The final reader.pyx file that works with the before mentioned readWrapper.c. Memory allocation is handled by C and clean-up (at the end) by the pyx code.

from libc.stdlib cimport free

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 
        char *DAY;

    options opt
    int readWrapper(options opt, char **lineOut);

def pyreader(file, date, debug=0):
    import logging
    import sys
    import errno
    import pandas as pd
    # Init return valus
    a = pd.DataFrame()
    cdef options options
    cdef char *line_output = NULL

    # logging
    logging.basicConfig(stream=sys.stdout, 
                        format='%(asctime)s:%(process)d:%(filename)s:%(lineno)s:pyreader: %(message)s',
                        datefmt='%Y%m%d_%H.%M.%S',
                        level=logging.DEBUG if debug > 0 else logging.INFO)
    
    try:        
        # Check inputs
        if file is None:
            raise Exception("No valid filename provided")
        if date is None:
            raise Exception("No valid date provided")

        # Get the filename
        file_enc = file.encode("ascii")
        options.filename = file_enc
        # Get date
        day_enc = date.encode('ascii')
        options.DAY = day_enc
        
        try:
            # Call reader
            return_val = readWrapper(options, &line_output)

            if (return_val > 0):
                logging.error("pyreadASTERIX2 failed with exitcode {}".format(return_val))
                return a
        except Exception:
            logging.exception("Error occurred")
            free(line_output)
            return a

        from io import StringIO
        try:
            data = StringIO(line_output.decode('UTF-8', 'strict'))
            logging.debug("return_val: {} and size: {}".format(return_val, len(line_output.decode('UTF-8', 'strict'))))


            a = pd.read_csv(data, delim_whitespace=True, header=None, dtype={'id':str})
            if a.empty:
                logging.error("failed to load {} not enough data to construct DataFrame".format(file))
                return a 
            logging.debug("converted data into pd")
        except Exception as e:
            logging.exception("Exception occured while loading: {} into DataFrame".format(file))
            return a
        finally:
            free(line_output)
        
        logging.debug("Size of df: {}".format(len(a)))
        # Success, return DataFrame
        return  a
    except Exception:
        logging.exception("pyreader returned with an exception:")
        return a
pdj
  • 169
  • 8
  • `line_output[outSize] = 0` - this is one past the end. Plus you've got a couple of memory leaks too. – DavidW Aug 19 '20 at 15:27
  • 1
    The fundamental problem though is that your C library has a completely broken interface - it writes into a c array without any knowledge about the size of the array. There's no fixing this without dropping this library. – DavidW Aug 19 '20 at 15:30
  • I'm in control of the library fortunately. The library is also used by an executable which prints the data to stdout. I also want the same data available in a pandas dataframe, hence this pyx routine and the use of this char array and StringIO. There's no way that the amount of data to be printed or stored in the dataframe is known beforehand due to the files' binary nature unfortunately. Any ideas how to go around this or use something else then this char*? The unpacked data is a line containing (depending on options) between 20 and 25 "columns" per line. – pdj Aug 19 '20 at 18:19
  • I'd have `readWrapper` be responsible for allocating the memory. Either return the array it use a pointer-to-pointer argument. – DavidW Aug 19 '20 at 19:08

1 Answers1

1

You have two basic options:

  1. Work out how to calculate the size in advance.

     size = calculateSize(...)  # for example, by pre-reading the file
     line_output = <char*>malloc(size)
     return_val = readWrapper(options, line_output)
    
  2. Have readWrapper be responsible for allocating memory. There's two commonly-used patterns in C:

    a. return a pointer (perhaps using NULL to indicate an error):

    char* readWrapper(options opt)
    

    b. Pass a pointer-to-a-pointer and change it

    // C 
    int readWrapper(options opt, char** str_out) {
        // work out the length
        *str_out = malloc(length);
        // etc
    }
    
    # Cython
    char* line_out
    return_value = readWrapper(options, &line_out)
    

You need to ensure that all the strings you allocate are cleaned up. You still have a memory leak for options.filename. For options.filename you're probably better off just getting a pointer to the contents of file through Cython. This is valid as long as file exists so no allocation is needed on your part

options.filename = file

Just make sure that options doesn't outlive file (i.e. it doesn't get stored for later use anywhere in C) .

In general

something = malloc(...)
try:
    # code
finally:
    free(something)

is a good pattern for ensuring clean-up.

DavidW
  • 29,336
  • 6
  • 55
  • 86
  • I went along with option 2b and tried a few things. Unfortunately readWrapper is proprietary and I'm not allowed to share it here but it works as long as I remove all my printf statements. fprintf(stderr, "string here") works fine but with stdout I get BrokenPipeErrors which I really do not understand. Regarding the remains leaks, I freed those too so will edit my OP. Using your suggestions results in: TypeError: expected bytes, str found which I also do not understand. Thanks so far! – pdj Aug 20 '20 at 09:23
  • 1
    I'd guess the encoding error is on `options.filename = file`? That's because Python 3 strings are Unicode so don't convert in a single obvious way to a C string. You probably want `file_enc = file.encode("ascii"); options.filename = file_enc`. Obviously ascii may not be the right encoding... Not a clue about the `BrokenPipeErrors` I'm afraid. – DavidW Aug 20 '20 at 10:02
  • It isn't obvious that you actual set the filename that you open - I wonder if that's the source of some of your problems... – DavidW Aug 20 '20 at 10:08
  • Hi David, I'll try the encoding and report back. Also added a new comment regarding the pipe error which is "solved" but still don't understand the issue. Regarding the file, my question now contains the latest version including the readWrapper function which opens the file. – pdj Aug 20 '20 at 11:01
  • 1
    using `options.filename = file.encode('ascii')` yields a compile error: `Storing unsafe C derivative of temporary Python reference` but using your suggested intermediate file_enc works fine. Is this a Cython issue? – pdj Aug 20 '20 at 11:19
  • 1
    @pdj it's because `options.filename` is just a pointer into the internals of a Python bytes object. You need to keep a reference to the Python object for the whole lifetime of `options`. For the `options.filename = file.encode('ascii')` the return value of `file.encode` is only kept briefly so Cython gives you an error (correctly). – DavidW Aug 20 '20 at 11:47
  • thanks for your support. This is all quite clear now and works nice. What remains is the issue with the pipe which only occurs when I use `| head` and readWrapper writes to stdout and the pyx file prints something as well. For now I can do without that but would like to understand it or workaround it. – pdj Aug 20 '20 at 12:31
  • I don't know about the pipe issue. If I were you I'd ask a new question about it - create the smallest possible [mre] needed to reproduce it (i.e. it probably doesn't need all the reading from files and `malloc` etc). Details about what platform you're on/compiler you're using are probably very relevant too. – DavidW Aug 20 '20 at 13:06
  • 2
    I have fount that the pipe issue is the result of how head works (by closing the pipe "prematurely", e.g., tail doesn't break the pipe). See the answer here: https://stackoverflow.com/a/30091579/2885280 – pdj Aug 24 '20 at 07:07