Using Cython, I am trying to convert a Python list to a Cython array, and vice versa. The Python list contains numbers in the range 0 to 255, so I declare the array as an unsigned char array. Here is my code to do the conversions:

from libc.stdlib cimport malloc

cdef to_array(list pylist):
    cdef unsigned char *array 
    array = <unsigned char *>malloc(len(pylist) * sizeof(unsigned char))
    cdef long count = 0

    for item in pylist:
        array[count] = item
        count += 1
    return array

cdef to_list(array):
    pylist = [item for item in array]
    return pylist

def donothing(pylist):
    return to_list(to_array(pylist))

The problem is that garbage data appears in the Cython array, and it carries over when the array is converted back to a Python list. For example, donothing should do absolutely nothing and return the Python list to me unchanged. The function exists only to test the conversion, but when I run it I get something like:

In[56]:  donothing([2,3,4,5])
Out[56]: [2, 3, 4, 5, 128, 28, 184, 6, 161, 148, 185, 69, 106, 101]

Where is this data coming from in the code, and how can this garbage be cleaned up so no memory is wasted?

P.S. There may be a better version of taking numbers from a Python list and injecting them into an unsigned char array. If so, please direct me to a better method entirely.

Veedrac
Nick Pandolfi
  • Have you seen this? http://stackoverflow.com/questions/14780007/python-list-to-cython – sundar nataraj Jun 08 '14 at 02:28
  • Yeah, but that question does not have anything to do with the extra bits of data that are added into my lists. The question is in fact different. – Nick Pandolfi Jun 08 '14 at 02:30
  • Why do you assume that just because the number would fit into an `unsigned char` that Python is putting them inside one? My guess is that your math for `malloc(len(pylist) * sizeof(unsigned char))` isn't doing what you expect it to do. – aruisdante Jun 08 '14 at 02:52
  • That may be the problem, but I don't know what a better data type would be; I don't see how an `unsigned char` is causing the problem. I did the calculations for the multiplication, and I get the exact number of bytes that I would need to store the data. – Nick Pandolfi Jun 08 '14 at 02:55
  • My point is that very clearly the size of that array is not being set correctly. That extra stuff is clearly Python reading past the end of the array and into junk data. – aruisdante Jun 08 '14 at 03:05
  • Could you possibly provide an answer with an annotated version of the code, so it can be seen where you think the code is going wrong, with the results? – Nick Pandolfi Jun 08 '14 at 03:10
  • @aruisdante Unfortunately, despite the people agreeing with you, you're off-course. **1.** Cython is casting to a `char` properly. It always will. **2.** `sizeof` is the same as `cython.sizeof` [and exists to work in Pure-Python mode](http://docs.cython.org/src/tutorial/pure.html). **This is not a duplicate.** – Veedrac Jun 08 '14 at 11:11
  • Wouldn't it be nice if admins looked at the two questions before marking them as duplicates? – Nick Pandolfi Jun 08 '14 at 13:25
  • @nickpandolfi Don't assume that people with close votes are admins; close votes are just a privilege you get for having a certain amount of reputation. Reviews are done in a queue, often on languages the reviewer is not familiar with. If you ever find that things have gone wrong with the voting process and you need to tell someone, just make a post on [Meta](http://meta.stackoverflow.com/) and argue your case there. I've reopened this for you already, though. – Veedrac Jun 08 '14 at 13:55

1 Answer

Your to_array has no declared return type, and to_list takes an untyped argument. As a result, Cython is forced to convert the char * to a Python object when the pointer passes between them.

Cython converts it to bytes, because a char * is the closest C analogue of bytes. Unfortunately, without an explicitly given length, Cython assumes that the char * is null-terminated. This is what causes the problem:

convert_lists.donothing([1, 2, 3, 0, 4, 5, 6])
#>>> [1, 2, 3]

When the data contains no zero bytes, Cython just keeps reading until it finds one, running past the end of the allocated memory. That is where the extra values in your output come from.
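
The same null-terminated behaviour can be reproduced outside Cython. The sketch below is only an analogy: it uses Python's standard ctypes module rather than the code from the question, and the variable names are illustrative. The `value` attribute of a ctypes character buffer is read as a null-terminated string, while `raw` returns the whole buffer regardless of its contents.

```python
import ctypes

data = bytes([1, 2, 3, 0, 4, 5, 6])

# A C character buffer holding the same values as the example above.
# Without an explicit size, ctypes appends one terminating zero byte.
buf = ctypes.create_string_buffer(data)

# Null-terminated view: reading stops at the first zero byte.
truncated = list(buf.value)

# Explicit length: every byte we stored is read back.
complete = list(buf.raw[:len(data)])

print(truncated)
print(complete)
```

The first list stops at the embedded zero, matching the truncated result shown above, while the second recovers all seven values because the length is supplied explicitly.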

You can't actually write for x in my_pointer_array for arbitrary Cython pointer types. The for loop in to_list is really iterating over the incorrectly converted bytes object.

You can fix this by typing all values that will hold the char array, passing around the length explicitly and looping over ranges (which will also be faster when the loop variable is typed), or by using a wrapper of some sort. For ideas on what wrapper arrays to use, this question and answer pair has you covered.
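
As one possible wrapper, here is a minimal sketch using the standard-library array module, which stores unsigned chars and carries its own length. It is plain Python (which should also compile as Cython) and is not necessarily the wrapper the linked answer recommends; the function names simply mirror those in the question.

```python
from array import array


def to_array(pylist):
    # 'B' is the typecode for unsigned char; a value outside
    # 0..255 raises OverflowError instead of silently wrapping.
    return array('B', pylist)


def to_list(arr):
    # Loop over an explicit range of the known length, so zero
    # bytes are ordinary data and nothing past the end is read.
    return [arr[i] for i in range(len(arr))]


def donothing(pylist):
    return to_list(to_array(pylist))


print(donothing([1, 2, 3, 0, 4, 5, 6]))
```

Because the array object owns its memory, it is freed by the garbage collector like any other Python object, so no explicit free is needed in this version.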


Please also note that you should be very careful about errors when using manual allocation. malloc'd memory is not garbage collected, so if an error takes you out of a code path before the matching free, that memory leaks. The code in the question never calls free at all, so every call leaks its buffer. You should check how to handle each specific case.

Veedrac
  • Does Cython only assume that because it's a `char` array and thus probably a string? Because you really can't assume an array is null-terminated for any arbitrary list. – aruisdante Jun 08 '14 at 13:12
  • Ah, yeah. I just realised that something else is going on. The same thing is basically happening, but for a slightly different reason. – Veedrac Jun 08 '14 at 13:18