
Okay, hear me out here; this isn’t as dumb of a question as you might think.

First, some background: I recently started playing with the ctypes module, and as a tech test I wanted to write a Mandelbrot explorer using pygame and ctypes for event handling and accessing a Mandelbrot calculating dll, respectively. My original plan was to minimize the ctypes wrapper overhead by getting the Mandelbrot function to calculate and store the values for an entire row of pixels in a character array and return a pointer to that array:

Mandelbrot.restype = c_char_p
#...
str_location = Mandelbrot(x)
row = str_location.value

It turned out this didn’t really work, though. Reading the value has two flaws: it degrades performance, since it copies the C string byte by byte into a Python string, and it doesn’t know the intended length of the string, so any zero byte in the data is treated as a null terminator and everything after it is lost.
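The truncation problem can be reproduced without any DLL. The following is a minimal sketch, not the original Mandelbrot code: the five-byte `row_data` is made up for illustration, and the snippet uses bytes literals so that it also runs on Python 3. It contrasts the null-terminated read with two length-aware reads that ctypes provides.

```python
import ctypes

# Hypothetical row of pixel values containing a zero byte in the middle.
row_data = b"\x01\x02\x00\x03\x04"
buf = (ctypes.c_char * len(row_data)).from_buffer_copy(row_data)

truncated = buf.value    # stops at the first zero byte
full = buf.raw           # returns every byte of the array

# With only an address and a known length, string_at reads exactly that many bytes.
via_pointer = ctypes.string_at(ctypes.addressof(buf), len(row_data))

print(repr(truncated), repr(full), repr(via_pointer))
```

Both length-aware reads still copy the data, so they address the lost-data flaw but not the copying overhead.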

My first course of action was hacking together a quick DLL allowing me to disassemble some Python objects. It had the following two functions:

#define DLLINFO extern "C" __declspec(dllexport)
DLLINFO char show_char(char *p)
{
    return *p;
}
DLLINFO void mov(char *p, char payload)
{
    *p = payload;
}

I also packaged the show_char function in a Python function, show_object, which used sys.getsizeof to print the memory contents of a Python object. Disassembling the string revealed a pretty straightforward design:

>>> from hack import *; import sys
>>>
>>> #string experiment
>>> a = '01234567'
>>> hex(sys.getrefcount(a))
'0x3'
>>> hex(id(type(a)))
'0x1e1d81f8'
>>> hex(len(a))
'0x8'
>>> show_object(a)
  3  2  1  0 byte

  0  0  0  4   0    #reference count (+1 temporary reference)
 1e 1d 81 f8   4    #pointer to type
  0  0  0  8   8    #length
 94  b b6 98  12    #???
  0  0  0  1  16    #???
 33 32 31 30  20    #Data '0123' (little endian)
 37 36 35 34  24    #Data '4567'
           0  28    #Null terminator
>>> #sys.getsizeof reported 29 bytes for 9 bytes of data.

(data comments added afterwards)
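For readers who want to reproduce a dump like this without building a DLL, here is a rough sketch using only ctypes. It assumes CPython, where `id()` returns the object's address. The names `dump_object` and `format_dump` are invented for this example and are not the original `show_object`. On a 64-bit or Python 3 interpreter the header fields are wider than in the 32-bit dump above, so the offsets will differ.

```python
import ctypes
import sys


def dump_object(obj):
    """Copy the raw bytes of obj out of memory (CPython only: id() is the address)."""
    return ctypes.string_at(id(obj), sys.getsizeof(obj))


def format_dump(raw, width=4):
    """Format raw as rows of `width` bytes, highest address first, then the row offset."""
    lines = []
    for offset in range(0, len(raw), width):
        chunk = bytearray(raw[offset:offset + width])   # iterate as integers
        cells = " ".join("%2x" % b for b in reversed(chunk))
        lines.append("%s  %3d" % (cells.rjust(3 * width - 1), offset))
    return "\n".join(lines)


sample = b"01234567"
raw = dump_object(sample)
print(format_dump(raw))
```

The sketch dumps a byte string on purpose: for objects tracked by the garbage collector, `sys.getsizeof` also counts a header stored before the object's address, so reading that many bytes from `id(obj)` would run past the end of the object.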

I tried replacing the string with a mutable bytearray, and I disassembled a bytearray to see where I should write my Mandelbrot data to:

>>> #bytearray experiment
>>> b = bytearray('01234567')
>>> hex(sys.getrefcount(b))
'0x2'
>>> hex(id(type(b)))
'0x1e1e5e20'
>>> hex(len(b))
'0x8'
>>> show_object(b)
  3  2  1  0 byte

  0  0  0  3   0    #reference count (+1 temporary reference)
 1e 1e 5e 20   4    #pointer to type
  0  0  0  8   8    #length
  0  0  0  0  12    #???
  0  0  0  9  16    #???
  2 3a 63 a0  20    #???
  2 92 93 38  24    #???
  2 91 e4 90  28    #???
           1  32    #???
>>> #sys.getsizeof reported 33 bytes for 8 bytes of data

Well, I couldn’t figure out where the data went in the bytearray, so no dice.
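For what it's worth, a bytearray does not store its data inline. The object header holds a pointer to a separately allocated buffer, which is why the characters never show up in the dump. That buffer can be reached without guessing at offsets: ctypes array types have a `from_buffer` method that shares memory with any writable buffer object. In the sketch below, `ctypes.memmove` stands in for the Mandelbrot DLL call, and the pixel values are made up.

```python
import ctypes

row = bytearray(8)                                   # mutable buffer owned by Python
view = (ctypes.c_char * len(row)).from_buffer(row)   # ctypes array sharing row's memory

# Stand-in for the DLL call: a C function handed `view` writes straight into `row`.
pixels = b"\x05\x00\x07\x09"                         # note the embedded zero byte
ctypes.memmove(view, pixels, len(pixels))

print(list(row))
```

Nothing is copied back afterwards and zero bytes survive, because the length comes from the bytearray rather than from a terminator.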

My next plan was to replace the string with the mutable string type built into ctypes, created with create_string_buffer.

>>> #buffer experiment
>>> from ctypes import *
>>> c = create_string_buffer('01234567')
>>> hex(id(type(c)))
'0x1ceb778'
>>> show_object(c)
  3  2  1  0 byte

  0  0  0  3   0    #reference count
  1 ce b7 78   4    #pointer to type
  2 38 f7 38   8    #???
  0  0  0  1  12    #Here be dragons
  0  0  0  0  16    #etc.
  0  0  0  9  20
  0  0  0  9  24
  0  0  0  0  28
  0  0  0  0  32
  0  0  0  0  36
 33 32 31 30  40    #data '0123'
 37 36 35 34  44    #data '4567'
  0  0  0  0  48
  0  0  0  0  52
  0  0  0  0  56
  0  0  0  0  60
  2 38 f8 40  64
  2 38 f7 a0  68
 ff ff ff fe  72
  0 2e  0 65  76
>>> #sys.getsizeof reported 80 bytes for 9 bytes of data.

Hmm. At least the data is in there somewhere. Unfortunately, this object’s much too verbose to be practical. Also, it’s not a built-in type, so I had difficulty getting it to work with other functions. This is when I decided to switch back to the string and run some cautious tests modifying the string:
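If the sticking point is handing the buffer to code that expects a built-in type, the conversions below may be enough. This is a small sketch written for Python 3. It relies only on the `.raw` attribute and on the buffer protocol that ctypes arrays support.

```python
import ctypes

buf = ctypes.create_string_buffer(b"01234567")   # 8 data bytes plus a trailing NUL

as_bytes = buf.raw              # immutable copy of all 9 bytes
as_bytearray = bytearray(buf)   # mutable copy, made through the buffer protocol
window = memoryview(buf)        # no copy: a view onto the same memory

buf[0] = b"X"                   # change the underlying buffer
print(window.tobytes(), as_bytes)
```

Only the memoryview reflects the later write. The two copies keep the contents they were created with.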

>>> from hack import *
>>> s = "Hello, world!"
>>> show_object(s)
  3  2  1  0 byte

  0  0  0  3   0
 1e 1d 81 f8   4
  0  0  0  d   8
 8f 8d ce 9c  12
  0  0  0  0  16
 6c 6c 65 48  20
 77 20 2c 6f  24
 64 6c 72 6f  28
        0 21  32
>>> mov(id(s)+32, 63)
>>> print s
Hello, world?
>>> mov(id(s)+8,5)
>>> print s
Hello

So far so good. At least nothing crashed the few times I did this. In fact, even lowering the length didn’t cause any immediate issue. (I’m not planning to do that, though.) So, why am I asking this question after laying out data showing that strings can be modified?

First, I know that it is possible for hardware to mark a string as immutable, and an attempt to modify such a string may cause a segfault or a similar issue:

char good_string[80];
good_string[8] = '!'; //Everything's okay so far.
char* bad_string = "This string's made out of const chars, beware!";
bad_string[8] = '!'; //And now you've got a segfault!

Second, and more importantly, I don’t know enough about Python’s inner workings to feel confident bypassing Python’s lock on strings and toying with undefined behavior. Now, it’s easy enough for me to convince myself that the Python FAQ’s stated reasons for string immutability are wrong (I’m not changing the size of strings, and strings are not elemental like integers), but I do not know whether there is some hidden reason strings should not be modified, one that will make something blow up in my face if I do what I plan to do. This is the primary reason I submitted this question; I’m hoping someone with more knowledge would care to enlighten me.

Well thanks, you read the whole question. Sorry, brevity is not my strong suit. :)

  • Side note: didn't read through the question, just the title, but in order for a string to be immutable on the hardware level, it needs to be located in ROM. If a string is located in a read-only memory segment of the executable image, then it is indeed immutable, but on a low-SW level (bus configuration), not on a HW level. And all of this is regardless of Python, C, or any other programming language for that matter. – barak manos Jan 19 '15 at 09:07
  • This _might_ help you - 2 months ago I posted Python code here for [a resumable SHA-256 calculator](http://stackoverflow.com/a/26878137/4014959) that uses ctypes to access libeay32.dll or libssl.so, as appropriate. It uses a custom type called `HashBuffType` to pass a fixed size buffer of unsigned chars called `hashbuff` to the library. The library fills `hashbuff` and the Python code converts it using `bytearray(hashbuff)`. I don't know how efficient it is, but it's pretty fast. – PM 2Ring Jan 19 '15 at 10:09
  • See [Why are python strings and tuples are made immutable?](http://stackoverflow.com/q/1538663/222914) for more background. – Janne Karila Jan 19 '15 at 14:56
  • String interning is used for names of modules, object attributes, variables and even string constants in code objects. In 2.7.9, `ctypes.pythonapi._Py_ReleaseInternedStrings()` reports 3570 interned strings. Never assume any string you get from the interpreter won't be interned, and already hashed. It's fine to peek under the hood to understand how the interpreter works (the source is available and you can download symbol files for use with cdb or windbg), but not to bend the rules based on implementation details. – Eryk Sun Jan 19 '15 at 15:53
  • I don't see why you can't use a ctypes `c_char` array. If you allocate the array in C, you can cast a returned pointer to a ctypes array, without any copying, but I'd prefer to allocate the array in Python to have it reference counted. The array type is fairly capable. It supports the buffer and sequence protocols. But you could use a `bytearray` instead. For example: `string = bytearray(b'abc'); array = (c_char * len(string)).from_buffer(string)`. Those two share the same buffer. – Eryk Sun Jan 19 '15 at 16:06
  • FYI, here are some links to object definitions, which should be more informative than a memory dump: [`PyStringObject`](https://hg.python.org/cpython/file/648dcafa7e5f/Include/stringobject.h#l35), [`PyByteArrayObject`](https://hg.python.org/cpython/file/648dcafa7e5f/Include/bytearrayobject.h#l22), and ctypes [`CDataObject`](https://hg.python.org/cpython/file/648dcafa7e5f/Modules/_ctypes/ctypes.h#l83) (`b_ptr` points at `b_value` for buffers up to 16 bytes, but otherwise the buffer is separately allocated and won't be contiguous with the object). – Eryk Sun Jan 19 '15 at 16:51

1 Answer


There are some computer systems where an arbitrary range of memory can be tagged as read-only at a hardware level, but that is not what is happening in Python. What is happening is that, by definition, Python prevents strings from being changed in place once created.

Yes, it would be perfectly possible, by changing the Python source or providing a new builtin, to write code that allows strings to be mutable in some circumstances. But then you would have real difficulties if you tried to use your mutable strings as dictionary keys, for example. And clearly, given the way strings are stored, changing the length would be tough, if not impossible in most circumstances: to expand a string, for instance, you would need free memory immediately after it.
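The dictionary-key difficulty can be illustrated safely, without touching real string memory. The sketch below uses a hypothetical class, `CachedHashText`, as a stand-in for a string object whose hash is computed once and then cached. It is a model of the problem, not a description of CPython's actual string implementation.

```python
class CachedHashText(object):
    """Hypothetical stand-in for a string: the hash is computed once and cached."""

    def __init__(self, text):
        self.text = text
        self._hash = None

    def __hash__(self):
        if self._hash is None:          # computed on first use, never refreshed
            self._hash = hash(self.text)
        return self._hash

    def __eq__(self, other):
        return isinstance(other, CachedHashText) and self.text == other.text


key = CachedHashText("hello")
table = {key: "value"}                  # hashing happens here and is cached
key.text = "jello"                      # the in-place modification

found_old = CachedHashText("hello") in table   # right hash, but contents no longer equal
found_new = CachedHashText("jello") in table   # equal contents, but a different hash
print(found_old, found_new)
```

Neither lookup succeeds: the entry sits in the slot chosen by the old hash while holding the new contents, so it cannot be reached through either value.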

Bear in mind that even in languages with what one might term direct memory access (C, for instance), strings are only mutable under certain circumstances: you can change particular characters, but you can't arbitrarily extend the length of a C string without either pre-reserving memory for it or changing its identity on each change (and then you have problems if you have more than one reference to it).

Tony Suffolk 66