2

I am implemeting a function in Cython that requires, at some point to remove some char from a C++ std::string. For this, I would use std::string::erase(). However, when I try to use it, Cython forces the object to be bytes() instead of std::string(), at which point it cannot find .erase().

To illustrate the issue, here is a minimal example (using IPython + Cython magic):

%load_ext Cython
%%cython --cplus -c-O3 -c-march=native -a


from libcpp.string cimport string


cdef string my_func(string s):
    cdef char c = b'\0'
    cdef size_t s_size = s.length()
    cdef size_t i = 0
    while i + 1 <= s_size:
        if s[i] == c:
            s.erase(i, 1)
        i += 1
    return s


def cy_func(string b):
    return my_func(b)

This compiles, but it indicates Python interaction on the .remove() line, and when I try to use it, e.g.

b = b'ciao\0pippo\0'
print(b)
cy_func(b)

I get:

AttributeError Traceback (most recent call last) AttributeError: 'bytes' object has no attribute 'erase'

Exception ignored in: '_cython_magic_5beaeb4004c3afc6d85b9b158c654cb6.my_func' AttributeError: 'bytes' object has no attribute 'erase'

How could I solve this?

Notes

  1. If I replace the s.erase(i, 1) with say s[i] == 10, I get my_func() with no Python interaction (can even use the nogil directive).
  2. I know I could this in Python with .replace(b'\0', b''), but it is part of a longer algorithm I hope to optimize with Cython.
norok2
  • 25,683
  • 4
  • 73
  • 99

2 Answers2

2

I don't know why Cython produces code it is producing - there is even no erase in string.pxd, so Cython should be producing an error.

The easiest workaround would be to introduce a function erase which wrapps std::string::erase:

cdef extern from *:
    """
    #include <string>
    std::string &erase(std::string& s, size_t pos, size_t len){
        return s.erase(pos, len);
    }
    """
    string& erase(string& s, size_t pos, size_t len)

# replace  s.erase(i,1) -> erase(s,i,1)

However, it is not how erasing zeros should be done in C++: it is buggy (see @M.S. answer for a fix) and it has O(n^2) running time (just try it on b"\x00"*10**6), the right way is to use remove/erase-idiom:

%%cython --cplus
from libcpp.string cimport string

cdef extern from *:
    """
    #include <string>
    #include <algorithm>
    void remove_nulls(std::string& s){
       s.erase(std::remove(s.begin(), s.end(), 0), s.end());
    }
    """
    void remove_nulls(string& s)


cdef string my_func(string s):
    remove_nulls(s)
    return s

which is hard to misuse and is O(n).


One more remark, concerning passing `std::string' per value. The signature:

cdef string my_func(string s)
     ...
     return s

means, there are two (unnecessary) copies (with RVO being impossible), it might be better to avoid and pass s by reference (at least in cdef-functions):

def cy_func(string b):
    remove_nulls(b)  # no copying
    return b
ead
  • 32,758
  • 6
  • 90
  • 153
  • hi ead, just wondering, what you mean by "RVO being impossible"? Cython compiled code does not offer return optimization? – Willian Fuks May 09 '20 at 01:21
  • 1
    @WillianFuks, when you take a look at the generated cpp-code for `my_fun` you will see, that it doesn't use copy constructor, but default-constructor+assignment operator - but RVO can elide one the copy constructor. Ok, in this particular case where everything gets inlined the compiler will probably be able to optimize everything, but in generel this is not the case. – ead May 09 '20 at 20:25
  • Thanks! I'll adapt my code to just use functions with input references then. Seems to be the most reliable approach. – Willian Fuks May 09 '20 at 23:35
1

You get access after the array bounds. Fix it, and your code will be working.

The length of the string is decreased after erase. Also the condition i < s_size looks better than i + 1 <= s_size. Finally, i must not be incremented after erase, the new char comes to that index.

while i < s_size:
    if s[i] == c:
        s.erase(i, 1)
        s_size -= 1
    else:
        i += 1

b below is the byte array. Try to call .decode to convert it to string.

b = b'ciao\0pippo\0'
print(b)
cy_func(b.decode('ASCII'))
273K
  • 29,503
  • 10
  • 41
  • 64
  • I copy-pasted your code but it gives me the exact same error. – norok2 Jun 29 '19 at 15:49
  • @norok2 I've updated my answer with the additional change of your code. – 273K Jun 29 '19 at 16:04
  • It still does not work. The most important bit is that it does not follow the specified type and goes with `bytes()` instead of C++'s `string` – norok2 Jun 29 '19 at 16:05