[Cython] How to improve the performance of mapping List[str] to ndarray[int]?

Question

The input is a sequence of str, and the output is a sequence of int by looking up a mapping: Dict[str, int].

For example, if the input is ['foo', 'bar', 'baz'], and the mapping is {'foo': 1, 'bar': 2, 'baz': 3}, then the result should be [1, 2, 3].
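As a reference point, the lookup described above can be sketched in pure Python. This is only a minimal baseline: the function name `map_to_ids` is hypothetical, and it returns a plain list rather than an ndarray (wrapping the result in `np.array(...)` would give the array form).

```python
def map_to_ids(arr, mapping):
    """Return the integer id for each string in `arr`, in order.

    `map_to_ids` is a hypothetical name used for illustration only.
    Raises KeyError if a string is missing from `mapping`.
    """
    # One dict lookup per element; this is the same work the Cython loop does.
    return [mapping[s] for s in arr]


mapping = {'foo': 1, 'bar': 2, 'baz': 3}
print(map_to_ids(['foo', 'bar', 'baz'], mapping))  # prints [1, 2, 3]
```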

Below is my own implementation in a Jupyter notebook. It is faster than the pure Python version, but still not fast enough. Is there any way to improve the performance further? Thanks very much!

%%cython -c=-O3
import numpy as np
cimport cython
cimport numpy as np
np.import_array()


# np.int was removed in NumPy 1.24; use an explicit fixed-width dtype instead
INT = np.int64

ctypedef np.int64_t INT_t


@cython.boundscheck(False)
@cython.wraparound(False)
cpdef func(list arr, dict mapping):
    cdef Py_ssize_t n = len(arr)  # Py_ssize_t is the native type for Python lengths and indices
    cdef np.ndarray[INT_t, ndim=1] ret = np.empty(n, dtype=INT)
    cdef Py_ssize_t i
    for i in range(n):
        ret[i] = mapping[arr[i]]
    return ret

It depends a little on what you're prepared to change. You could use C++ standard library containers, but that's only worthwhile if the original data can be stored in them directly. If you need to use a Python list and dict the code you've written is probably as good as you can do. — DavidW, May 12 '22 at 19:55
@DavidW The original data is bytes, and it is converted to a Python list by the json library. Do you mean I should process the bytes in C++? And can I assume that passing bytes from Python to C++ is cheap? — Tienan Liu, May 13 '22 at 07:14
Relatively cheap. It's probably the `dict` that it'd be most useful to replace though if possible. — DavidW, May 13 '22 at 07:47
As an example see https://stackoverflow.com/questions/32266444/using-a-dictionary-in-cython-especially-inside-nogil/32267243#32267243 — DavidW, May 13 '22 at 08:40
