
How can one read/write pandas DataFrames (Numpy arrays) of strings in Cython?

It works just fine when I work with integers or floats:

# Cython file numpy_.pyx
from cython import boundscheck, wraparound
cimport numpy as np

@boundscheck(False)
@wraparound(False)
cpdef fill(np.int64_t[:,::1] arr):
    arr[0,0] = 10
    arr[0,1] = 11
    arr[1,0] = 13
    arr[1,1] = 14
# Python code
import numpy as np
from numpy_ import fill
a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
print(a)
fill(a)
print(a)

gives

>>> a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
>>> print(a)
[[0 1 2]
 [3 4 5]]
>>> fill(a)
>>> print(a)
[[10 11  2]
 [13 14  5]]

Also, the following code

# Python code
import numpy as np, pandas as pd
from numpy_ import fill
a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
df = pd.DataFrame(a)
print(df)
fill(df.values)
print(df)

gives

>>> a = np.array([[0,1,2],[3,4,5]], dtype=np.int64)
>>> df = pd.DataFrame(a)
>>> print(df)
   0  1  2
0  0  1  2
1  3  4  5
>>> fill(df.values)
>>> print(df)
    0   1  2
0  10  11  2
1  13  14  5

However, I am having a hard time figuring out how to do the same thing when the input is an array of strings. For example, how can I read or modify a NumPy array or a pandas DataFrame such as:

a2 = np.array([['000','111','222'],['333','444','555']], dtype='U3')
df2 = pd.DataFrame(a2)

and, let us say, the goal is to make the following replacements from Cython:

'000' -> 'AAA'; '111' -> 'BBB'; '222' -> 'CCC'; '333' -> 'DDD'

I did read the relevant NumPy and Cython documentation pages, but I still cannot figure out what to do.
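
For reference, the plain Python/NumPy version of the update I would like to reproduce in Cython is just element-wise assignment (nothing below is Cython-specific):

# Python code (no Cython): the update I want to reproduce
import numpy as np, pandas as pd
a2 = np.array([['000','111','222'],['333','444','555']], dtype='U3')
df2 = pd.DataFrame(a2)   # note: the DataFrame columns become dtype object, not '<U3'
a2[0,0] = 'AAA'          # element-wise assignment on the NumPy array
a2[0,1] = 'BBB'
a2[0,2] = 'CCC'
a2[1,0] = 'DDD'
df2.iat[0,0] = 'AAA'     # element-wise assignment on the DataFrame itself

What I cannot work out is how to declare a typed Cython function (analogous to fill above) that performs these assignments.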

Thank you very much for your help!

S.V
  • `pandas` does not use the `numpy` string dtypes. It makes those series object dtype. Look at `df2.dtypes`. – hpaulj Aug 05 '19 at 17:36
  • @hpaulj So, the declaration of a corresponding function should be `cpdef fill_str(np.object_t[:,::1] arr)`? Why does `type(df2.at[0,0])` then give `<class 'str'>` (i.e. not 'object')? – S.V Aug 05 '19 at 17:42
  • `str` is an `object`. A dataframe designed to hold `object` can hold any subclass of `object` including `str` – DavidW Aug 05 '19 at 17:53
  • @DavidW Thank you! If you know what I should read to understand what I need to do to solve my problem, please, let me know! – S.V Aug 05 '19 at 18:03
  • @S.V To be honest this isn't the sort of problem that Cython tends to help with. Any Python code should also work as Cython code so you don't _have_ to type everything, however you may not get much speed-up – DavidW Aug 05 '19 at 18:35
  • @DavidW The problem I am facing is putting into production machine learning models, which were researched in Python. The code needs to be very performant, (type) safe, and preferably interfacable with external C/C++ libraries. My ideas are to use one of the following: Cython, C/C++, or Julia. Cython allows gradually replacing parts of code and keeping the new code close to the original, so I thought it made most sense. C/C++ does not have any reasonable DataFrame implementation, and Julia seems to be not ready yet for production. Am I on a wrong way to my goals? Thank you! – S.V Aug 05 '19 at 19:14
  • 2
    Here's a couple of (maybe) useful links for Numpy arrays of strings https://stackoverflow.com/questions/42543485/cython-specify-numpy-array-of-fixed-length-strings/42544298 https://stackoverflow.com/questions/28774096/cython-memory-view-of-ndarray-of-strings-or-direct-ndarray-indexing. This doesn't necessarily help you with Pandas too much, except that you can force Pandas to have a fixed length string datatype by specifying it in `dtype`. It also doesn't help with Unicode. I don't really have much advice beyond what's in this comment... – DavidW Aug 05 '19 at 19:59
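
To make the fixed-width idea from the last comment concrete, here is a minimal sketch along the lines of the linked answers. It assumes the strings are stored as fixed-length byte strings (dtype 'S3', i.e. ASCII bytes rather than the 'U3' unicode used in the question), and the module and function names (numpy_str_, fill_bytes) are made up for illustration:

# Cython file numpy_str_.pyx (hypothetical name)
# The 'S3' array is passed in reinterpreted as a uint8 array of shape
# (rows, cols, 3), so arr[i,j,k] is the k-th byte of the string at [i,j].
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
cpdef fill_bytes(unsigned char[:,:,::1] arr):
    cdef Py_ssize_t k
    for k in range(arr.shape[2]):
        arr[0,0,k] = ord('A')
        arr[0,1,k] = ord('B')
        arr[0,2,k] = ord('C')
        arr[1,0,k] = ord('D')
# Python code
import numpy as np
from numpy_str_ import fill_bytes
a2 = np.array([[b'000',b'111',b'222'],[b'333',b'444',b'555']], dtype='S3')
# reinterpret the fixed-width bytes as a (2, 3, 3) uint8 view of the same memory
fill_bytes(a2.view(np.uint8).reshape(a2.shape + (3,)))
print(a2)   # [[b'AAA' b'BBB' b'CCC'] [b'DDD' b'444' b'555']]

This sketch does not cover unicode ('U3') data or object-dtype DataFrame columns; for those, an untyped code path operating on Python objects (as DavidW suggests) appears to be the practical option.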

0 Answers