
My goal:

I have a data structure in C++ which holds strings (or, more accurately, a multi-dimensional char array). I wish to expose this structure to Python via NumPy and pandas. Eventually the goal is to let the user modify a dataframe in a way that actually modifies the underlying C++ data structure.

What I've accomplished so far:

I've wrapped the C++ data structure in a 2D NumPy array (via the PyArray_New API call) and returned it to Python. Then, inside Python, I use the pandas.DataFrame(data=ndarray, columns=columns, copy=False) constructor to wrap the ndarray in a pandas dataframe without copying any data.

I've also managed to modify a single column. For example, I've managed to turn strings into lower case in the following way:

tmp = df["Some_field"].str.decode('ascii').str.lower().str.encode('ascii')
df["Some_field"][:] = tmp

The problem:

I'm now trying to convert multiple columns to lower case. I thought it would be straightforward, but I've been struggling with this for a while because the manipulations do not change the underlying numpy arrays.

What I've tried to solve the problem:

1.

fields_to_change = [...]
for field in fields_to_change:
    # Chained indexing: df[field] may be a copy rather than a view
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df[field][:] = tmp

This yields a SettingWithCopyWarning, and the underlying structure is changed only for the first field in fields_to_change.

2.

fields_to_change = [...]
for field in fields_to_change:
    # Assign the lower-cased column back through .loc
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df.loc[:, field] = tmp[:]

This runs without errors or warnings, but again the underlying data is not changed.

3.

fields_to_change = [...]
for field in fields_to_change:
    # Copy the result element by element into the column's existing array
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    np.copyto(dst=df[field].values, src=tmp.values, casting='unsafe')

This works and changes the underlying data, but the code is problematic in another respect. The whole point is to expose pandas functionality so that it transparently modifies the underlying data. I could copy all values from the user's manipulated dataframe back into the arrays that hold the underlying data, but that would severely slow down my program.
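
To make the distinction behind attempt 3 concrete, here is a minimal NumPy-only sketch. It assumes a small fixed-width bytes array (`buffer`, a hypothetical stand-in for the array that wraps my C++ memory) and leaves pandas out entirely, so it only illustrates the idea of writing through a view into existing memory, not how pandas decides between views and copies.

```python
import numpy as np

# Hypothetical stand-in for the C++-backed buffer: a 2D fixed-width bytes array.
buffer = np.array([[b"ABC", b"Def"], [b"GHI", b"jkl"]], dtype="S3")

# Basic slicing returns a view on the first column; no data is copied.
column = buffer[:, 0]

# np.char.lower returns a new array; buffer is still unchanged at this point.
lowered = np.char.lower(column)

# Writing through the view stores the new values in the original memory.
np.copyto(dst=column, src=lowered, casting="unsafe")

shares = np.shares_memory(column, buffer)  # the view still aliases buffer
first_column = buffer[:, 0].tolist()       # modified in place
second_column = buffer[:, 1].tolist()      # untouched
```

Rebinding a name, for example `column = lowered`, would only point `column` at the new array and leave `buffer` as it was, which appears to be the kind of silent replacement I want to prevent.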

TL;DR, my question is:

How can I use pandas to manipulate strings in certain columns without changing the underlying numpy arrays from which the dataframe was composed? Also, is there a way to make sure that the user cannot change underlying numpy arrays?

Thanks very much in advance.

MaBekitsur
  • Judging from the explanation in [this answer](https://stackoverflow.com/questions/45943160/can-memmap-pandas-series-what-about-a-dataframe), I would guess it is not possible. – Nils Werner Nov 09 '20 at 16:11
  • Sometimes you seem to want to change the data in the underlying arrays, and sometimes you don't. For example, you say 3 works perfectly and changes underlying data, but then at the end you say you want not to change the underlying data, and even want to prevent it from being changed. Could you clarify your aim? – senderle Nov 09 '20 at 18:18
  • @senderle I do want to change the underlying data, not the arrays themselves. The numpy arrays wrap my C++ buffers, so I want to allow the user to change the data inside those buffers, but not allow to use different arrays that will not save the data in my buffers. Basically I want to allow editing the numpy arrays, but not replace them with other arrays. – MaBekitsur Nov 10 '20 at 09:48
