My goal:
I have a data structure in C++ which holds strings (or, more accurately, a multi-dimensional char array). I wish to expose this structure to Python via Numpy and Pandas. The eventual goal is to let the user modify a dataframe and have those modifications applied directly to the underlying C++ data structure.
What I've accomplished so far:
I've wrapped the C++ data structure in a 2D numpy array (via the PyArray_New API call) and returned it to Python. Then, inside Python, I'm using the pandas.DataFrame(data=ndarray, columns=columns, copy=False) constructor to wrap the ndarray in a pandas dataframe without copying any data.
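One way to check whether an object is still a view onto the original buffer, rather than a copy, is np.shares_memory. The following is a minimal numpy-only sketch: `backing` is a hypothetical stand-in for the C++-owned buffer, and the fixed-width bytes dtype is an assumption about how the char array is exposed.

```python
import numpy as np

# Hypothetical stand-in for the buffer that C++ owns (assumed dtype: fixed-width bytes).
backing = np.array([[b"ABC", b"DEF"],
                    [b"GHI", b"JKL"]], dtype="S3")

view = backing[:, 0]          # basic slicing returns a view onto the same memory
copy = backing[:, 0].copy()   # an explicit copy lives in separate memory

view_shares = np.shares_memory(view, backing)
copy_shares = np.shares_memory(copy, backing)
print(view_shares, copy_shares)
```

The same check can be applied to the array obtained from a dataframe column, although whether that array shares memory with the original depends on the pandas version and on the dtype.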
I've also managed to modify a single column in place. For example, I've managed to convert its strings to lower case in the following way:
tmp = df["Some_field"].str.decode('ascii').str.lower().str.encode('ascii')
df["Some_field"][:] = tmp
The problem:
I'm now trying to convert multiple columns to lower case. I thought it would be straightforward, but I've been struggling with this for a while because the manipulations do not change the underlying numpy arrays.
What I've tried to solve the problem:
1.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df[field][:] = tmp
This yields a SettingWithCopyWarning, and the underlying structure is changed only for the first field in "fields_to_change".
2.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df.loc[:, field] = tmp[:]
This runs without errors or warnings but, again, the underlying data is not changed.
3.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    np.copyto(dst=df[field].values, src=tmp.values, casting='unsafe')
This works perfectly and changes the underlying data, but the code is problematic for a different reason. The whole point is to expose pandas functionality that transparently modifies the underlying data. I could copy all values from the user's manipulated dataframe back into the arrays which hold the underlying data, but that would severely slow down my program.
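To make the in-place behaviour that approach 3 relies on concrete, here is a minimal numpy-only sketch with no pandas involved. The function name `lower_columns_inplace`, the array `backing`, and the fixed-width bytes dtype are all assumptions made for illustration, not part of my actual code.

```python
import numpy as np

def lower_columns_inplace(backing, column_indices):
    """Lower-case the selected columns of a 2D bytes array in place."""
    for j in column_indices:
        col = backing[:, j]              # basic slicing: a view, not a copy
        col[...] = np.char.lower(col)    # write the result through the view

# Hypothetical stand-in for the C++-owned buffer (assumed dtype: fixed-width bytes).
backing = np.array([[b"ABC", b"Def", b"GHI"],
                    [b"JKL", b"mNo", b"PQR"]], dtype="S3")

lower_columns_inplace(backing, [0, 2])
print(backing)
```

Only columns 0 and 2 are rewritten, and no new array is bound in place of `backing`: the writes go into the existing buffer, which is the property I want the pandas operations to preserve.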
TL;DR, my question is:
How can I use pandas to manipulate strings in certain columns without replacing the underlying numpy arrays from which the dataframe was composed, so that the changes are written into those arrays in place? Also, is there a way to make sure that the user cannot replace the underlying numpy arrays?
Thanks very much in advance.