Is there a way to have custom C functions operate on a pandas DataFrame? I know I can wrap a C function in a Python function and apply it row by row, for instance, but that seems inefficient. I know pandas is written in C. I would love a simple way of telling pandas "use this C function". This is naive, but something like this:

...
cFunc = get_c_function_some_how()

for i in range(1000):
    df = df.use_c_function(cFunc)

use_df(df)
...

My use case is that I run a simple but somewhat computationally expensive set of computations over and over again, and I would like to make that particular set of computations significantly faster.

EDIT: I suppose passing the entire pandas DataFrame to the C function somehow would be fine. Realistically, the iteration should probably happen inside C anyway, so if a Python-wrapped C function is called once and the data is simply handed over to C for computation, that seems like a pretty good solution. I couldn't find documentation on doing something like that.

Warlax56

1 Answer

There is a way to do it, but I wouldn't describe it as "easy."

Internally, pandas uses NumPy to store data. If you can get the data as a one-dimensional NumPy array (a vector), you can pass that to C and have it operate on the vector.

Getting a NumPy vector from a column is easy:

vec = df["foo"].to_numpy()

Next, you need to ensure that the vector is contiguous in memory. You can't assume that it is, because pandas will store data from multiple columns in the same NumPy array if the columns have compatible types.

vec = np.ascontiguousarray(vec) 
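To make the contiguity point concrete, here is a minimal NumPy-only sketch. The 2-D array `block` is a hypothetical stand-in for several same-typed columns sharing one buffer; it is not how pandas necessarily lays out any particular DataFrame. It shows that a single column sliced out of such a buffer is strided, and that `np.ascontiguousarray` then returns a packed copy rather than a view.

```python
import numpy as np

# Hypothetical stand-in for two float columns stored in one shared 2-D buffer
# (3 rows, 2 columns, C order). Not taken from pandas itself.
block = np.arange(6, dtype=np.float64).reshape(3, 2)

# Slicing out the first "column" gives a view that skips every other element.
column = block[:, 0]
is_strided = not column.flags["C_CONTIGUOUS"]

# ascontiguousarray packs the values into a fresh, gap-free buffer.
packed = np.ascontiguousarray(column)
is_copy = not np.shares_memory(column, packed)

print(is_strided, packed.flags["C_CONTIGUOUS"], is_copy)
```

One consequence worth keeping in mind: when a copy is made, anything the C code writes into `packed` does not appear in the original data, so the result has to be assigned back explicitly.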

Then you can pass the NumPy array to C as described in this answer. This works for numerical data. If you want to work with strings, that's more complicated.
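As a rough sketch of the hand-off, the example below uses `ctypes` to give a column's buffer to `qsort` from the C standard library. `qsort` is only a stand-in for your own compiled function, chosen because it ships with libc; the helper name `sort_column_in_c` is hypothetical, and the code assumes a POSIX system where `ctypes` can load libc. The comparator is a Python callback here purely to keep the example self-contained, so this version is not fast; a real C function would do all of its work on the C side.

```python
import ctypes
import ctypes.util

import numpy as np
import pandas as pd

# Assumption: POSIX. find_library may return None, in which case CDLL(None)
# loads the symbols already linked into the running process (including libc).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# C signature: int (*compar)(const void *, const void *), typed as double*.
COMPARE = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(ctypes.c_double), ctypes.POINTER(ctypes.c_double)
)

# C signature: void qsort(void *base, size_t nmemb, size_t size, compar).
libc.qsort.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_size_t, COMPARE]
libc.qsort.restype = None


def _compare(a, b):
    # Returns -1, 0, or 1, as qsort expects.
    return (a[0] > b[0]) - (a[0] < b[0])


def sort_column_in_c(df, column):
    """Hypothetical helper: return a sorted copy of a numeric column."""
    # copy=True so the C code never writes into the DataFrame's own buffer.
    vec = df[column].to_numpy(dtype=np.float64, copy=True)
    vec = np.ascontiguousarray(vec)
    # vec.ctypes.data is the address of the first element of the buffer.
    libc.qsort(vec.ctypes.data, vec.size, vec.itemsize, COMPARE(_compare))
    return vec


df = pd.DataFrame({"foo": [3.0, 1.0, 2.0]})
df["foo_sorted"] = sort_column_in_c(df, "foo")
print(df)
```

For your own library, the same pattern applies: load it with `ctypes.CDLL`, declare `argtypes` and `restype` to match the C prototype, and pass the array's address together with its length.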

I recommend reading Pandas Under The Hood if you go this route. It explains many important things, like why the NumPy arrays are not always contiguous.

Nick ODell