I have two large DataFrames: df_a has 6.5M rows and df_b has 1.2M rows, and I am trying to find a match/link between the two. df_a has a string reference that I am trying to find a partial match for in df_b, and return that string.
I can do this with the following code:
import pandas as pd

df_a = pd.DataFrame({'partial_string': ['abcd', 'efgh', 'ijkl', 'mnop', 'qrst']})
df_b = pd.DataFrame({'combined_string': ['abcd+efgh+1234', 'abcd+efgh+1234', 'ijkl+1234', 'mnop+1234', 'qrst+1234']})

def find_ref(ref_string):
    ref = 'None'
    df_find = df_b[df_b['combined_string'].str.contains(ref_string)]
    if df_find.size != 0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            ref = 'Array'
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a['partial_string'].apply(find_ref)
df_a
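On the toy frames above this gives the kind of result I am after:

  partial_string       reference
0           abcd  abcd+efgh+1234
1           efgh  abcd+efgh+1234
2           ijkl       ijkl+1234
3           mnop       mnop+1234
4           qrst       qrst+1234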
This approach is workable on small DataFrames, but it becomes unmanageable on large DataFrames because of the way Pandas works. I have found this reference on improving efficiency with Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#enhancingperf
Reading that document, I believe my best approach is to convert my Python code to Cython, but that is exactly what I don't know how to do, and I am getting stuck with compilation errors.
This is the code that I have been working on:
%%cython
cimport numpy as np
import numpy as np

cdef str find_ref(str ref_string):
    cdef str ref = 'None'
    # Getting stuck with this part
    cdef np.ndarray[str] df_find = df_b[df_b['combined_string'].str.contains(ref_string)].to_numpy()
    if df_find.size != 0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            # Should not have more than two values in the array (not sure yet how to handle it if it does)
            ref = 'Array'
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a.apply(lambda x: find_ref(x['partial_string']), axis=1)
df_a
I hope someone can help me convert it to Cython, or point me to other solutions I am not aware of. Thanks.
Edit: To hopefully clarify my question better:
Pandas' apply() function (and others) run as a single-threaded, single-core process, which becomes a time-consuming roadblock on large DataFrames. I think (and could be wrong) that converting the Python function into a Cython function will improve processing efficiency and open up multithreaded/multi-core computation.
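To illustrate what I mean by spreading the work over multiple cores, this is the kind of thing I have in mind: just a rough sketch that wraps the pure-Python find_ref above in multiprocessing.Pool. The chunk/worker count of 8 is an arbitrary assumption, and it relies on df_b and find_ref being available to the worker processes (e.g. fork on Linux):

import numpy as np
from multiprocessing import Pool

def find_ref_chunk(partial_strings):
    # Run the pure-Python find_ref over one chunk of df_a's partial strings
    return [find_ref(s) for s in partial_strings]

if __name__ == '__main__':
    # Split df_a into 8 chunks and look them up in parallel (8 is arbitrary)
    chunks = np.array_split(df_a['partial_string'].to_numpy(), 8)
    with Pool(processes=8) as pool:
        results = pool.map(find_ref_chunk, chunks)
    df_a['reference'] = np.concatenate(results)

Whether something like that is the right direction, or whether the Cython route above is better, is exactly what I am unsure about.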