I have two large DataFrames: df_a has 6.5M rows and df_b has 1.2M rows, and I am trying to find a match/link between the two. df_a has a string reference that I am trying to find a partial match for in df_b, and return that string.
I can do this with the following code:
import pandas as pd

df_a = pd.DataFrame({'partial_string': ['abcd', 'efgh', 'ijkl', 'mnop', 'qrst']})
df_b = pd.DataFrame({'combined_string': ['abcd+efgh+1234', 'abcd+efgh+1234', 'ijkl+1234', 'mnop+1234', 'qrst+1234']})

def find_ref(ref_string):
    ref = 'None'
    df_find = df_b[df_b['combined_string'].str.contains(ref_string)]
    if df_find.size != 0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            ref = 'Array'
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a['partial_string'].apply(find_ref)
df_a
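On the toy frames above this gives the kind of result I am after:

  partial_string       reference
0           abcd  abcd+efgh+1234
1           efgh  abcd+efgh+1234
2           ijkl       ijkl+1234
3           mnop       mnop+1234
4           qrst       qrst+1234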
This approach is workable on small DataFrames, but it becomes unmanageable on large DataFrames because of the way Pandas works. I have found this reference on improving efficiency with Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#enhancingperf
Reading that document, I believe my best approach is to convert my Python code to Cython, but that is exactly what I don't know how to do, and I am getting stuck with compilation errors.
This is the code that I have been working on:
%%cython
cimport numpy as np
import numpy as np

cdef str find_ref(str ref_string):
    cdef str ref = 'None'
    # Getting stuck with this part
    cdef np.ndarray[str] df_find = df_b[df_b['combined_string'].str.contains(ref_string)].to_numpy()
    if df_find.size != 0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            # Should not have more than two values in the array (not sure yet how to handle it if it does)
            ref = 'Array'
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a.apply(lambda x: find_ref(x['partial_string']), axis=1)
df_a
I hope someone can help me convert it to Cython, or point me to other solutions I am not aware of. Thanks.
Edit: To hopefully clarify my question better:
Pandas' apply() function (and others) run as a single-threaded, single-core process, which becomes a time-consuming roadblock on large DataFrames. I think (and could be wrong) that converting the Python function into a Cython function will improve processing efficiency and open up multithreaded/multi-core computation.
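To illustrate what I mean by spreading the work over multiple cores, this is the kind of thing I have in mind: just a rough sketch that wraps the pure-Python find_ref above in multiprocessing.Pool. The chunk/worker count of 8 is an arbitrary assumption, and it relies on df_b and find_ref being available to the worker processes (e.g. fork on Linux):

import numpy as np
from multiprocessing import Pool

def find_ref_chunk(partial_strings):
    # Run the pure-Python find_ref over one chunk of df_a's partial strings
    return [find_ref(s) for s in partial_strings]

if __name__ == '__main__':
    # Split df_a into 8 chunks and look them up in parallel (8 is arbitrary)
    chunks = np.array_split(df_a['partial_string'].to_numpy(), 8)
    with Pool(processes=8) as pool:
        results = pool.map(find_ref_chunk, chunks)
    df_a['reference'] = np.concatenate(results)

Whether something like that is the right direction, or whether the Cython route above is better, is exactly what I am unsure about.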