0

I have 2 large DataFrames. df_a 6.5M rows, df_b 1.2M row, and I am trying to find a match/link between the two. df_a has a string reference that I am trying to find a partial match for in df_b, and return that string.

I can do this with the following code:

df_a = pd.DataFrame({'partial_string': ['abcd','efgh', 'ijkl', 'mnop', 'qrst']})

df_b = pd.DataFrame({'combined_string':['abcd+efgh+1234','abcd+efgh+1234', 'ijkl+1234', 'mnop+1234', 'qrst+1234']})

def find_ref(ref_string):
    ref = 'None'
    df_find = df_b[df_b['combined_string'].str.contains(ref_string)]
    if df_find.size !=0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            ref = 'Array'
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a['partial_string'].apply(find_ref)
df_a

This approach is workable on small DataFrames but becomes unmanageable with large DataFrames due to the way Pandas work. I have found this reference to improving efficiency with Pandas : https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#enhancingperf

Reading this document I believe my best approach is to convert my Python code to Cython, but that is what I don't know how to do, and I am getting stuck with compilation errors.

This is the code that I have been working on:

%%cython
cimport numpy as np
import numpy as np



cdef str find_ref(str ref_string):
    cdef str ref 'None'
    # Getting stuck with this part
    #(cdef np.ndarray[str] df_find = df_b[df_shopify['combined_string'].str.contains(ref_string)].to_numpy())
    cdef np.ndarray[str] df_find = df_b[df_b['combined_string'].str.contains(ref_string)].to_numpy()
    if df_find.size !=0:
        ref = list(df_find['combined_string'].unique())
        if len(ref) == 1:
            ref = ref[0]
        else:
            # Should not have more than 2 values in Array (Not sure yet how to handle if it does)
            ref='Array' 
    return ref

df_a['reference'] = 'None'
df_a['reference'] = df_a.apply(lambda x: find_ref(x['partial_string']), axis=1)
df_a

I hope someone can help converting it to Cython, or lead me to other solutions I am not aware off. Thanks.

Edit: To hopefully better clarify my question;

Pandas apply() function (and others) are using a single threaded/core process which becomes a time consuming road block in large DataFrames. I think (and could be wrong) that converting a Python function into a Cython function that it will improve processing efficiency and opens up multithreaded computations.

IT Jeroen
  • 13
  • 3
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Apr 03 '23 at 03:09

1 Answers1

1

@jqurious

Your merge suggestion has got me thinking and I managed to use it in an effective way.

df_a = pd.DataFrame({'partial_string': ['abcd','efgh', 'ijkl', 'mnop', 'qrst']})
df_b = pd.DataFrame({'combined_string':['abcd+efgh+1234','abcd+efgh+1234', 'ijkl+1234', 'mnop+1234', 'qrst+1234']})

df_a['df_a_index'] = df_a.index

split_df_b = df_b['combined_string'].str.split('+').apply(pd.Series).stack()
split_df_b.index = split_df_b.index.droplevel(-1)
split_df_b = pd.DataFrame({'df_b_index': split_df_b.index, 'partial_string':split_df_b.values})

df_b['df_b_index'] = df_b.index
split_df_b = split_df_b.merge(df_b, how='left', on='df_b_index')

df_c = split_df_b.merge(df_a, how='left', on='partial_string')
df_c

Although this has improved efficiency by breaking up steps and keeping it within vectorized data, it is less elegant and more confusing than my initial approach.

This is a solution but not an answer to my initial question, and perhaps my initial question isn't very clear.

IT Jeroen
  • 13
  • 3
  • Oh right, `combined_string` is actually partial strings delimited by `+` - I did not realize that. You should be able to use `.explode()` for `split_df_b` in that case: `df_b.assign(partial_string=df_b['combined_string'].str.split('+')).explode('partial_string')` - that should be more efficient. – jqurious Apr 04 '23 at 04:33