I'm trying to create a python function to perform regex lookup on a dataframe named main_df_name. The lookup can be performed on single or multiple columns - this information will be defined by a list called lookup_cols. The search terms are defined in a dataframe called mapping_df. The first column will always be called Role and the search regex patterns could be contained in one or more columns - this information will be defined by a list called keyword_cols. There will be an AND condition between these columns. If the value in keyword_cols is 'NOT', then we need to negate the regex lookup - basically inverting the boolean mask.
My current function design is:
def create_conditions3(main_df_name, mapping_df, lookup_cols, keyword_cols):
"""
We use np.select to asign priority based lookups (for this, we need to create condition list & label list - this query is taking care of generating condition list)
Args:
main_df_name: The name of the main DataFrame.
mapping_df: The mapping DataFrame.
lookup_cols: The columns to be used for the lookup.
keyword_cols: The columns containing the keywords.
Returns:
A list of strings representing the conditions.
"""
final_str = ""
for row in mapping_df["Role"].unique().tolist():
if len(keyword_cols) == 1:
# default case
command = f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keyword_cols[0]].values[0]}', case=False, regex=True)"
final_str += f",\n{command}"
else:
# Needed for function lookup (Keywords & Keywords2)
command = f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keyword_cols[0]].values[0]}', case=False, regex=True)"
for keycol in keyword_cols[1:]:
if len(mapping_df[mapping_df["Role"] == row][keycol].values[0]) > 0:
# Non blank value for
if keycol != "NOT":
# implemented Keyword2
command = command + " & " + \
f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keycol].values[0]}', case=False, regex=True)"
else:
# implemented Negation
command = command + " & " + \
f"~({main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keycol].values[0]}', case=False, regex=True))"
final_str += f",\n{command}"
final_str = final_str[1:]
return final_str
My usage for this function is:
function_command = "condlist = " + "[" +create_conditions3('sourcing_profiles',functions, ["combined"],['Keywords','NOT']) + "]"
exec(function_command)
sourcing_profiles['Functions'] = np.select(condlist,functions['Role'].unique().tolist(),"Others")
The current execution time is too high. Is there a more pythonic way to to do this?