
I'm trying to create a Python function to perform a regex lookup on a DataFrame named main_df_name. The lookup can be performed on one or more columns - these are defined in a list called lookup_cols. The search terms are defined in a DataFrame called mapping_df. Its first column is always called Role, and the regex patterns can be spread across one or more columns - these are defined in a list called keyword_cols, with an AND condition between them. If a column in keyword_cols is named 'NOT', the regex lookup for that column must be negated - i.e. the boolean mask is inverted.
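To make the setup concrete, here is a minimal hypothetical example of the two DataFrames (the column values are made up for illustration; only the structure matters):

```python
import pandas as pd

# Hypothetical main DataFrame: lookup_cols would be ["combined"] here.
main_df = pd.DataFrame({
    "combined": ["senior python developer", "sales manager", "hr recruiter"]
})

# Hypothetical mapping DataFrame: first column is always 'Role'; the
# remaining columns (keyword_cols) hold regex patterns, ANDed together.
# A column named 'NOT' negates its pattern; blank means "no constraint".
mapping_df = pd.DataFrame({
    "Role":     ["Engineering",      "Sales"],
    "Keywords": ["python|developer", "sales"],
    "NOT":      ["recruiter",        ""]
})
```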

My current function design is:

def create_conditions3(main_df_name, mapping_df, lookup_cols, keyword_cols):
  """
  We use np.select to asign priority based lookups (for this, we need to create condition list & label list - this query is taking care of generating condition list)

  Args:
    main_df_name: The name of the main DataFrame.
    mapping_df: The mapping DataFrame.
    lookup_cols: The columns to be used for the lookup.
    keyword_cols: The columns containing the keywords.

  Returns:
    A list of strings representing the conditions.
  """

  final_str = ""
  for row in mapping_df["Role"].unique().tolist():

    if len(keyword_cols) == 1:
      # default case
      command = f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keyword_cols[0]].values[0]}', case=False, regex=True)"
      final_str += f",\n{command}"

    else:
      # Needed for function lookup (Keywords & Keywords2)
      command = f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keyword_cols[0]].values[0]}', case=False, regex=True)"

      for keycol in keyword_cols[1:]:

        if len(mapping_df[mapping_df["Role"] == row][keycol].values[0]) > 0:
          # Non-blank pattern for this column
          if keycol != "NOT":
            # implemented Keyword2
            command = command + " & " + \
                        f"{main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keycol].values[0]}', case=False, regex=True)"
          else:
            # implemented Negation
            command = command + " & " + \
                        f"~({main_df_name}{lookup_cols}.str.contains('{mapping_df[mapping_df['Role'] == row][keycol].values[0]}', case=False, regex=True))"

      final_str += f",\n{command}"

  final_str = final_str.lstrip(",\n")  # drop the leading separator
  return final_str

My usage for this function is:

function_command = "condlist = [" + create_conditions3('sourcing_profiles', functions, ["combined"], ['Keywords', 'NOT']) + "]"
exec(function_command)
sourcing_profiles['Functions'] = np.select(condlist,functions['Role'].unique().tolist(),"Others")

The current execution time is too high. Is there a more Pythonic way to do this?
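For reference, this is the exec-free direction I've been considering: building the list of boolean masks directly instead of generating code strings. This is only a sketch under my assumptions above (single lookup column, blank cells meaning "no constraint") and I haven't benchmarked it:

```python
import numpy as np
import pandas as pd

def build_condlist(main_df, mapping_df, lookup_col, keyword_cols):
    """Build a list of boolean masks, one per Role, without exec.

    Sketch only: assumes a single lookup column and that blank pattern
    cells are empty strings. A 'NOT' column inverts its match.
    """
    series = main_df[lookup_col]
    condlist = []
    # drop_duplicates keeps first occurrences, matching Role.unique() order
    for _, row in mapping_df.drop_duplicates("Role").iterrows():
        mask = pd.Series(True, index=main_df.index)
        for col in keyword_cols:
            pattern = row[col]
            if not pattern:  # blank pattern: no constraint from this column
                continue
            hit = series.str.contains(pattern, case=False, regex=True)
            mask &= ~hit if col == "NOT" else hit
        condlist.append(mask)
    return condlist
```

The result would then feed np.select the same way as before, e.g. `np.select(build_condlist(sourcing_profiles, functions, "combined", ["Keywords", "NOT"]), functions["Role"].unique().tolist(), "Others")`.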

    please provide a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) to demonstrate what you're trying to achieve – mozway May 20 '23 at 19:03
