I have some functions in Python that share the same structure:

  • Load data from a path
  • Do some processing with pandas
  • Save the results in a CSV file

A couple of examples:


import pandas as pd


def generate_report_1(eval_path, output_path):
   df = pd.read_csv(eval_path)
   misclassified_samples = df[df["miss"] == True]
   misclassified_samples.to_csv(output_path)


def generate_report_2(eval_path, output_path):
   df = pd.read_csv(eval_path)
   
   dict_df = df.to_dict()
   
   final_results = {}
   for name, metric in dict_df.items():
      pass  # ... do some processing

   pd.DataFrame(final_results).to_csv(output_path)
   

In Ruby, a method can use yield to hand control to a caller-supplied block and then resume where it left off. What is a good practice for achieving the same thing in Python, so that the load and save steps are not repeated in every function?

Thanks.

heresthebuzz
  • Could you maybe provide examples of two of the functions that you're trying to DRY? Based on your description it's not clear what the differences are between your functions (i.e. what's the part of the code that *isn't* repeated?), but off the top of my head I might suggest using a context manager, which I think is the equivalent of what you're talking about with Ruby. – Samwise May 14 '21 at 14:40
  • Sure, I will add some examples. – heresthebuzz May 14 '21 at 14:42
  • Separate the code you repeat into a different function? – lllrnr101 May 14 '21 at 14:47
  • In case you didn't know, [yield](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do) exists in python too. – Confused Learner May 14 '21 at 15:03
  • It looks like the only part of these two functions that's the same is the `read_csv` part. I'm not sure if there's any value in wrapping that single line in another function. – Samwise May 14 '21 at 15:49
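
For reference, the context-manager idea raised in the comments could look roughly like the sketch below. It is only an illustration under assumptions: the `Report` holder class, the `report` function name, and the sample data are hypothetical and not part of the question. The `with` body plays the role of the Ruby block, and the code around `yield` is the shared load and save logic.

```python
import os
import tempfile
from contextlib import contextmanager

import pandas as pd


class Report:
    """Hypothetical holder: the caller reads .df and assigns .result."""

    def __init__(self, df):
        self.df = df
        self.result = None


@contextmanager
def report(eval_path, output_path):
    rep = Report(pd.read_csv(eval_path))  # shared setup: load the data
    yield rep                             # the body of the `with` block runs here
    # Shared teardown: reached only if the body finished without raising.
    if rep.result is not None:
        rep.result.to_csv(output_path, index=False)


def run_demo():
    """Build a tiny made-up CSV, run one report, return what was saved."""
    with tempfile.TemporaryDirectory() as tmp:
        eval_path = os.path.join(tmp, "eval.csv")
        output_path = os.path.join(tmp, "report.csv")
        pd.DataFrame(
            {"sample": ["a", "b"], "miss": [False, True]}
        ).to_csv(eval_path, index=False)

        with report(eval_path, output_path) as rep:
            rep.result = rep.df[rep.df["miss"]]

        return pd.read_csv(output_path)


saved = run_demo()
print(saved)
```

Each report then becomes a short `with` block, at the cost of an extra holder object for passing the result back out.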

1 Answer

No special construct is needed; plain Python functions are enough.
The only trick is to pass the processing function as a parameter to your report function, like this:

import pandas as pd

def generate_report(eval_path, processfunc, output_path):
   # Shared skeleton: load, delegate the processing step, save.
   df = pd.read_csv(eval_path)
   result = processfunc(df)
   result.to_csv(output_path)

def process_1(df):
   return df[df["miss"] == True]

def process_2(df):
   dict_df = df.to_dict()
   final_results = {}
   for name, metric in dict_df.items():
      pass  # ... do some processing
   return pd.DataFrame(final_results)

# and then:  
# generate_report(my_eval_path, process_1, my_output_path)
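
To try the pattern end to end, here is a minimal self-contained sketch. The sample data, the temporary file names, and the `keep_misclassified` and `run_demo` helpers are made up for illustration, and `index=False` is an assumption added so that the saved file reads back without an extra index column.

```python
import os
import tempfile

import pandas as pd


def generate_report(eval_path, processfunc, output_path):
    """Load a CSV, apply processfunc to the DataFrame, save the result."""
    df = pd.read_csv(eval_path)
    result = processfunc(df)
    result.to_csv(output_path, index=False)


def keep_misclassified(df):
    # Boolean mask: keep only the rows where the "miss" column is True.
    return df[df["miss"]]


def run_demo():
    """Write a tiny made-up CSV, run the report, return the saved rows."""
    with tempfile.TemporaryDirectory() as tmp:
        eval_path = os.path.join(tmp, "eval.csv")
        output_path = os.path.join(tmp, "report.csv")
        pd.DataFrame(
            {"sample": ["a", "b", "c"], "miss": [True, False, True]}
        ).to_csv(eval_path, index=False)

        generate_report(eval_path, keep_misclassified, output_path)
        return pd.read_csv(output_path)


demo_result = run_demo()
print(demo_result)
```

Any function that takes a DataFrame and returns a DataFrame can be passed as `processfunc`, including a `lambda` for one-off filters.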
Lutz Prechelt