
It does not always make sense, and I have also learned here not to use inplace=True in pandas. As a result, my code often looks like this:

import pandas as pd

df = pd.read_csv("path_to_file.csv")
df1 = df.drop(columns_to_be_deleted, axis="columns")
df2 = df1.apply(lambda x: my_own_function(x), axis=1)
...
df6 = df5.apply(lambda x: my_other_function(x), axis=1)

This especially leads to problems when I try to insert a new modification later on ("df_1_a", "df_1_b"). One way to prevent this is to give the dataframes more meaningful names such as "df_applied_f1". However, that approach becomes annoying when the long names are used a lot.

Are there any best practices for dealing with this problem?

leodreieck

1 Answer


You can use a pipeline (method chaining). This way there is no need for intermediate variables.

A handy side effect is that you can easily comment out one or more steps if needed (assuming the rest of the code tolerates skipping those steps).

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .apply(lambda x: my_own_function(x), axis=1)
        # ...
        .apply(lambda x: my_other_function(x), axis=1)
     )
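
If the steps grow beyond one-liners, the same chain can also be written with pipe and small named functions. This is only a sketch reusing the names from the question (my_own_function, my_other_function, columns_to_be_deleted) and assuming each step returns a DataFrame; the wrappers apply_f1 and apply_f2 are hypothetical:

import pandas as pd

def apply_f1(d):
    # hypothetical wrapper: apply the first row-wise function
    return d.apply(my_own_function, axis=1)

def apply_f2(d):
    # hypothetical wrapper: apply the second row-wise function
    return d.apply(my_other_function, axis=1)

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .pipe(apply_f1)   # disabling a step means commenting out one line
        # ...
        .pipe(apply_f2)
     )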
mozway
  • Thanks, looks very clean! However, I sometimes want to access a certain version of the df within the pipeline. – leodreieck Jan 25 '22 at 08:27
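
As a sketch for that follow-up (not part of the original answer): one way to keep access to a particular version inside the chain is a small helper passed to pipe that stores a copy under a name. The names checkpoint and checkpoints below are hypothetical:

checkpoints = {}

def checkpoint(d, name):
    # store a copy of the current state and pass the data on unchanged
    checkpoints[name] = d.copy()
    return d

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .pipe(checkpoint, "after_drop")
        .apply(lambda x: my_own_function(x), axis=1)
        .pipe(checkpoint, "after_f1")
        .apply(lambda x: my_other_function(x), axis=1)
     )

# e.g. checkpoints["after_drop"] holds the dataframe right after the drop step

This trades some memory for convenience, since each checkpoint keeps a full copy.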