
It does not always make sense, and I have also learned here not to use inplace=True in pandas. As a result, my code often looks like this:

import pandas as pd

df = pd.read_csv("path_to_file.csv")
df1 = df.drop(columns_to_be_deleted, axis="columns")
df2 = df1.apply(lambda x: my_own_function(x), axis=1)
...
df6 = df5.apply(lambda x: my_other_function(x), axis=1)

This especially leads to problems when I try to insert a new modification later on ("df_1_a", "df_1_b"). One way to prevent this is to give the dataframes more meaningful names such as "df_applied_f1". However, that approach becomes annoying when the long names are used a lot.

Are there any best practices for dealing with this problem?

leodreieck

1 Answer


You can use a pipeline (method chaining). This way there is no need for intermediate variables.

A handy side effect is that you can easily comment out one or more steps if needed (assuming the rest of the code tolerates skipping those steps).

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .apply(lambda x: my_own_function(x), axis=1)
        # ...
        .apply(lambda x: my_other_function(x), axis=1)
     )
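
If the steps grow beyond one-liners, the same chain can also be written with pipe and small named functions. This is only a sketch reusing the names from the question (my_own_function, my_other_function, columns_to_be_deleted) and assuming each step returns a DataFrame; the wrappers apply_f1 and apply_f2 are hypothetical:

import pandas as pd

def apply_f1(d):
    # hypothetical wrapper: apply the first row-wise function
    return d.apply(my_own_function, axis=1)

def apply_f2(d):
    # hypothetical wrapper: apply the second row-wise function
    return d.apply(my_other_function, axis=1)

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .pipe(apply_f1)   # disabling a step means commenting out one line
        # ...
        .pipe(apply_f2)
     )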
mozway
  • Thanks, looks very clean! However, I sometimes want to access a certain version of the df within the pipeline. – leodreieck Jan 25 '22 at 08:27
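
As a sketch for that follow-up (not part of the original answer): one way to keep access to a particular version inside the chain is a small helper passed to pipe that stores a copy under a name. The names checkpoint and checkpoints below are hypothetical:

checkpoints = {}

def checkpoint(d, name):
    # store a copy of the current state and pass the data on unchanged
    checkpoints[name] = d.copy()
    return d

df = (pd.read_csv("path_to_file.csv")
        .drop(columns_to_be_deleted, axis="columns")
        .pipe(checkpoint, "after_drop")
        .apply(lambda x: my_own_function(x), axis=1)
        .pipe(checkpoint, "after_f1")
        .apply(lambda x: my_other_function(x), axis=1)
     )

# e.g. checkpoints["after_drop"] holds the dataframe right after the drop step

This trades some memory for convenience, since each checkpoint keeps a full copy.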