In R, I generally do most, if not all, of my basic data munging in one go through piping, e.g.:
df_mung = df %>%
  filter(X > 1) %>%
  select(X, Y, Z) %>%
  group_by(X, Y) %>%
  summarise(`sum` = sum(Z))
Which means that, in this example, I end up with two DataFrames:
df (my original DataFrame)
df_mung (my munged DataFrame)
If I were to do this in Python, I would do something like this:
df_filter = df[df['X']>1]
df_select = df_filter[['X', 'Y', 'Z']]
df_sum = df_select.groupby(['X','Y']).sum()
Which leaves me with four DataFrames (double the number I had in R):
df (my original DataFrame)
df_filter (my filtered DataFrame)
df_select (my selected-columns DataFrame)
df_sum (my summed DataFrame)
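To make this reproducible, here is a self-contained version of the Python steps. The sample data (including the extra column W) is made up purely for illustration and is not my real dataset:

```python
import pandas as pd

# Made-up sample data, only so the example runs on its own
df = pd.DataFrame({
    "X": [1, 2, 2, 3, 3],
    "Y": ["a", "a", "b", "b", "b"],
    "Z": [10, 20, 30, 40, 50],
    "W": [0, 0, 0, 0, 0],  # extra column that the select step drops
})

# Each step creates a new variable, as described above
df_filter = df[df["X"] > 1]                    # keep rows where X > 1
df_select = df_filter[["X", "Y", "Z"]]         # keep only X, Y, Z
df_sum = df_select.groupby(["X", "Y"]).sum()   # sum Z within each (X, Y) group

print(df_sum)
```

After running this, df, df_filter, df_select and df_sum all exist in the session at the same time.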
Now I could copy my DataFrame back on to itself, like this:
df = df[df['X']>1]
df = df[['X', 'Y', 'Z']]
df = df.groupby(['X','Y']).sum()
But given the highly upvoted response in this post on SettingWithCopyWarning, "How to deal with SettingWithCopyWarning in Pandas", this is apparently something I should not be doing.
So my question is, what is the best practice when data munging in Python? Creating a new variable each time I do something, or copying the DataFrame onto itself, or something else?
I am worried that when I do a piece of analysis in Python, I could end up with tens if not hundreds of DataFrame variables, which a) looks messy and b) is confusing to people who take over my code.
Many thanks