In R, when data munging, I generally do most if not all of my basic munging in one go through piping, e.g.

df_mung = df %>% 
  filter(X > 1) %>% 
  select(X, Y, Z) %>% 
  group_by(X, Y) %>% 
  summarise(`sum` = sum(Z)) 

Which means, in this example, at the end I have two DataFrames:

  1. df (my original DataFrame)
  2. df_mung (my munged DataFrame)

If I were to do this in Python, I would do something like this:

df_filter = df[df['X']>1]
df_select = df_filter[['X', 'Y', 'Z']]
df_sum = df_select.groupby(['X','Y']).sum()

Which leaves me with four DataFrames (double the amount I had in R):

  1. df (my original DataFrame)
  2. df_filter (my filtered DataFrame)
  3. df_select (my selected columns DataFrame)
  4. df_sum (my summed DataFrame)

Now I could copy my DataFrame back on to itself, like this:

df = df[df['X']>1]
df = df[['X', 'Y', 'Z']]
df = df.groupby(['X','Y']).sum()

But given the highly upvoted answer to the post "How to deal with SettingWithCopyWarning in Pandas", this is apparently something I should not be doing.

So my question is, what is the best practice when data munging in Python? Creating a new variable each time I do something, or copying the DataFrame onto itself, or something else?

I am worried that when I do a piece of analysis in Python, I could end up with tens if not hundreds of DataFrame variables, which a) looks messy and b) is confusing to people who take over my code.

Many thanks

Nicholas

2 Answers

I'd just wrap the munging in a function.

  • the intermediate variables are not in any global scope (not messy)
  • the munging function does a single, comprehensible thing (not confusing for people reading your code)
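  • the munging function is testable in isolation (good practice).

def munge_df(df):
    # Keep only the rows where X is greater than 1
    df_filter = df[df['X'] > 1]
    # Keep only the columns of interest
    df_select = df_filter[['X', 'Y', 'Z']]
    # Sum Z within each (X, Y) group
    df_sum = df_select.groupby(['X','Y']).sum()
    return df_sum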

# ...

df_munged = munge_df(df)  # or just `df = ...` if you don't need the original
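
As a rough illustration of the "testable in isolation" point, here is a minimal sketch that runs a munging function like the one above against a tiny DataFrame. The sample data and the extra column W are made up for this example, and the row filter is written as df[df['X'] > 1] so that it returns rows rather than a boolean Series.

```python
import pandas as pd


def munge_df(df):
    """Filter rows, keep three columns, then sum Z per (X, Y) group."""
    df_filter = df[df['X'] > 1]
    df_select = df_filter[['X', 'Y', 'Z']]
    return df_select.groupby(['X', 'Y']).sum()


# Hypothetical sample data, made up purely for illustration.
sample = pd.DataFrame({
    'X': [1, 2, 2, 3],
    'Y': ['a', 'a', 'a', 'b'],
    'Z': [10, 20, 30, 40],
    'W': [0, 0, 0, 0],  # extra column that the function should drop
})

result = munge_df(sample)

# Rows with X <= 1 are gone and Z is summed within each (X, Y) group.
assert result.loc[(2, 'a'), 'Z'] == 50
assert result.loc[(3, 'b'), 'Z'] == 40

# The input DataFrame is left untouched.
assert list(sample.columns) == ['X', 'Y', 'Z', 'W']

# DataFrame.pipe passes the frame to the function, so it also fits in a chain.
assert sample.pipe(munge_df).equals(result)
```

Only `result` (and whatever name you assign it to) exists outside the function; the intermediate frames are discarded when the function returns.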
AKX
  • Hey AKX, that is so helpful. I had never thought about doing it that way and I have been doing this for years. That's a really clean way of doing things and as you said, not confusing at all for people reading my code.... super thanks! :) – Nicholas Sep 20 '21 at 12:28

You could avoid the SettingWithCopyWarning by using loc to filter rows and select columns in one expression. You could also use method chaining, which seems to be what you are doing in the R example.

df.loc[df['X'].gt(1), ['X', 'Y', 'Z']].groupby(['X', 'Y']).sum()
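
For longer chains, one possible layout (a sketch, not the only way to write it) is to wrap the expression in parentheses and put one step per line, which reads much like the R pipe. The sample data below is made up for illustration. The callable passed to loc, as_index=False and the named aggregation are choices made here to mirror the R output, where X and Y stay as columns and the result column is called sum.

```python
import pandas as pd

# Hypothetical sample data, made up purely for illustration.
df = pd.DataFrame({
    'X': [1, 2, 2, 3],
    'Y': ['a', 'a', 'a', 'b'],
    'Z': [10, 20, 30, 40],
})

df_mung = (
    df
    # Filter rows and select columns in one step; the lambda receives the
    # DataFrame at this point in the chain.
    .loc[lambda d: d['X'] > 1, ['X', 'Y', 'Z']]
    # Keep X and Y as ordinary columns instead of moving them to the index.
    .groupby(['X', 'Y'], as_index=False)
    # Named aggregation: output column "sum" is the sum of Z.
    .agg(sum=('Z', 'sum'))
)

print(df_mung)
```

As in the R version, this leaves only two names behind: df and df_mung.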

Anders Källmar
  • Hey Anders, thank you for your response. I do need to look more into piping in Python, or the equivalent formatting and I was aware of .loc, with the SettingWithCopyWarning.. I guess bad example ha! – Nicholas Sep 20 '21 at 12:42
  • You basically can chain all methods as long as you know what they output - a DataFrame, a Series, a GroupBy object etc. I think there is a typo in your example. The filtered DataFrame should probably be df_filter = df[df['X']>1]? Otherwise you will just get a Series of booleans. – Anders Källmar Sep 20 '21 at 12:50
  • Ahh, ye. Thanks. I will make the edit and interesting to know! :) – Nicholas Sep 20 '21 at 13:16
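
To make the point in the comments about return types concrete, here is a small sketch that prints the type produced at each step. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, made up purely for illustration.
df = pd.DataFrame({
    'X': [1, 2, 2, 3],
    'Y': ['a', 'a', 'a', 'b'],
    'Z': [10, 20, 30, 40],
})

mask = df['X'] > 1                       # comparison gives a boolean Series
filtered = df[mask]                      # indexing with the mask gives a DataFrame
grouped = filtered.groupby(['X', 'Y'])   # groupby gives a GroupBy object
summed = grouped.sum()                   # aggregating gives a DataFrame again

steps = [('mask', mask), ('filtered', filtered),
         ('grouped', grouped), ('summed', summed)]
for name, obj in steps:
    print(name, type(obj).__name__)
```

Assigning the mask itself to df_filter is what leaves you with a Series of booleans rather than the filtered rows.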