5

I encountered a strange behavior with Pandas .groupby() with .transform(). Here is the code to generate the dataset:

df = pd.DataFrame({"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
                   "Random_Number": [1223344, 373293832, 32738382392, 7273283232, 8239329, 23938832],
                   "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

This is the function I wrote for the transform().

# this function will attach each value in string col with the number of elements in the each city group
# if the col type is not an object, then return 0 for all rows. 

def some(x): 
    if x.dtype == 'object':
        return x + '--' + str(len(x))
    else:
        return 0

Then I used my function with transform - works flawlessly and get what I want.

df_2 = stack.groupby(["City"])['Name','Random_Number'].transform(some)

enter image description here

HOWEVER, the strange thing happens when I switch the order of the col from ['Name','Random_Number'] to ['Random_Number','Name']

df_2 = stack.groupby(["City"])['Random_Number','Name'].transform(some)

enter image description here

When you look at cells in the 'Name' column it seems like pandas puts everything into one cell multiple times:

df_2.iloc[0,1]
# Return: 
# 0      Alice--4
# 1        Bob--4
# 3    Mallory--4
# 4        Bob--4
# Name: Name, dtype: object

Why is this happening?

ALollz
  • 57,915
  • 7
  • 66
  • 89
vae
  • 132
  • 6
  • That's not how you change column order. For example `df = df[df.columns[new_order]]` from answer to question https://stackoverflow.com/q/13148429/5660315 will do this. – VirtualScooter Mar 04 '21 at 17:48
  • Does this answer your question? [How to change the order of DataFrame columns?](https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns) – VirtualScooter Mar 04 '21 at 17:48
  • Thank you so much for your comment, it definitely helps! – vae Mar 04 '21 at 19:11

1 Answers1

2

The issue is with your return.

if x.dtype == 'object' then you return a Series so your transform aggregation does not reduce (the return is the same length as the original). If it takes the other path the return is the single scalar 0, which pandas sees as a reduction (return is a single value per group).

Because your aggregation differs in reduction whatever internals pandas uses to figure out the path to take and how to bring that back to the original DataFrame gets confused based on your column ordering. When 'Random_Number' is first, it checks the function, sees that the function reduces and takes one path, but if 'Name' comes first it checks, sees the function doesn't reduce and takes another path to compute.

You can fix this by ensuring that both returns do not reduce

def some(x): 
    if x.dtype == 'object':
        return x + '--' + str(len(x))
    else:
        return [0]*len(x)

df.groupby('City')[['Random_Number','Name']].transform(some)
#   Random_Number        Name
#0              0    Alice--4
#1              0      Bob--4
#2              0  Mallory--2
#3              0  Mallory--4
#4              0      Bob--4
#5              0  Mallory--2

df.groupby('City')[['Name', 'Random_Number']].transform(some)
#         Name  Random_Number
#0    Alice--4              0
#1      Bob--4              0
#2  Mallory--2              0
#3  Mallory--4              0
#4      Bob--4              0
#5  Mallory--2              0
ALollz
  • 57,915
  • 7
  • 66
  • 89