Strange Behavior With Pandas Group By - Transform On String Columns

Question

I encountered a strange behavior with Pandas .groupby() with .transform(). Here is the code to generate the dataset:

df = pd.DataFrame({"Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] ,
                   "Random_Number": [1223344, 373293832, 32738382392, 7273283232, 8239329, 23938832],
                   "City" : ["Seattle", "Seattle", "Portland", "Seattle", "Seattle", "Portland"]})

This is the function I wrote for the transform().

# this function will attach each value in string col with the number of elements in the each city group
# if the col type is not an object, then return 0 for all rows. 

def some(x): 
    if x.dtype == 'object':
        return x + '--' + str(len(x))
    else:
        return 0

Then I used my function with transform - works flawlessly and get what I want.

df_2 = stack.groupby(["City"])['Name','Random_Number'].transform(some)

HOWEVER, the strange thing happens when I switch the order of the col from ['Name','Random_Number'] to ['Random_Number','Name']

df_2 = stack.groupby(["City"])['Random_Number','Name'].transform(some)

When you look at cells in the 'Name' column it seems like pandas puts everything into one cell multiple times:

df_2.iloc[0,1]
# Return: 
# 0      Alice--4
# 1        Bob--4
# 3    Mallory--4
# 4        Bob--4
# Name: Name, dtype: object

Why is this happening?

That's not how you change column order. For example `df = df[df.columns[new_order]]` from answer to question https://stackoverflow.com/q/13148429/5660315 will do this. — VirtualScooter, Mar 04 '21 at 17:48
Does this answer your question? [How to change the order of DataFrame columns?](https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns) — VirtualScooter, Mar 04 '21 at 17:48

ALollz · Accepted Answer · 2021-03-04T19:41:02.043

The issue is with your return.

if x.dtype == 'object' then you return a Series so your transform aggregation does not reduce (the return is the same length as the original). If it takes the other path the return is the single scalar 0, which pandas sees as a reduction (return is a single value per group).

Because your aggregation differs in reduction whatever internals pandas uses to figure out the path to take and how to bring that back to the original DataFrame gets confused based on your column ordering. When 'Random_Number' is first, it checks the function, sees that the function reduces and takes one path, but if 'Name' comes first it checks, sees the function doesn't reduce and takes another path to compute.

You can fix this by ensuring that both returns do not reduce

def some(x): 
    if x.dtype == 'object':
        return x + '--' + str(len(x))
    else:
        return [0]*len(x)

df.groupby('City')[['Random_Number','Name']].transform(some)
#   Random_Number        Name
#0              0    Alice--4
#1              0      Bob--4
#2              0  Mallory--2
#3              0  Mallory--4
#4              0      Bob--4
#5              0  Mallory--2

df.groupby('City')[['Name', 'Random_Number']].transform(some)
#         Name  Random_Number
#0    Alice--4              0
#1      Bob--4              0
#2  Mallory--2              0
#3  Mallory--4              0
#4      Bob--4              0
#5  Mallory--2              0

Wow, that is the reason! thank you so much! Now I got it! – vae Mar 04 '21 at 19:11 — vae, Mar 04 '21 at 19:11

Strange Behavior With Pandas Group By - Transform On String Columns

1 Answers1