1

Let's say that I have a data frame of three columns: age, gender, and country.

I want to randomly shuffle this data but in an ordered fashion according to gender. There are n males and m females, where n could be less than, greater than, or equal to m. The shuffling should happen in such a way that we get the following results for a size of 8 people:

male, female, male, female, male, female, female, female,.... (if there are more females: m > n) male, female, male, female, male, male, male, male (if there are more males: n > m) male, female, male, female, male, female, male, female, male, female (if equal males and females: n = m)

df = pd.DataFrame({'Age': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Gender': ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female"], 
'Country': ["US", "UK", "China", "Canada", "US", "UK", "China", "Brazil"]})
Osaama Shehzad
  • 147
  • 1
  • 2
  • 12

2 Answers2

2

First add the sequence numbers within each group:

df['Order'] = df.groupby('Gender').cumcount()

Then sort:

df.sort_values('Order')

It gives you:

   Age  Gender Country  Order
0   10    Male      US      0
3   40  Female  Canada      0
1   20    Male      UK      1
4   50  Female      US      1
2   30    Male   China      2
6   70  Female   China      2
5   60    Male      UK      3
7   80  Female  Brazil      3

If you want to shuffle, do that at the very beginning, e.g. df = df.sample(frac=1), see: Shuffle DataFrame rows

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • I always forget about `.cumcount()`, which is a very useful function. Nice. – David Erickson Jun 06 '20 at 06:41
  • Dear John, Thank you so much for this strategy! It works! Saved me from writing a for-loop code with lots of if/else conditionals. To top it off with a whipped cream, may I please suggest David's modification to your original code: ```sort_values(['Sort_Column', 'Gender'], ascending=[True,False])``` – Osaama Shehzad Jun 06 '20 at 07:35
0

Create two new dataframes with a 'Sort_Column' and make the df_male dataframe even values and the df_female dataframe odd values. Then, use pd.concat to bring them back together and use .sort_values() on the 'Sort_Column'.

df = pd.DataFrame({'Age': [10, 20, 30, 40, 50, 60, 70, 80],
                   'Gender': ["Male", "Male", "Male", "Female", "Female", "Male", "Female", "Female"], 
'Country': ["US", "UK", "China", "Canada", "US", "UK", "China", "Brazil"]})
df['Sort_Column'] = 0
df_male = df.loc[df['Gender'] == 'Male'].reset_index(drop=True)
df_male['Sort_Column'] = df_male['Sort_Column'] + df_male.index*2
df_female = df1.loc[df1['Gender'] == 'Female'].reset_index(drop=True)
df_female['Sort_Column'] = df_female['Sort_Column'] + df_female.index*2 + 1
df_sorted=pd.concat([df_male, df_female]).sort_values('Sort_Column').drop('Sort_Column', axis=1).reset_index(drop=True)
df_sorted

Ouput:

    Age Gender  Country
0   10  Male    US
1   40  Female  Canada
2   20  Male    UK
3   50  Female  US
4   30  Male    China
5   70  Female  China
6   60  Male    UK
7   80  Female  Brazil
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • @John Zwinck Thank you for your solution! Very smart, I must say. But it runs into a problem when both genders have the same cumcount. Then the order of F->M->F->M breaks. Here is an example where the order breaks. The gender does not keep toggling between male and female as soon as they have common index. How do you propose we solve this problem? [link](https://drive.google.com/file/d/1N1IFk3lD-pVcqB1O8TsGqVfRD9bqtB3y/view?usp=sharing) – Osaama Shehzad Jun 06 '20 at 07:16
  • I think you replied to the wrong answer :) , but you can do this: `.sort_values(['Sort_Column', 'Gender'], ascending=[True,False])` – David Erickson Jun 06 '20 at 07:21
  • 1
    Dear David, my bad! Thank you so much for helping out! THis is what I love about a community of problem solvers! Together, we come up with best solutions. :) – Osaama Shehzad Jun 06 '20 at 07:33