2

I want to shuffle the dataframe keeping set of rows together. The number of rows together is not constant but I have a column marking them with same Id.

For EX: In the below data first column is the unique marker specifying which rows needs to be together while shuffling.

2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 46 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 4 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 39 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 10 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 7 -135.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 0 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 35 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 5 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 47 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 12 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 13 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 20 -201.00
2 56.00 1 0.83 2.16 3147890 3120000.00 1 201.00 0 -201.00 116.00 75.88 201.00 232.00 105.74 201.00 168.00 75.88 42 -201.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 46 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 4 -95.00 
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 39 -46.00 
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 10 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 7 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 0 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 35 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 5 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 47 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 12 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 13 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 20 -135.00
4 93.00 1 0.34 3.62 4121000 5340000.00 1 135.00 0 -135.00 78.00 120.53 135.00 10.00 2.67 135.00 313.00 120.53 42 -135.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 46 -2730.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 4 -2458.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 39 -2758.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 10 -2758.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 7 -2554.00
6 74.00 0 2.35 2.89 1680840 2940000.00 11 2758.00 0 -2758.00 296.00 74.46 261.00 176.00 75.84 304.00 304.00 74.46 0 -2568.00
sranga
  • 123
  • 1
  • 7
  • 1
    Can you provide an example of the output you are looking for? I can think of a couple of interpretations of your question. Will the end result still have the ones with the 2 markers first but shuffled withing that group? Are the groups kept in their original order but shuffled as a whole? – Zev Jul 02 '18 at 14:14
  • In the above data I have 13 rows with unique marker (based on column 1) 2, 13 rows with marker 4, 6 rows with marker 6. After shuffling , one outcome could be all rows with marker 6 first, then rows with marker 2 and then marker 4. – sranga Jul 02 '18 at 14:15
  • 1
    Do you want to shuffle groups or rows with a group? Maybe both? the order of the groups and the rows in each group? – Scott Boston Jul 02 '18 at 14:16
  • @ScottBoston - Sorry, I did not understand your question. But hopefully, my above comment explains as what I need to achieve. Thanks. – sranga Jul 02 '18 at 14:20
  • @ScottBoston - Not sort but shuffle the data !! – sranga Jul 02 '18 at 14:23

2 Answers2

5

You can use this generator with np.random.choice on unique col1, the pd.concat to re-assemble "groups".

import numpy as np
pd.concat((df[df['col1'] == i] for i in np.random.choice(df['col1'].unique(),
                                                         df['col1'].nunique())))

Details, first get the unique values from 'col1' as list using unique, then select random elements from this list using np.random.choice. Use that selection to boolean select parts ("group") of the dataframe inside a generator using for-in syntax and lastly, use pd.concat to re-assemble the dataframe in to random groups.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
1

It is unclear what end result you are looking for but regardless, the first step is probably the same. Group the dataframes into separate ones based on that column. Shuffle and recombine as desired.

Recombining can be done by storing the shuffled dataframes as a list and then pd.concat. You can optionally shuffle the list first:

from random import shuffle
shuffle(dfs)    

Using this data set:

2 a2
2 b2
2 c2
3 a3
3 b3
3 c3
3 d3
4 a4
4 b4

This code:

import pandas as pd

df =  pd.read_csv("shuffle.txt", header=None, delim_whitespace=True)
dfs = [x for _, x in df.groupby(df[0])]
from random import shuffle
#shuffle(dfs)
new_dfs = []
for df in dfs:
    df = df.sample(frac=1)
    new_dfs.append(df)

final_df = pd.concat(new_dfs)
print(final_df)

Gets you:

   0   1
2  2  c2
0  2  a2
1  2  b2
5  3  c3
3  3  a3
6  3  d3
4  3  b3
8  4  b4
7  4  a4

Uncommenting the shuffle line gets you:

   0   1
8  4  b4
7  4  a4
6  3  d3
5  3  c3
4  3  b3
3  3  a3
0  2  a2
1  2  b2
2  2  c2
Zev
  • 3,423
  • 1
  • 20
  • 41