4

I am trying to find a way to randomize the items in a dataframe row-wise, i.e. shuffle the values within each row. I found this thread on column-wise shuffling/permutation in pandas (shuffling/permutating a DataFrame in pandas), but for my purposes, is there a way to do something like the following?

import pandas as pd

data = {'day': ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri'],
       'color': ['Blue', 'Red', 'Green', 'Yellow', 'Black'],
       'Number': [11, 8, 10, 15, 11]}

dataframe = pd.DataFrame(data)
    Number   color    day
0      11    Blue    Mon
1       8     Red   Tues
2      10   Green    Wed
3      15  Yellow  Thurs
4      11   Black    Fri

And randomize the rows into something like

    Number   color    day
0      Mon    Blue    11
1      Red    Tues     8
2      10     Wed    Green
3      15    Yellow  Thurs
4      Black   11     Fri

I understand if the column headers would have to go away, or something along those lines, in order to do this.

EDIT: In the thread I linked, part of the code refers to an "axis" parameter. As I understand it, axis=0 there shuffles within the columns and axis=1 within the rows. I tried taking that code and changing the axis to 1, and it seems to randomize my dataframe only if the table consists entirely of numbers (as opposed to strings, or a combination of the two).

That said, should I consider not using dataframes? Is there a better 2D structure where I can randomize the rows and the columns if my data consists of only strings, or a combination of ints and strings?

avidman
  • Note: Zelazny7's answer http://stackoverflow.com/a/15772330/1240268 (or potentially my comment about using iloc) is IMO the best bet. – Andy Hayden Jul 11 '14 at 16:04
  • Oops, reopened as it's clearly different. Interested to know *why* you'd want to do this! – Andy Hayden Jul 11 '14 at 16:05
  • Well, I am creating somewhat of a randomizer for an experiment. In order to counterbalance appropriately, I want to be able to randomize the rows and the columns independently from each other, but the data inside the table isn't all ints, but rather, lists of strings, dictionaries, and such. That said, I am trying to find out if there is a way to basically do what was done in the link I posted (randomize column-wise) and apply that to rows. I was able to make this work, but only if the dataframe contains numbers only, though I want to extend the possibility to strings and such. – avidman Jul 11 '14 at 16:45
  • wouldn't it be "more random" to just shuffle the entire values? (ah, ha that's the accepted answer: great!) – Andy Hayden Jul 12 '14 at 00:01

3 Answers

4

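Edit: I misunderstood the question, which was to shuffle the values within each row rather than across the whole table (right?)

I don't think a dataframe makes much sense here, because the column names become meaningless once values move between columns. So you can just use a 2D numpy array (here `A` holds the dataframe's values, e.g. `A = data.values`):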

In [1]: A
Out[1]: 
array([[11, 'Blue', 'Mon'],
       [8, 'Red', 'Tues'],
       [10, 'Green', 'Wed'],
       [15, 'Yellow', 'Thurs'],
       [11, 'Black', 'Fri']], dtype=object)

In [2]: for row in A: np.random.shuffle(row)  # each row is a view, so A is shuffled in place (shuffle returns None)

In [3]: A
Out[3]: 
array([['Mon', 11, 'Blue'],
       [8, 'Tues', 'Red'],
       ['Wed', 10, 'Green'],
       ['Thurs', 15, 'Yellow'],
       [11, 'Black', 'Fri']], dtype=object)

And if you want to keep a dataframe (here `data` is the original dataframe):

In [4]: pd.DataFrame(A, columns=data.columns)
Out[4]: 
  Number  color     day
0    Mon     11    Blue
1      8   Tues     Red
2    Wed     10   Green
3  Thurs     15  Yellow
4     11  Black     Fri
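For reference, the row-wise steps above can be collected into one self-contained sketch. The function name `shuffle_within_rows` and its `seed` parameter are made up for illustration and are not part of the original answer. It also uses NumPy's newer `default_rng` generator instead of `np.random.shuffle`, and copies the values first so the input frame is not modified:

```python
import numpy as np
import pandas as pd

def shuffle_within_rows(df, seed=None):
    """Return a new DataFrame with the values of each row shuffled independently.

    Hypothetical helper for illustration; `seed` only makes runs repeatable.
    """
    rng = np.random.default_rng(seed)
    # An explicit object-dtype copy keeps ints and strings side by side
    # and leaves the original DataFrame untouched.
    values = df.to_numpy(dtype=object, copy=True)
    for row in values:
        rng.shuffle(row)  # each row is a 1-D view, so this reorders `values` in place
    return pd.DataFrame(values, index=df.index, columns=df.columns)

dataframe = pd.DataFrame({
    'Number': [11, 8, 10, 15, 11],
    'color': ['Blue', 'Red', 'Green', 'Yellow', 'Black'],
    'day': ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri'],
})
print(shuffle_within_rows(dataframe, seed=0))
```

Each row keeps exactly the same values, only in a different column order, so the column labels no longer describe what is underneath them.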

Here is a function that shuffles all the values across both rows and columns:

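import numpy as np
import pandas as pd

def shuffle(df):
    # Shuffle every value across the whole table, keeping shape and column labels.
    col = df.columns
    val = df.values
    shape = val.shape
    val_flat = val.flatten()       # flatten() returns a copy, so df is left untouched
    np.random.shuffle(val_flat)    # shuffles the flat copy in place
    return pd.DataFrame(val_flat.reshape(shape), columns=col)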

In [2]: data
Out[2]: 
   Number   color    day
0      11    Blue    Mon
1       8     Red   Tues
2      10   Green    Wed
3      15  Yellow  Thurs
4      11   Black    Fri

In [3]: shuffle(data)
Out[3]: 
  Number  color     day
0    Fri    Wed  Yellow
1  Thurs  Black     Red
2  Green   Blue      11
3     11      8      10
4    Mon   Tues      15

Hope this helps

jrjc
  • Similar to Happy001's post, I am grateful for the flatten bit, as it helps with my future plans in my project, but I need to shuffle/randomize row-wise. – avidman Jul 11 '14 at 15:50
  • @user3010693, Sorry I misunderstood, I edited the answer. Tell me if it fits your needs. – jrjc Jul 11 '14 at 16:15
1

Maybe flatten the 2d array and then shuffle?

In [21]: data2=dataframe.values.flatten()

In [22]: np.random.shuffle(data2)

In [23]: dataframe2 = pd.DataFrame(data2.reshape(dataframe.shape), columns=dataframe.columns)

In [24]: dataframe2
Out[24]: 
  Number   color    day
0   Tues  Yellow     11
1    Red   Green    Wed
2  Thurs     Mon   Blue
3     15       8  Black
4    Fri      11     10
Happy001
  • So, I never knew about flatten (which I find extremely useful, thanks!), but currently what I am trying to do is randomize within a row for each row. The next step would be randomizing within a column, but the row bit is troubling me first. Your code shuffles, but not row-wise =/. – avidman Jul 11 '14 at 15:48
  • FYI, you should use ``.ravel()`` rather than ``.flatten()`` as flatten *always* copies (ravel only if necessary) – Jeff Jul 11 '14 at 16:00
  • Thanks, @Jeff. BTW, in this case I guess `.ravel()` also copies due to different `dtypes`? – Happy001 Jul 11 '14 at 19:20
  • in this case it's copied *twice*! ``flatten`` *always* copies, ``ravel`` only if it cannot create a view. in this case ``ravel`` is seeing a single ``object`` dtype array, which it *may* be able to get a view of (this is numpy dependent). it probably doesn't make much difference in this case in any event. – Jeff Jul 11 '14 at 19:27
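
The ``flatten`` versus ``ravel`` point in the comments above can be checked directly with `np.shares_memory`. This is a small sketch on a plain integer array (the variable names are illustrative only), not on the mixed-type dataframe from the question:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # C-contiguous 2D array

flat_copy = a.flatten()             # always a new array
flat_view = a.ravel()               # a view here, because `a` is contiguous
copies_flatten = np.shares_memory(a, flat_copy)
shares_ravel = np.shares_memory(a, flat_view)

t = a.T                             # transposed: no longer C-contiguous
shares_ravel_t = np.shares_memory(t, t.ravel())  # ravel has to copy here

print(copies_flatten, shares_ravel, shares_ravel_t)  # False True False
```

One practical consequence: when ``ravel`` does return a view, shuffling it in place also reorders the array it came from, so ``flatten`` (or an explicit copy) is the safer choice if the original data must stay intact.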
1

Building on @jrjc's answer, I have posted https://stackoverflow.com/a/44686455/5009287 which uses np.apply_along_axis()

import numpy as np

a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]])
print(a)
[[10 11 12]
 [20 21 22]
 [30 31 32]
 [40 41 42]]

print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
 [22 21 20]
 [31 30 32]
 [40 41 42]]

See the full answer for how that could be integrated with a pandas dataframe.
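
As a rough idea of what that integration might look like (this is a sketch written for this thread, not the code from the linked answer; `shuffle_along` and `seed` are hypothetical names), the dataframe can be converted to an object array, permuted along one axis, and wrapped back up:

```python
import numpy as np
import pandas as pd

def shuffle_along(df, axis, seed=None):
    """Shuffle values within each row (axis=1) or within each column (axis=0).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    values = df.to_numpy(dtype=object)  # object dtype so ints and strings can mix
    # rng.permutation returns a shuffled copy of each 1-D slice along `axis`.
    shuffled = np.apply_along_axis(rng.permutation, axis, values)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)

df = pd.DataFrame({
    'Number': [11, 8, 10],
    'color': ['Blue', 'Red', 'Green'],
    'day': ['Mon', 'Tues', 'Wed'],
})
print(shuffle_along(df, axis=1, seed=1))  # values move between columns within a row
print(shuffle_along(df, axis=0, seed=1))  # values move between rows within a column
```

The returned frame has object-dtype columns throughout. After a column-wise shuffle, where each column still holds a single type, calling `.infer_objects()` on the result should restore the numeric dtype.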

Raphvanns