I have a data frame of the following form:

1 2 3 4 5 6 7 8 
A C C T G A T C
C A G T T A D N
Y F V H Q A F D

I need to randomly select a column k times, where k is the number of columns in the given sample. My program creates a list of k empty lists and then randomly selects a column from the dataframe to append to each list. Each list must be unique, so no column may appear more than once.

From the above example dataframe, an expected output should be something like:

[[2][4][6][1][7][3][5][8]]

However, I am getting results like:

[[1][1][3][6][7][8][8][2]]

What is the most pythonic way to go about doing this? Here is my sorry attempt:

k = len(df.columns)
k_clusters = [[] for i in range(k)]

for i in range(len(k_clusters)):
    for j in range(i + 1, len(k_clusters)):
        k_clusters[i].append((df.sample(1, axis=1)))
        if k_clusters[i] == k_clusters[j]:
            k_clusters[j].pop(0)
            k_clusters[j].append(df.sample(1, axis=1))
mandosoft
  • Just to make sure I understand, are you trying to shuffle the column names? – Mad Physicist Oct 28 '19 at 15:57
  • Or the columns themselves? Or something else entirely? I don't really understand your notation `[[2][4][6][1][7][3][5][8]]`. Could you please clarify? – Mad Physicist Oct 28 '19 at 16:01
  • `df.sample(frac=1, axis=1).to_numpy().T`? – ALollz Oct 28 '19 at 16:06
  • @MadPhysicist the full column (column names and values within them) are being shuffled. For simplicity I used just the column name since I don't want to print the amino acid values. – mandosoft Oct 28 '19 at 16:18

2 Answers

You can use numpy.random.shuffle to shuffle the column labels, which, from your question, is what I assume you want to do.

An example:

import numpy as np

to_shuffle = np.array(df.columns)
np.random.shuffle(to_shuffle)
print(to_shuffle)
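As a minimal sketch of the same idea on the question's data, the snippet below rebuilds the example frame and shuffles a copy of its labels. The frame construction, the name shuffled, and the seed value are assumptions added for illustration; the seeded Generator is swapped in for np.random.shuffle only to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (columns labelled 1..8).
df = pd.DataFrame(
    [list("ACCTGATC"), list("CAGTTADN"), list("YFVHQAFD")],
    columns=range(1, 9),
)

# Seeded generator so the run is repeatable; the seed itself is arbitrary.
rng = np.random.default_rng(0)

# permutation returns a shuffled copy, so df.columns itself is untouched.
shuffled = rng.permutation(df.columns.to_numpy())

print(shuffled)
```

Because a permutation only reorders the labels, every column appears exactly once, which rules out the repeated picks seen in the question.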
Koralp Catalsakal

Aside from the shuffling step, your question is very similar to "How to change the order of DataFrame columns?". Shuffling can be done in any number of ways in Python:

import numpy as np

cols = np.array(df.columns)  # copy of the column labels
np.random.shuffle(cols)      # shuffles the copy in place

Or using the standard library:

import random

cols = list(df.columns)  # copy of the column labels
random.shuffle(cols)     # shuffles the copy in place

You do not want to do cols = df.columns.values, because that will give you write access to the underlying column name data. You will then end up shuffling the column names in-place, messing up your dataframe.

Rearranging your columns is then easy:

df = df[cols]
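
Putting the pieces together, here is a hedged end-to-end sketch on the example data from the question. The frame construction, the original copy, and the final k_clusters list are assumptions for illustration, and random.sample is used as a non-mutating alternative to random.shuffle. The list of single-column frames is only a guess at the output shape the question is after.

```python
import random

import pandas as pd

# Rebuild the example frame from the question (columns labelled 1..8).
df = pd.DataFrame(
    [list("ACCTGATC"), list("CAGTTADN"), list("YFVHQAFD")],
    columns=range(1, 9),
)
original = df.copy()

# Sampling k = len(columns) labels without replacement is a full shuffle,
# so no column label can appear twice.
cols = random.sample(list(df.columns), k=len(df.columns))

# Reorder the frame; each column keeps its own values.
df = df[cols]

# One single-column DataFrame per cluster, in shuffled order
# (assumed output shape, not confirmed by the question).
k_clusters = [df[[c]] for c in df.columns]

print(list(df.columns))
```

Since the labels are drawn from a plain list rather than from df.columns.values, the column names of the original frame are never modified in place.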
Mad Physicist