Shuffle one column in pandas dataframe

Question

How does one shuffle only one column of data in pandas?

I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.

Columns: FirstName LastName Birthdate SSN OtherData

If the original dataframe is created by read_csv and I want to translate the data into a second dataframe for sql loading but shuffle first name, last name, and SSN, I would have expected to be able to do this:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])

However, when I try that I get the following error:

A value is trying to be set on a copy of a slice from a DataFrame

Please see the linked post, particularly [this answer](https://stackoverflow.com/a/53954986/4909087). — cs95, Jan 02 '19 at 16:04
In addition to resolving the error, an alternative way to shuffle with pandas is to use `df.sample(frac=1)`. E.g. `df1['HS_FIRST_NAME'] = df[4].sample(frac=1)`. — Chris, Jan 02 '19 at 16:06
df[4].sample(frac=1) runs without error but does not appear to shuffle the data. — Arlo Guthrie, Jan 02 '19 at 16:58
Just curious ... where in that 10,000 line answer does he point out how to shuffle one column of data in a dataframe? :D — Arlo Guthrie, Jan 02 '19 at 17:01
The answer is that it could be as simple as `numpy.random.shuffle(df['column_name'])`. However, pandas will raise a warning because it does not want you to alter columns that are indexed. The better way is to create a NumPy array and then shuffle it: `myarray = df['column_name'].values`, then `numpy.random.shuffle(myarray)`. If you then need to insert that data into a dataframe, you simply convert it back to a series: `df['randomized_column'] = pd.Series(myarray)`. – Arlo Guthrie, Jan 02 '19 at 18:56
@coldspeed, I reopened this one, felt there is a trivial way to amend OP's algorithm to do what they want. Possibly an XY problem. — jpp, Jan 02 '19 at 22:41

jpp · Accepted Answer · 2019-01-02T22:38:38.183

The immediate error is a symptom of using an inadvisable approach when working with dataframes.

np.random.shuffle works in-place and returns None, so assigning its return value to a column stores None rather than the shuffled data. In fact, in-place operations are rarely required, and often yield no material benefit.
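As a minimal sketch of that difference (the array contents and variable names here are made up for illustration), the following compares the return values of the two functions:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# shuffle reorders the array in place and returns None
result = np.random.shuffle(values)
print(result)                   # None
print(sorted(values.tolist()))  # [1, 2, 3, 4, 5] (same elements, order changed in place)

# permutation leaves its input untouched and returns a shuffled copy
original = np.array([1, 2, 3, 4, 5])
shuffled = np.random.permutation(original)
print(original.tolist())        # [1, 2, 3, 4, 5]
```

Because `result` is None, an assignment such as `df1['HS_SSN'] = np.random.shuffle(df[8])` cannot store the shuffled values.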

Here, for example, you can use np.random.permutation, which returns a shuffled copy, and pass it NumPy arrays via pd.Series.values rather than series:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
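For anyone who wants to try the approach on its own, here is a self-contained sketch. The sample rows, the `shuffled` helper, and the seeded `rng` are assumptions added for illustration and are not part of the original code. The integer column labels mirror `df[4]`, `df[6]`, and `df[8]` from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the frame produced by read_csv
df = pd.DataFrame({
    4: ["Ann", "Bob", "Cid", "Dee"],
    6: ["Adams", "Brown", "Clark", "Davis"],
    8: ["111", "222", "333", "444"],
})

def shuffled(column, rng):
    """Return the column's values as a shuffled NumPy array (input unchanged)."""
    return rng.permutation(column.to_numpy())

# Seed chosen only so that the sketch is repeatable
rng = np.random.default_rng(0)

df1 = pd.DataFrame(index=df.index)
df1["HS_FIRST_NAME"] = shuffled(df[4], rng)
df1["HS_LAST_NAME"] = shuffled(df[6], rng)
df1["HS_SSN"] = shuffled(df[8], rng)

print(df1)
```

Each column is permuted independently, so the first name, last name, and SSN in a given row of `df1` generally no longer come from the same source row.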

score 18 · Answer 2 · answered Jul 27 '20 at 10:37

This also appears to do the job:

df1['HS_FIRST_NAME'] = df[4].sample(frac=1).values

jeremy_rutman


Note that `.values` or `.to_numpy()` after `.sample` is mandatory; otherwise pandas realigns the sampled series on its index during assignment, and the column is saved in its original order without shuffling. – Anatoly Alekseev Aug 08 '23 at 23:41
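The index alignment mentioned in the comment above can be seen in a small sketch. The `name` column, its contents, and the fixed `random_state` are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cid", "Dee"]})

# random_state is fixed only so that the sketch is repeatable
sampled = df["name"].sample(frac=1, random_state=1)

# Assigning a Series aligns on the index, which restores the original order
df["aligned"] = sampled

# Assigning a bare array is positional, so the sampled order is kept
df["shuffled"] = sampled.to_numpy()

print(df)
print(bool((df["aligned"] == df["name"]).all()))  # True
```

This is also why `df[4].sample(frac=1)` on its own appears not to shuffle anything once it is assigned back to a column.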
