How does one shuffle only one column of data in pandas?

I have a DataFrame with production data that I want to load onto dev for testing. However, the data contains personally identifiable information, so I want to shuffle those columns.

Columns: FirstName LastName Birthdate SSN OtherData

The original dataframe is created by read_csv, and I want to copy the data into a second dataframe for SQL loading while shuffling first name, last name, and SSN. I expected to be able to do this:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.shuffle(df[4])
    df1['HS_LAST_NAME'] = np.random.shuffle(df[6])
    df1['HS_SSN'] = np.random.shuffle(df[8])

However, when I try that I get the following error:

A value is trying to be set on a copy of a slice from a DataFrame

jpp
Arlo Guthrie
  • Please see the linked post, particularly [this answer](https://stackoverflow.com/a/53954986/4909087). – cs95 Jan 02 '19 at 16:04
  • In addition to resolving the error, an alternative way to shuffle with pandas is to use `df.sample(frac=1)`. E.g. `df1['HS_FIRST_NAME'] = df[4].sample(frac=1)`. – Chris Jan 02 '19 at 16:06
  • df[4].sample(frac=1) runs without error but does not appear to shuffle the data. – Arlo Guthrie Jan 02 '19 at 16:58
  • Just curious ... where in that 10,000 line answer does he point out how to shuffle one column of data in a dataframe? :D – Arlo Guthrie Jan 02 '19 at 17:01
  • The answer is that it could be as simple as numpy.random.shuffle(df['column_name']). However, Python will throw a warning because pandas does not want you to alter columns that are indexed. The better way is to create a NumPy array and then shuffle it (myarray = df['column_name'].values; numpy.random.shuffle(myarray)). If you then need to insert that data into a dataframe, you simply convert it back to a series (df['randomized_column'] = pd.Series(myarray)). – Arlo Guthrie Jan 02 '19 at 18:56
  • @coldspeed, I reopened this one, felt there is a trivial way to amend OP's algorithm to do what they want. Possibly an XY problem. – jpp Jan 02 '19 at 22:41

2 Answers

The immediate error is a symptom of using an inadvisable approach when working with dataframes.

np.random.shuffle works in place and returns None, so assigning its return value to a column stores None rather than the shuffled data. In fact, in-place operations are rarely required and often yield no material benefit.
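
To make the point above concrete, here is a minimal sketch. The sample names and the variable names are made up for illustration and are not taken from the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for one column of the real frame
names = pd.Series(["Ann", "Bob", "Cy", "Dee"])

arr = names.to_numpy().copy()      # copy so the original series is left untouched
result = np.random.shuffle(arr)    # reorders `arr` in place

print(result)                                 # None
print(sorted(arr) == sorted(names.tolist()))  # True: same values, possibly reordered

df1 = pd.DataFrame(index=names.index)
df1["HS_FIRST_NAME"] = result              # this is what the question's code assigns
print(df1["HS_FIRST_NAME"].isna().all())   # True: the column holds only None
```

The shuffled data ends up in `arr`, not in the value that gets assigned, which is why the original approach cannot work as written.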

Here, for example, you can use np.random.permutation, which returns a shuffled copy, and pass it a NumPy array obtained via pd.Series.values rather than a series:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
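
For anyone who wants to try this without the original file, a self-contained sketch follows. The small frame, the value of devprod, and the `targets` mapping are assumptions made up for the example; only the integer column labels 4, 6 and 8 and the HS_* names come from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by read_csv
df = pd.DataFrame({
    4: ["Ann", "Bob", "Cy", "Dee"],
    6: ["Ames", "Berg", "Cole", "Dunn"],
    8: ["111", "222", "333", "444"],
})
devprod = "dev"  # assumed flag value

df1 = pd.DataFrame(index=df.index)
targets = {"HS_FIRST_NAME": 4, "HS_LAST_NAME": 6, "HS_SSN": 8}  # hypothetical mapping

for target, source in targets.items():
    if devprod == "prod":
        df1[target] = df[source]  # keep production order
    else:
        # permutation returns a new shuffled array; df[source] is not modified
        df1[target] = np.random.permutation(df[source].values)

print(df1)
```

Each column is permuted independently, so a shuffled first name is generally no longer paired with its original last name or SSN.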
jpp

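This also does the job. sample(frac=1) returns the values in random order but keeps their original index labels, so .values is needed to stop the assignment from realigning them into the original order: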

df1['HS_FIRST_NAME'] = df[4].sample(frac=1).values
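
A small sketch of why the .values part matters, and why the comment under the question saw no shuffling without it. The data and the names `s`, `shuffled` and `out` are made up for illustration, and random_state is set only to make the run repeatable:

```python
import pandas as pd

s = pd.Series(["Ann", "Bob", "Cy", "Dee"])   # hypothetical data
shuffled = s.sample(frac=1, random_state=0)  # random order, original index labels kept

out = pd.DataFrame(index=s.index)
out["aligned"] = shuffled            # series: matched by index label on assignment
out["positional"] = shuffled.values  # plain array: assigned by position

print(out["aligned"].tolist() == s.tolist())                # True: original order is back
print(out["positional"].tolist() == shuffled.tolist())      # True: shuffled order is kept
```

Assigning a series aligns on the index, which silently undoes the shuffle; stripping the index with .values avoids that.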
jeremy_rutman