Randomizing rows in a dataset using R

Question

I searched for this extensively, but all the examples I found randomize the row order, not the data within each row. I am trying to create a dataset in which the values themselves are shuffled, so that each column is randomized independently of the others.

I'm trying to turn df into df2:

df:

df <- data.frame(a = c(1:5),
                 b = c(LETTERS[1:5]),
                 c = c(letters[1:5]))

  a b c
1 1 A a
2 2 B b
3 3 C c
4 4 D d
5 5 E e

df2:

  a b c
1 2 D b
2 1 B d
3 4 E c
4 3 A a
5 5 C e

I think the reason there are not many solutions for this is that people usually need to keep their data intact, but in this case I'm deliberately trying to break the dataset, so that the entries in each row no longer belong together.

Currently all I can achieve is


df2 <- df[sample(1:nrow(df)), ]

  a b c
3 3 C c
4 4 D d
2 2 B b
1 1 A a
5 5 E e

which randomizes the order of the rows but keeps the data intact.

Thank you! It works! Trying to find a way to mark it as correct answer. Could you explain df; df2[] briefly? — puredata, Nov 07 '20 at 05:19
`df2 <- df` copies the dataframe to a new object so the original doesn't get overwritten; skip that if you don't care. Because a dataframe is a list of columns, `lapply(df2, sample)` calls `sample()` on each column and returns the results in a list. `df2[] <-` assigns that list back to `df2`, but because of the `[]`, it assigns to a subset of the object (which happens to be the whole thing here), so it keeps its data frame class instead of overwriting it with a new object like `df2 <-` would. — alistaire, Nov 07 '20 at 05:34
thanks a lot! didn't know about using semicolons like a new line. to refine the use of this; what should I try if I only want to randomize some columns? I tried subsetting on df2 inside the lapply, but it didn't work as expected. — puredata, Nov 07 '20 at 07:59
It's list subsetting, so subset the dataframe as you would a list, with a single set of indices for columns, e.g. `iris[1:4] <- lapply(iris[1:4], sample)`. Make sure you assign to the same columns you're iterating on, though, or things will get weird. — alistaire, Nov 07 '20 at 09:04
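
The approach described in these comments can be sketched as follows. This is a minimal base R sketch, not the answerer's exact code: it rebuilds `df` from the question, the `set.seed()` call is an assumption added only to make the run reproducible, and the `cols` vector is a hypothetical choice of columns to shuffle.

```r
# Example data from the question
df <- data.frame(a = 1:5, b = LETTERS[1:5], c = letters[1:5])

set.seed(1)  # assumption: only here so the sketch is reproducible

# Shuffle every column independently.
# `df2[] <-` assigns into the existing data frame, so it stays a data frame.
df2 <- df
df2[] <- lapply(df2, sample)

# Shuffle only some columns: assign to the same columns you iterate over.
cols <- c("b", "c")  # hypothetical selection
df3 <- df
df3[cols] <- lapply(df3[cols], sample)
```

Each column of `df2` holds the same values as in `df`, only in a different order, and `df3$a` is left untouched. One caveat worth knowing: `sample(x)` treats a single number `x` as `1:x`, so a one-row data frame with numeric columns would not behave as intended.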

score 1 · Answer 1 · answered Nov 07 '20 at 05:16


You can apply sample to each column of the dataframe.

library(dplyr)
# Apply sample() to every column independently
df2 <- df %>% mutate(across(everything(), .fns = sample))
# In older versions of `dplyr`, use `mutate_all()`:
# df2 <- df %>% mutate_all(sample)

#  a b c
#1 5 C c
#2 3 B e
#3 2 E d
#4 4 D b
#5 1 A a

answered Nov 07 '20 at 05:16

Ronak Shah


what is the .fns here about? – hachiko Nov 07 '20 at 06:02
It is to specify the function that we want to apply to each column. – Ronak Shah Nov 07 '20 at 06:07
Thank you, for my use case this works as well. But when I try using this on mtcars dataset, it flattens row names which are car names in this case. Any idea why and how to avoid? Also how can I find more info about .fns usage? – puredata Nov 07 '20 at 08:27
Tibbles don't support row names, so if you want to keep the row-name information you need to add it as a separate column first: `mtcars %>% tibble::rownames_to_column() %>% mutate(across(everything(), .fns = sample))`. For more information about `.fns`, see `?across`. – Ronak Shah Nov 07 '20 at 08:32
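
The row-name workaround from the last comment could look roughly like this. It is a sketch under a few assumptions: the `dplyr` and `tibble` packages are installed, the column name `car` is a hypothetical choice, and leaving the name column out of the shuffle (and restoring it as row names at the end) is a variation on the comment rather than something the answerer wrote.

```r
library(dplyr)
library(tibble)

set.seed(42)  # assumption: only here so the sketch is reproducible

shuffled <- mtcars %>%
  rownames_to_column(var = "car") %>%  # "car" is a hypothetical column name
  mutate(across(-car, sample)) %>%     # shuffle every column except the names
  column_to_rownames(var = "car")      # optional: turn the names back into row names
```

Because every column is shuffled independently, it makes no practical difference whether the name column is shuffled too; excluding it just keeps the row labels in their original order.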
