431

I am struggling to find a function that returns a specified number of rows picked at random, without replacement, from a data frame in R. Can anyone help me out?

Peter O.
nikhil

13 Answers

551

First make some data:

> df = data.frame(matrix(rnorm(20), nrow=10))
> df
           X1         X2
1   0.7091409 -1.4061361
2  -1.1334614 -0.1973846
3   2.3343391 -0.4385071
4  -0.9040278 -0.6593677
5   0.4180331 -1.2592415
6   0.7572246 -0.5463655
7  -0.8996483  0.4231117
8  -1.0356774 -0.1640883
9  -0.3983045  0.7157506
10 -0.9060305  2.3234110

Then select some rows at random:

> df[sample(nrow(df), 3), ]
           X1         X2
9  -0.3983045  0.7157506
2  -1.1334614 -0.1973846
10 -0.9060305  2.3234110
John Colby
  • 4
    @nikhil See [here](http://cran.r-project.org/manuals.html) and [here](http://cran.r-project.org/faqs.html) for starters. You can also type `?sample` in the R console to read about that function. – joran Nov 25 '11 at 19:50
  • 17
    Can someone explain why sample(df,3) does not work? Why do you need df[sample(nrow(df), 3), ]? – stackoverflowuser2010 Jan 15 '14 at 08:03
  • 5
    @stackoverflowuser2010, you can type ?sample and see that the first argument in the sample function must be a vector or a positive integer. I don't think a data.frame works as a vector in this case. – David Braun Jan 31 '14 at 02:43
  • 13
    Remember to set your seed (e.g. `set.seed(42)` ) every time you want to reproduce that specific sample. – CousinCocaine Apr 10 '14 at 08:47
  • 2
    `sample.int` would be slightly faster I believe: `library(microbenchmark);microbenchmark( sample( 10000, 100 ), sample.int( 10000, 100 ), times = 10000 )` – Ari B. Friedman Nov 01 '14 at 15:04
  • @stackoverflowuser2010 On a data frame, sample selects random columns (eg your variables) instead of random rows (your observations). So you have to sample row indexes instead of the data frame. – Roger Filmyer Nov 30 '14 at 21:07
  • Is there a way to have the random rows be consecutive? – user2113499 Sep 16 '15 at 20:17
  • I want to apply this function n (say 1000) times to a dataframe to randomly extract a specified number of rows (with replacement)n times. That is, I want to repeat this function n times (with replacement) to get n random subsets. How do I do it? – Davide Piffer May 04 '19 at 15:27
  • @DavidePiffer `replicate(1000, df[sample(nrow(df), 3), ], simplify=FALSE)` – John Colby May 07 '19 at 22:48
  • 1
    Is this with replacement or without? – mLstudent33 Feb 15 '20 at 04:43
  • @JohnColby If I want to save 2 dataframe (1 for randomly selected and 2nd for the rest of the row of the dataframe) then how, I have to write? THat's mean, for the row number (1, 3, 4, 5, 6, 7, 8) how I will save them? – 0Knowledge Feb 21 '21 at 02:08
  • Besides not working otherwise, why is it necessary to have the comma after "3)"? – somehume Mar 18 '21 at 06:25
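Several of the questions in the comments above (why the comma is needed, whether sampling is with or without replacement, why sample(df, 3) misbehaves) can be checked with a short sketch. The seed and the toy data frame are assumptions for illustration only, not part of the answer:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(matrix(rnorm(20), nrow = 10))  # 10 rows, columns X1 and X2

# sample() defaults to replace = FALSE, so the row indices are distinct
idx <- sample(nrow(df), 3)
rows <- df[idx, ]  # trailing comma: "these rows, all columns"

# Without a comma, single-bracket indexing treats the data frame as a list
# of columns: df[1] is the first column, df[1, ] is the first row
first_col <- df[1]
first_row <- df[1, ]

# For the same reason, sample(df, 1) draws one random *column*
one_col <- sample(df, 1)

# Drawing more rows than exist only works with replace = TRUE
boot <- df[sample(nrow(df), 15, replace = TRUE), ]
```

So the answer's one-liner samples without replacement unless you pass replace = TRUE, and the comma is what makes the indices refer to rows rather than columns.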
285

John Colby's answer is the right one. However, if you are a dplyr user, there is also sample_n:

sample_n(df, 10)

This randomly samples 10 rows from the data frame. It calls sample.int, so it really is the same answer with less typing (and it is easier to use in a magrittr pipeline, since the data frame is the first argument).

Jaap
kasterma
  • 10
    As of dplyr 1.0.0, sample_n (and sample_frac) have been superseded by slice_sample, though they remain for now. – Matt_B Sep 04 '20 at 02:50
  • This appears to sample without replacement, and hence also outputs a sample of size min(nrow(df), 10), so this might not be what is needed. – user11130854 Feb 03 '21 at 09:46
45

With the data.table package you can use the idiom DT[sample(.N, M)], which samples M random rows from the data table DT.

library(data.table)
set.seed(10)

mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
gented
36

Write one! Wrapping JC's answer gives me:

randomRows = function(df,n){
   return(df[sample(nrow(df),n),])
}

Now make it better by first checking that n <= nrow(df) and stopping with an error if it is not.
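One way that check might look. This is only a minimal sketch: the error message wording and the drop = FALSE argument are additions for illustration, not part of the original function.

```r
randomRows <- function(df, n) {
  # Refuse to draw more rows than exist, since sampling is without replacement
  if (n > nrow(df)) {
    stop("n (", n, ") is larger than the number of rows (", nrow(df), ")")
  }
  # drop = FALSE keeps the result a data frame even if df has a single column
  df[sample.int(nrow(df), n), , drop = FALSE]
}

# Usage with the built-in mtcars data set (32 rows)
picked <- randomRows(mtcars, 5)
```

Calling randomRows(mtcars, 100) now stops with an explicit message instead of relying on whatever error sample() itself raises.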

Spacedman
29

Just for completeness' sake:

dplyr can also draw a proportion (fraction) of the rows with

df %>% sample_frac(0.33)

This is very convenient, e.g. in machine learning, when you need a fixed split ratio such as 80%:20%.
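A sketch of such an 80:20 split, assuming the data has a unique key column (here a hypothetical id column) so the held-out rows can be recovered with anti_join. The toy data and the seed are illustrative only:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(id = 1:10, x = rnorm(10))  # toy data; id is a hypothetical unique key

# Draw 80% of the rows at random, without replacement
train <- df %>% sample_frac(0.8)

# Everything that was not drawn becomes the test set
test <- anti_join(df, train, by = "id")
```

Without a unique key, you could add one first with mutate(id = row_number()).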

Agile Bean
11

As @Matt_B indicates, sample_n() and sample_frac() have been superseded by slice_sample(). See the dplyr docs.

Example from docstring:

# slice_sample() allows you to random select with or without replacement
mtcars %>% slice_sample(n = 5)
mtcars %>% slice_sample(n = 5, replace = TRUE)

M_Merciless
9

EDIT: This answer is now outdated, see the updated version.

In my R package I have enhanced sample so that it now behaves as expected also for data frames:

library(devtools); install_github('kimisc', 'krlmlr')

library(kimisc)
example(sample.data.frame)

smpl..> set.seed(42)

smpl..> sample(data.frame(a=c(1,2,3), b=c(4,5,6),
                           row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

This is achieved by making sample an S3 generic method and providing the necessary (trivial) functionality in a function. A call to setMethod fixes everything. The original implementation still can be accessed through base::sample.

Community
krlmlr
  • 1
    What is unexpected about its treatment of data frames? – a different ben Aug 23 '13 at 05:20
  • 2
    @adifferentben: When I call `sample.default(df, ...)` for a data frame `df`, it samples from the *columns* of the data frame, as a data frame is implemented as a list of vectors of the same length. – krlmlr Aug 23 '13 at 07:05
  • Is your package still available? I ran `install_github('kimisc', 'krlmlr')` and got `Error: Does not appear to be an R package (no DESCRIPTION)`. Any way around that? – terdon Aug 26 '13 at 14:23
  • Sorry to bug you again but since you wrote this (great) package, do you think you could comment on [this](http://stackoverflow.com/q/18492844/1081936)? – terdon Aug 28 '13 at 16:00
  • @krlmlr I don't agree with you. Nice functionality in your package, but sample() works on a data frame as expected. You confuse a data frame with a matrix. It's not. It's a list. It's indeed not intuitive to see it that way, but that's because far too many people never realized a data frame is a list. Also note that installing your package may break other code dependent on the original behaviour of sample(). – Joris Meys Sep 06 '13 at 09:27
  • 1
    @JorisMeys: Agreed, except for the "as expected" part. Just because a data frame is *implemented* as a list internally, it doesn't mean it should *behave* as one. The `[` operator for data frames is a counterexample. Also, please tell me: Have you ever, just one single time, used `sample` to sample columns from a data frame? – krlmlr Sep 06 '13 at 10:01
  • 1
    @krlmlr The [ operator is not a counterexample: `iris[2]` works like a list, as does `iris[[2]]`. Or `iris$Species`, `lapply(iris, mean)`, ... Data frames are lists. So I expect them to behave like them. And yes, I have actually used sample(myDataframe). On a dataset where every variable contains expression data of a single gene. Your specific method helps novice users, but also effectively changing the way `sample()`behaves. Note I use "as expected" from a programmer's view. Which is different from the general intuition. There's a lot in R that's not compatible with general intuition... ;) – Joris Meys Sep 06 '13 at 14:19
  • @JorisMeys: Fair enough. I was wrong assuming that no one would ever use `sample(dataframe)`... I'll change the function name to `sample.rows` and not use it as S3 method. -- Concerning `[`, I was referring to the `myList[i, j]` syntax. – krlmlr Sep 06 '13 at 21:01
  • I found this StackOverflow question because I'm new to R, and I just tried sample(dataframe), resulting in unexpected bizarreness. I agree with krlmir here. Why does sample(dataframe, 3) not give me 3 random rows from dataframe? – stackoverflowuser2010 Jan 15 '14 at 08:05
  • @stackoverflowuser2010: See [the updated version of this answer](http://stackoverflow.com/a/16538269/946850) for a solution. – krlmlr Jan 15 '14 at 11:43
9

Outdated answer. Please use dplyr::sample_frac() or dplyr::sample_n() instead.

In my R package there is a function sample.rows just for this purpose:

install.packages('kimisc')

library(kimisc)
example(sample.rows)

smpl..> set.seed(42)

smpl..> sample.rows(data.frame(a=c(1,2,3), b=c(4,5,6),
                               row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

Enhancing sample by making it a generic S3 function was a bad idea, according to comments by Joris Meys to a previous answer.

krlmlr
  • A note from `?sample_frac`: "*[Superseded]* ‘sample_n()’ and ‘sample_frac()’ have been superseded in favour of ‘slice_sample()’" – quickshiftin Apr 16 '22 at 17:56
8

You could do this:

library(dplyr)

cols <- paste0("a", 1:10)
# Name the matrix columns up front so as_tibble() needs no name repair
tab <- matrix(1:1000, nrow = 100, dimnames = list(NULL, cols)) %>% as_tibble()
tab
# A tibble: 100 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     1   101   201   301   401   501   601   701   801   901
 2     2   102   202   302   402   502   602   702   802   902
 3     3   103   203   303   403   503   603   703   803   903
 4     4   104   204   304   404   504   604   704   804   904
 5     5   105   205   305   405   505   605   705   805   905
 6     6   106   206   306   406   506   606   706   806   906
 7     7   107   207   307   407   507   607   707   807   907
 8     8   108   208   308   408   508   608   708   808   908
 9     9   109   209   309   409   509   609   709   809   909
10    10   110   210   310   410   510   610   710   810   910
# ... with 90 more rows

Above I just made a dataframe with 10 columns and 100 rows, ok?

Now you can sample it with sample_n:

sample_n(tab, size = 800, replace = TRUE)
# A tibble: 800 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1    53   153   253   353   453   553   653   753   853   953
 2    14   114   214   314   414   514   614   714   814   914
 3    10   110   210   310   410   510   610   710   810   910
 4    70   170   270   370   470   570   670   770   870   970
 5    36   136   236   336   436   536   636   736   836   936
 6    77   177   277   377   477   577   677   777   877   977
 7    13   113   213   313   413   513   613   713   813   913
 8    58   158   258   358   458   558   658   758   858   958
 9    29   129   229   329   429   529   629   729   829   929
10     3   103   203   303   403   503   603   703   803   903
# ... with 790 more rows
igorkf
5

Select a random sample from a tibble in R:

library("tibble")
a <- your_tibble[sample(1:nrow(your_tibble), 150),]

nrow takes a tibble and returns the number of rows. The first argument passed to sample is the range from 1 to the last row of your tibble. The second argument, 150, is how many rows you want to sample. The square brackets then select the rows at the indices sample returned, and the variable 'a' holds the resulting random sample.

Eric Leschinski
4

You could do this:

sample_data = data[sample(nrow(data), sample_size, replace = FALSE), ]
Mohammad
4

The 2021 way of doing this in the tidyverse is:

library(tidyverse)

df = data.frame(
  A = letters[1:10],
  B = 1:10
)

df
#>    A  B
#> 1  a  1
#> 2  b  2
#> 3  c  3
#> 4  d  4
#> 5  e  5
#> 6  f  6
#> 7  g  7
#> 8  h  8
#> 9  i  9
#> 10 j 10

df %>% sample_n(5)
#>   A  B
#> 1 e  5
#> 2 g  7
#> 3 h  8
#> 4 b  2
#> 5 j 10

df %>% sample_frac(0.5)
#>   A  B
#> 1 i  9
#> 2 g  7
#> 3 j 10
#> 4 c  3
#> 5 b  2

Created on 2021-10-05 by the reprex package (v2.0.0.9000)

abalter
2

I'm new to R, but I have been using this easy method that works for me:

sample_of_diamonds <- diamonds[sample(nrow(diamonds),100),]

PS: Feel free to note if it has some drawback I'm not thinking about.

Leopoldo Sanczyk
  • Suppose, I have 1000 rows in my df. After applying your code 100 rows will be selected randomly and then how I can store the rest of the 900 rows (which one did not select randomly)? – 0Knowledge Feb 21 '21 at 02:16
  • 1
    @Akib62 try `(rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)])` – Leopoldo Sanczyk Feb 22 '21 at 04:22
  • Not working. When I am using your code (given in the comment) getting the same output as the `diamonds` or `main dataset`. – 0Knowledge Mar 14 '21 at 17:10
  • @Akib62 since that selects the elements not in `sample_of_diamonds`, can you confirm `sample_of_diamonds` is not empty? That could explain your problem. – Leopoldo Sanczyk Mar 15 '21 at 18:22
  • Say, I have 20 rows in my dataset. So when I am applying `sample_of_diamonds <- diamonds[sample(nrow(diamonds),10),]` I am getting `10 rows randomly` and `rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)]` I am getting `20 rows (main dataset)` – 0Knowledge Mar 16 '21 at 00:18
  • @Akib62 I guess you checked it, and those 10 rows in `sample_of..` dataset are efectively inside the `rest_of...` dataset. That's weird, because the line says explicitly to ignore those in the main dataset. Could be the format, some type casting? Did you try to check that, or compare the content of some row in both sets (coding)? – Leopoldo Sanczyk Mar 17 '21 at 03:33
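On the question in this comment thread of keeping the 900 rows that were not selected: as far as I can tell, `%in%` applied to two data frames compares their columns, not their rows, which would explain getting the whole dataset back. A simpler route is to keep the sampled row indices and negate them. A minimal sketch, with a toy data frame standing in for diamonds (the names df, sample_df and rest_df are illustrative):

```r
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(id = 1:20, value = rnorm(20))  # stand-in for the real data

# Keep the sampled indices instead of only the sampled rows
idx <- sample(nrow(df), 10)

sample_df <- df[idx, ]   # the 10 randomly selected rows
rest_df   <- df[-idx, ]  # every row that was not selected
```

Negative indexing assumes at least one row was sampled; with an empty idx, df[-idx, ] returns no rows, so setdiff(seq_len(nrow(df)), idx) is the safer choice in that case.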