First of all, thank you for everything you are doing. I am new to R and trying to wrap my head around a problem. I have a small data frame (e.g., n = 10) and want to build a new, larger data frame from its observations (e.g., n = 15). The one condition is that every row of the old data frame must appear at least once in the new one. I could not achieve this with sample(): some rows were missing some of the time.

EDIT Simple Example:

df = data.frame(matrix(rnorm(20), nrow = 10))
df[sample(nrow(df), 14, replace = TRUE), ]
            X1         X2
9    0.5881409  0.1967030
2    1.1227569  1.9827646
1    1.2225747  0.3428867
10  -0.2780021 -2.3581644
4    0.4687276 -2.2431019
5    1.4592202 -0.6397336
7   -0.8779913  0.4293624
3   -0.1663962 -0.2435444
3.1 -0.1663962 -0.2435444
3.2 -0.1663962 -0.2435444
1.1  1.2225747  0.3428867
1.2  1.2225747  0.3428867
6   -1.0797652 -1.1893041
7.1 -0.8779913  0.4293624

However, we see that for example row 8 is missing.
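
To see which rows a plain sample() draw leaves out, the drawn indices can be compared with the full set of row numbers. This is a minimal sketch that assumes a 10-row data frame like the one above; the names idx and missing_rows and the seed are illustrative only. With 14 draws from 10 rows, missing_rows is usually non-empty.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
df <- data.frame(matrix(rnorm(20), nrow = 10))

# Draw 14 row indices with replacement, as in the example above
idx <- sample(nrow(df), 14, replace = TRUE)

# Row numbers that never appear in the draw
missing_rows <- setdiff(seq_len(nrow(df)), idx)
missing_rows

# TRUE only if every original row was drawn at least once
all(seq_len(nrow(df)) %in% idx)
```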

  • Welcome to SO! To help us help you, can you provide an example of your small data frame and of the one you are trying to obtain? To add a good reproducible example, please read: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – dc37 Jan 18 '20 at 07:51
  • Hi, it would be helpful if you could elaborate. From the question, what I understand is that you want to merge two data frames while keeping all observations of the old one? – Tushar Lad Jan 18 '20 at 08:01
  • `n <- nrow(df); i <- c(sample(n), sample(n, 5)); i <- sample(i); df_new <- df[i, ]`. – Rui Barradas Jan 18 '20 at 08:01
  • Just edited for some clarification. – Sebastian K Jan 18 '20 at 08:21
  • Thanks @RuiBarradas-ReinstateMonic, would this work for n_new = 25 if I specify replace = TRUE? – Sebastian K Jan 18 '20 at 08:29
  • My answer can cope with that case. – Rui Barradas Jan 18 '20 at 08:55

3 Answers

  • Maybe you can try the code below (using sample())
dfout <- rbind(df, df[sample(seq_len(nrow(df)), 5, replace = TRUE), ], make.row.names = FALSE)

such that

> dfout
           X1          X2
1  -0.6264538  1.51178117
2   0.1836433  0.38984324
3  -0.8356286 -0.62124058
4   1.5952808 -2.21469989
5   0.3295078  1.12493092
6  -0.8204684 -0.04493361
7   0.4874291 -0.01619026
8   0.7383247  0.94383621
9   0.5757814  0.82122120
10 -0.3053884  0.59390132
11  0.4874291 -0.01619026
12 -0.8356286 -0.62124058
13 -0.3053884  0.59390132
14 -0.8204684 -0.04493361
15  0.7383247  0.94383621
  • or something like below (using replicate)
n <- 15
dfout <- head(do.call(rbind, replicate(ceiling(n / nrow(df)), df, simplify = FALSE)), n)

such that

> dfout
           X1          X2
1  -0.6264538  1.51178117
2   0.1836433  0.38984324
3  -0.8356286 -0.62124058
4   1.5952808 -2.21469989
5   0.3295078  1.12493092
6  -0.8204684 -0.04493361
7   0.4874291 -0.01619026
8   0.7383247  0.94383621
9   0.5757814  0.82122120
10 -0.3053884  0.59390132
11 -0.6264538  1.51178117
12  0.1836433  0.38984324
13 -0.8356286 -0.62124058
14  1.5952808 -2.21469989
15  0.3295078  1.12493092
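
The replicate() approach above amounts to recycling the row indices until n is reached. A shorter sketch of the same idea follows, assuming n >= nrow(df); rep_len() is base R, while dfout2 and the stand-in data frame are illustrative names and values, not the answer's data. Like the replicate() version, this is deterministic and does not shuffle the rows.

```r
# Small stand-in data frame (illustrative values only)
df <- data.frame(X1 = 1:10, X2 = letters[1:10])
n <- 15

# Recycle the row indices 1..nrow(df) up to length n
idx <- rep_len(seq_len(nrow(df)), n)
dfout2 <- df[idx, , drop = FALSE]
row.names(dfout2) <- NULL  # reset to consecutive row names

# Compare against the replicate()-based version
dfout <- head(do.call(rbind, replicate(ceiling(n / nrow(df)), df, simplify = FALSE)), n)
row.names(dfout) <- NULL
isTRUE(all.equal(dfout, dfout2))
```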

DATA

df <- structure(list(X1 = c(-0.626453810742332, 0.183643324222082, 
-0.835628612410047, 1.59528080213779, 0.329507771815361, -0.820468384118015, 
0.487429052428485, 0.738324705129217, 0.575781351653492, -0.305388387156356
), X2 = c(1.51178116845085, 0.389843236411431, -0.621240580541804, 
-2.2146998871775, 1.12493091814311, -0.0449336090152309, -0.0161902630989461, 
0.943836210685299, 0.821221195098089, 0.593901321217509)), class = "data.frame", row.names = c(NA, 
-10L))
ThomasIsCoding
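observations_needed <- 15

# Extra row indices beyond one full copy, drawn with replacement
new_rows <- sample(
  x = nrow(df), 
  size = observations_needed - nrow(df),
  replace = TRUE)

# Every original row once, plus the extras, in random order
all_rows <- c(seq_len(nrow(df)), new_rows)
result <- sample(all_rows)

new_df <- df[result, ]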
Christian

The following function does what the question asks for.

Explanation:

  1. Create a vector i holding a permutation of the row numbers of X, followed by `more` row numbers sampled at random with replacement. This could be changed to sampling without replacement if nrow(X) >= more.
  2. Shuffle that vector i.
  3. Extract rows i from the original data frame X.
  4. Set the row names to consecutive integers and return to caller.

Here it is.

larger_df <- function(X, more){
  if(missing(more)) stop(sQuote("more"), " is missing with no default.")
  n <- nrow(X)
  i <- c(sample(n), sample(n, more, replace = TRUE))
  i <- sample(i)
  Y <- X[i, , drop = FALSE]
  row.names(Y) <- NULL
  Y
}

set.seed(1234)
df1 <- data.frame(matrix(rnorm(20), nrow = 10))

larger_df(df1)               # stops with an error: 'more' is missing
larger_df(df1, 5)
larger_df(df1, 25)
larger_df(data.frame(), 5)   # edge case: empty data frame
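
Step 1 of the explanation notes that the extra rows could be drawn without replacement when nrow(X) >= more. A minimal sketch of that variant follows. The function larger_df2 and its replace argument are hypothetical names, not part of the answer, and the id column in the test data exists only to make the coverage check easy.

```r
# Hypothetical variant of larger_df() with a switch for how the extras are drawn
larger_df2 <- function(X, more, replace = TRUE) {
  n <- nrow(X)
  if (!replace && more > n) {
    stop("cannot draw ", more, " extra rows without replacement from ", n, " rows")
  }
  # One full permutation guarantees coverage; then append the extra rows
  i <- c(sample.int(n), sample.int(n, more, replace = replace))
  i <- sample(i)  # shuffle so guaranteed and extra rows are interleaved
  Y <- X[i, , drop = FALSE]
  row.names(Y) <- NULL
  Y
}

set.seed(1234)
df1 <- data.frame(id = 1:10, x = rnorm(10))
out <- larger_df2(df1, 5, replace = FALSE)

nrow(out)                 # 15
all(df1$id %in% out$id)   # TRUE: every original row is present
max(table(out$id))        # 2: without replacement, no row appears more than twice
```

With replace = TRUE the coverage guarantee is unchanged, since it comes from the initial full permutation rather than from the extra draws.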
Rui Barradas