Randomly select a certain percentage of rows and create new columns

Question

I have a species column containing 10 species names. I have to distribute the species into four columns randomly such that each column will take a specific percentage of species.

Let's say the first column takes 20%, the second 30%, the third 40% and the last 10%. The four columns will be four different environments i.e.:

Restricted, Tidalflat, beach, estuary

Hence the column intake will be predefined but the selection will be random.

My input data will look like this:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')

Result should look like this:

Thank you for the reply. However, I am having trouble creating four columns simultaneously with the specified sizes. — bidisha ..som, Sep 29 '16 at 06:56
Please have a look at [How to make a great R reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Ronak Shah, Sep 29 '16 at 06:58
`dplyr::sample_frac(mydata, 0.1, replace = F)` gets 10% of your data. — ali srn, Sep 29 '16 at 07:04
No, the columns cannot intersect. Species that have already been used cannot be used in the second column. — bidisha ..som, Sep 29 '16 at 07:06
Please make your [data reproducible](http://stackoverflow.com/questions/5963269), avoid pasting data as images. — zx8754, Sep 29 '16 at 07:43

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

Some simple setup:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
rspecies <- sample(species)

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')

probs <- c(.2, .3, .4, .1)

nrs <- round(length(species) * probs)

Now, a data.frame with separate columns is not a very good way of expressing your data, as your data is not rectangular, i.e. you don't have the same number of observations in each column.

You can either present the data in long form:

df <- data.frame(species = rspecies, envir = rep(envirs, nrs), stringsAsFactors = FALSE)

     species      envir
1    Tellina Restricted
2     Natica Restricted
3       Arca  Tidalflat
4     Mactra  Tidalflat
5    Tellina  Tidalflat
6       Arca      Beach
7  Nassarius      Beach
8    Cardium      Beach
9    Cardium      Beach
10    Natica    Estuary

Or as a list:

split(rspecies, df$envir)

$Beach
[1] "Mactra" "Natica" "Arca"   "Arca"  

$Estuary
[1] "Tellina"

$Restricted
[1] "Nassarius" "Cardium"  

$Tidalflat
[1] "Cardium" "Natica"  "Tellina"

Edit:

One way to accommodate a different number of species is to make the assignment probabilistic, drawing the environment for each species according to the given probabilities. This works better the larger the actual dataset is.

species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium', 'Cardium')
length(species2)

[1] 11

grps <- sample(envirs, size = length(species2), prob = probs, replace = TRUE)
df2 <- data.frame(species = species2, envir = grps, stringsAsFactors = FALSE) 
df2 <- df2[order(df2$envir), ]

     species      envir
5       Arca      Beach
10   Cardium      Beach
1     Natica    Estuary
11   Cardium    Estuary
3     Mactra Restricted
7    Tellina Restricted
2    Tellina  Tidalflat
4     Natica  Tidalflat
6       Arca  Tidalflat
8  Nassarius  Tidalflat
9    Cardium  Tidalflat

As a result of the rounding, you might run into trouble when the length of the vector isn't a multiple of 10. Adding up the values in the `nrs` vector may in such cases not be the same as the length of the `species` vector. — Jaap, Sep 29 '16 at 09:10
The environments in the second code block are not filled according to their assigned probabilities, i.e. Tidalflat should take only 3 species and not 5. — bidisha ..som, Sep 29 '16 at 12:35
Well, yeah, it's probabilistic. It will vary around the true required probabilities. — Axeman, Sep 29 '16 at 21:08
I require fixed probabilities; as I mentioned, the intake capacity of each environment is predetermined. — bidisha ..som, Sep 30 '16 at 04:28
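
The rounding concern raised in the comments can be avoided by computing group sizes that always add up to the number of species. Below is a minimal sketch, assuming a largest-remainder rule (swapped in here, not part of the original answer) and a hypothetical helper named `exact_counts`:

```r
# Hypothetical helper (an assumption, not from the answer above):
# exact group sizes for any n, using a largest-remainder rule.
exact_counts <- function(n, probs) {
  raw <- n * probs
  counts <- floor(raw + 1e-9)        # whole parts, with a small float tolerance
  leftover <- n - sum(counts)        # items not yet assigned to any group
  if (leftover > 0) {
    # hand one extra item to the groups with the largest fractional parts
    extra <- order(raw - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[extra] <- counts[extra] + 1
  }
  counts
}

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)
species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
              'Nassarius','Cardium','Cardium', 'Cardium')

nrs2 <- exact_counts(length(species2), probs)

# shuffle the species, then fill the environments in order with fixed sizes
df_exact <- data.frame(species = sample(species2),
                       envir = rep(envirs, nrs2),
                       stringsAsFactors = FALSE)
table(df_exact$envir)
```

With this approach the group sizes are fixed in advance and only the assignment of species to environments is random, which seems closer to the predetermined intake capacity asked for in the question.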

Wietze314 · Answer 2 · 2016-09-29T07:49:57.667

Maybe not in one line of code. I did not understand the column part, but you could use the code below to create a data frame; note that your column lengths will be unequal.

species <- 1:1000

ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]

or to match your example

species <- c('Natica'
             ,'Tellina'
             ,'Mactra'
             ,'Natica'
             ,'Arca'
             ,'Arca'
             ,'Tellina'
             ,'Nassarius'
             ,'Cardium'
             ,'Cardium')

ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]
dflength <- max(length(first20), length(second30), length(third40),length(forth10))
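# pad each chunk with NA so all columns have the same length;
# column names are unique (the original reused `f` twice)
data.frame(first = c(first20,rep(NA,dflength-length(first20)))
           ,second = c(second30,rep(NA,dflength-length(second30)))
           ,third = c(third40,rep(NA,dflength-length(third40)))
           ,fourth = c(forth10,rep(NA,dflength-length(forth10)))
           )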

Although I feel that some of the steps could be more compact, I'll let you fiddle with it some more.
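
One possible compact version of the chunking above is sketched here. It assumes the environment names and proportions from the question (`envirs` and `probs` are names introduced for this sketch) and uses the same floor-based cut points:

```r
species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')  # assumed column names
probs <- c(.2, .3, .4, .1)

ranspecies <- sample(species)
n <- length(ranspecies)

# cumulative cut points, like the floor() calls above; last chunk runs to the end
breaks <- c(0, floor(n * cumsum(probs) + 1e-9))
breaks[length(breaks)] <- n
sizes <- diff(breaks)

# split the shuffled vector into one chunk per environment, in the given order
chunks <- split(ranspecies, factor(rep(envirs, sizes), levels = envirs))

# indexing past the end of a vector yields NA, which pads the shorter chunks
longest <- max(lengths(chunks))
wide <- as.data.frame(lapply(chunks, function(x) x[seq_len(longest)]),
                      stringsAsFactors = FALSE)
wide
```

The result has one column per environment, padded with `NA` below the last species, so the unequal column lengths are still visible in the data frame.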

The species are shuffled randomly and then the data is divided into 4 chunks. Which chunk a species ends up in is random. It is similar to rolling a die per species to select the group it should be in, but now you can be sure that the groups have exactly the percentages you want. — Wietze314, Sep 29 '16 at 07:37
I think my examples will help you understand my problem better. — bidisha ..som, Sep 29 '16 at 07:41
