I have a species column containing 10 species names. I have to distribute the species randomly over four columns such that each column takes a specific percentage of the species.

Let's say the first column takes 20%, the second 30%, the third 40% and the last 10%. The four columns are four different environments:

Restricted, Tidalflat, Beach, Estuary

Hence the intake of each column is predefined, but the selection is random.

My input data will look like this:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')

The result should look like this:

[image of the expected result; not available]

2 Answers

3

Some simple setup:

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
rspecies <- sample(species)

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')

probs <- c(.2, .3, .4, .1)

nrs <- round(length(species) * probs)
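As a hedged aside: `round()` gives counts that add up to `length(species)` here, but that is not guaranteed when the length is not a multiple of 10. Below is a minimal sketch of one possible workaround, a largest-remainder allocation. The helper name `exact_counts` and the small tolerance are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper: turn proportions into integer counts that sum to n.
exact_counts <- function(n, probs) {
  raw <- n * probs
  # Round down first; the tiny tolerance guards against floating-point
  # results such as 2.9999999999 where 3 is meant.
  counts <- floor(raw + 1e-9)
  remainder <- n - sum(counts)
  if (remainder > 0) {
    # Hand the leftover units to the groups with the largest fractional parts.
    extra <- order(raw - counts, decreasing = TRUE)[seq_len(remainder)]
    counts[extra] <- counts[extra] + 1
  }
  counts
}

probs <- c(.2, .3, .4, .1)
counts10 <- exact_counts(10, probs)  # same counts as round() for 10 species
counts11 <- exact_counts(11, probs)  # counts still add up to 11
```

The result can be used in place of `nrs` in `rep(envirs, nrs)`. For lengths that do not divide evenly, the realised shares can only approximate the requested percentages.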

Now, a data.frame with separate columns is not a very good way of expressing your data, as your data is not rectangular, i.e. you don't have the same number of observations in each column.

You can either present the data in long form:

df <- data.frame(species = rspecies, envir = rep(envirs, nrs), stringsAsFactors = FALSE)
     species      envir
1    Tellina Restricted
2     Natica Restricted
3       Arca  Tidalflat
4     Mactra  Tidalflat
5    Tellina  Tidalflat
6       Arca      Beach
7  Nassarius      Beach
8    Cardium      Beach
9    Cardium      Beach
10    Natica    Estuary

Or as a list:

split(rspecies, df$envir)
$Beach
[1] "Arca"      "Nassarius" "Cardium"   "Cardium"  

$Estuary
[1] "Natica"

$Restricted
[1] "Tellina" "Natica" 

$Tidalflat
[1] "Arca"    "Mactra"  "Tellina"
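If a four-column layout is still wanted, the list form can be padded with `NA` to a common length. This is only a sketch under stated assumptions: the counts are hard-coded to the 2/3/4/1 split used above, the seed is fixed purely so the example is repeatable, and the names `groups` and `wide` are illustrative.

```r
set.seed(1)  # assumption: fixed seed only to make the example repeatable

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
nrs <- c(2, 3, 4, 1)  # counts per environment, as computed above

# Shuffle the species and split them by environment, keeping the
# original environment order instead of alphabetical order.
groups <- split(sample(species), factor(rep(envirs, nrs), levels = envirs))

# Indexing past the end of a vector yields NA, which pads every
# group to the length of the largest one.
n_max <- max(lengths(groups))
wide <- as.data.frame(lapply(groups, function(x) x[seq_len(n_max)]),
                      stringsAsFactors = FALSE)
wide
```

Each column then holds one environment, with `NA` filling the unused cells.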

Edit:

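One way to accommodate a different number of species is to make the assignment probabilistic, drawing an environment for each species according to the given probabilities. The group sizes will then only approximate the requested percentages, and the approximation improves as the actual dataset gets larger.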

species2 <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium', 'Cardium')
length(species2)

[1] 11

grps <- sample(envirs, size = length(species2), prob = probs, replace = TRUE)
df2 <- data.frame(species = species2, envir = grps, stringsAsFactors = FALSE) 
df2 <- df2[order(df2$envir), ]
     species      envir
5       Arca      Beach
10   Cardium      Beach
1     Natica    Estuary
11   Cardium    Estuary
3     Mactra Restricted
7    Tellina Restricted
2    Tellina  Tidalflat
4     Natica  Tidalflat
6       Arca  Tidalflat
8  Nassarius  Tidalflat
9    Cardium  Tidalflat
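To see how closely the probabilistic assignment tracks the requested shares, the observed proportions can be compared with `probs`. The sketch below assumes a larger, made-up sample size of 1000 and a fixed seed; neither comes from the original data.

```r
set.seed(42)  # assumption: fixed seed only to make the example repeatable

envirs <- c('Restricted', 'Tidalflat', 'Beach', 'Estuary')
probs <- c(.2, .3, .4, .1)
n <- 1000  # hypothetical sample size, much larger than the 11 species above

grps <- sample(envirs, size = n, prob = probs, replace = TRUE)

# Observed share of each environment, in the same order as envirs.
observed <- as.numeric(table(factor(grps, levels = envirs))) / n

comparison <- rbind(target = probs, observed = observed)
colnames(comparison) <- envirs
round(comparison, 3)
```

With only a handful of species the observed shares can deviate noticeably from the targets, which is why this approach does not give fixed group sizes.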
Axeman
  • Thank you. This is helpful. – bidisha ..som Sep 29 '16 at 09:02
  • As a result of the rounding, you might run into trouble when the length of the vector isn't a multiple of 10. In such cases the values in the `nrs` vector may not add up to the length of the `species` vector. – Jaap Sep 29 '16 at 09:10
  • The environments in the second code block are not taking the number of species assigned by their probabilities, i.e. Tidalflat should take only 3 species and not 5. – bidisha ..som Sep 29 '16 at 12:35
  • Well, yeah, it's probabilistic. It will vary around the true required probabilities. – Axeman Sep 29 '16 at 21:08
  • I require fixed probabilities; as I mentioned, the intake capacity of each environment is predetermined. – bidisha ..som Sep 30 '16 at 04:28
1

Maybe not in one line of code. I did not understand the column part, but you could use the code below to create a data frame; note that your column lengths are unequal.

species <- 1:1000

ranspecies <- sample(species)
 first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]

or to match your example

species <- c('Natica'
             ,'Tellina'
             ,'Mactra'
             ,'Natica'
             ,'Arca'
             ,'Arca'
             ,'Tellina'
             ,'Nassarius'
             ,'Cardium'
             ,'Cardium')

ranspecies <- sample(species)
first20 <- ranspecies[1:(floor(length(species)*.20))]
second30 <- ranspecies[(floor(length(species)*.20)+1):(floor(length(species)*.50))]
third40 <- ranspecies[(floor(length(species)*.50)+1):(floor(length(species)*.90))]
forth10 <- ranspecies[(floor(length(species)*.90)+1):length(species)]
dflength <- max(length(first20), length(second30), length(third40), length(forth10))
# pad each chunk with NA to a common length; column names must be unique
data.frame(Restricted = c(first20, rep(NA, dflength - length(first20))),
           Tidalflat  = c(second30, rep(NA, dflength - length(second30))),
           Beach      = c(third40, rep(NA, dflength - length(third40))),
           Estuary    = c(forth10, rep(NA, dflength - length(forth10))))

I feel that some of the steps could be more compact, but I'll let you fiddle with it some more.
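One possible way to compact the chunking, offered as a sketch rather than as the answer's own method: derive the chunk boundaries from `cumsum(probs)` and let `split()` do the slicing. The names `breaks`, `sizes` and `chunks`, the small tolerance and the fixed seed are assumptions for illustration.

```r
set.seed(7)  # assumption: fixed seed only to make the example repeatable

species <- c('Natica','Tellina','Mactra','Natica','Arca','Arca','Tellina',
             'Nassarius','Cardium','Cardium')
probs <- c(.2, .3, .4, .1)
n <- length(species)

# Upper boundary of each chunk; the tolerance guards against
# floating-point error, and the last boundary is pinned to n.
breaks <- floor(n * cumsum(probs) + 1e-9)
breaks[length(breaks)] <- n
sizes <- diff(c(0, breaks))

# Label each shuffled position with its chunk number, then split.
chunks <- split(sample(species), rep(seq_along(probs), sizes))
chunks
```

For the 10 species in the question this gives chunks of 2, 3, 4 and 1 elements, matching the floor-based indexing above.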

Wietze314
  • The species are sorted randomly and then the data is divided into 4 chunks. The chunk a species ends up in is random. It is similar to rolling a die per species to select the group it should be in, but now you are sure that the groups are exactly the percentages you want. – Wietze314 Sep 29 '16 at 07:37
  • I think my examples will help you to understand my problem better. – bidisha ..som Sep 29 '16 at 07:41
  • Thank you for the reply. I think that will help. – bidisha ..som Sep 29 '16 at 07:50