I am trying to sample rows from Excel data without replacement, but I keep getting duplicates. I also tried setting `replace = TRUE` and had the same issue. Sorry, I am new to R. Code below:

library(dplyr)
library(purrr)
library(tidyr)

#Read in the data
DuT = read.csv("APATA.csv", stringsAsFactors = TRUE)

#Filter by number of transformers
Df1 = DuT %>%
  group_by(DSS.NAME)%>%
  dplyr::summarise(no_rows = length(DSS.NAME))
str(Df1)

#Create Sample Size column
Df1$SampleSize = ifelse(Df1$no_rows >= 1 &Df1$no_rows <= 10, 10,
                        ifelse(Df1$no_rows >= 11 & Df1$no_rows <= 49, 12,
                               ifelse(Df1$no_rows >= 50 & Df1$no_rows <= 99, 17,
                                      ifelse(Df1$no_rows >= 100 & Df1$no_rows <= 199, 24,
                                             ifelse(Df1$no_rows >= 200 & Df1$no_rows <= 299, 27,
                                                    ifelse(Df1$no_rows >= 300 & Df1$no_rows <= 499, 32,
                                                           ifelse(Df1$no_rows >= 500  & Df1$no_rows <= 799, 32,
                                                                  ifelse(Df1$no_rows >= 800 & Df1$no_rows <= 999, 44,
                                                                         ifelse(Df1$no_rows >= 1000 & Df1$no_rows <= 1299, 49,
                                                                                ifelse(Df1$no_rows >= 1300 & Df1$no_rows <= 1500, 57,0))))))))))

sum(Df1$SampleSize)

#Sample based on  name and sampleSize column
Df2 = DuT %>%
  group_by(DSS.NAME) %>%
  arrange(DSS.NAME) %>%
  tidyr::nest() %>%            
  ungroup() %>%
  mutate(n = Df1$SampleSize) %>%
  mutate(samp = purrr::map2(data, n, sample_n, replace = FALSE)) %>%
  select(-data) %>%
  select(-n) %>%
  tidyr::unnest(samp)
write.csv(Df2, "APATA_SAMPLED.csv", row.names = F)
write.csv(Df1, "APATA_SAMPLING SIZE.csv", row.names = F)

dput(The file I want to sample)

dput(EXpected output, with Town 1 not up to Sample size 12 displaying all out)

  • Please add data using `dput` and show the expected output for the same. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Jun 03 '20 at 09:11
  • If `Df1$no_rows==11`, then `SampleSize=12`: how could the sampling be without replacement? – Roland Jun 03 '20 at 09:25
  • For your `SampleSize` a combination of `case_when` and `mutate` seems useful. Avoid those `if`-chains. And you can combine your `mutate` and `select` functions in `Df2` – Martin Gal Jun 03 '20 at 09:32
  • Hi @Roland, the idea is that if the number of rows falls within a given range, the sample size should be the corresponding number. If the group has fewer rows than the sample size, all of its rows should be returned. Pardon my errors, I am new to R and also to Stack Overflow. – Black_Learner Jun 03 '20 at 10:26
  • How do I use `case_when`, @MartinGal? – Black_Learner Jun 03 '20 at 11:01
  • Please don't post data as images or in [.NORM format](https://xkcd.com/2116/). Take a look at [how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Martin Gal Jun 03 '20 at 11:33
  • `case_when` is part of the `dplyr` package. Take a look at [dplyr's documentation for case_when](https://dplyr.tidyverse.org/reference/case_when.html), there are some examples. – Martin Gal Jun 03 '20 at 11:38
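
As a rough sketch of the `case_when` suggestion in the comments above, the nested `ifelse` chain from the question could be written as shown below. The toy `no_rows` values are made up for illustration and are not the asker's data; the ranges and sample sizes are copied from the question as posted.

```r
library(dplyr)

# Hypothetical group counts standing in for the question's Df1
Df1 <- data.frame(no_rows = c(5, 11, 75, 150, 250, 400, 600, 900, 1100, 1400, 2000))

# One condition per range; the first matching condition wins
Df1 <- Df1 %>%
  mutate(SampleSize = case_when(
    between(no_rows, 1, 10)      ~ 10,
    between(no_rows, 11, 49)     ~ 12,
    between(no_rows, 50, 99)     ~ 17,
    between(no_rows, 100, 199)   ~ 24,
    between(no_rows, 200, 299)   ~ 27,
    between(no_rows, 300, 499)   ~ 32,
    between(no_rows, 500, 799)   ~ 32,
    between(no_rows, 800, 999)   ~ 44,
    between(no_rows, 1000, 1299) ~ 49,
    between(no_rows, 1300, 1500) ~ 57,
    TRUE                         ~ 0   # anything outside the listed ranges
  ))
```

All right-hand sides are plain numbers, so `case_when` sees a single type, and the final `TRUE ~ 0` plays the role of the last `0` in the `ifelse` chain.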

1 Answer

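Sampling without replacement fails when the requested sample size is larger than the group. One way around this is to split the nested data into two parts: groups whose sample size is at least their number of rows (take every row) and groups whose sample size is smaller (sample as usual):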

    # Nest by name and attach the sample size and the group size
    Df2 = DuT %>%
      group_by(DSS.NAME) %>%
      arrange(DSS.NAME) %>%
      tidyr::nest() %>%
      ungroup() %>%
      mutate(SS = Df1$SampleSize) %>%
      # number of rows of each nested table
      mutate(nofRows = map_dbl(data, nrow))

    # split into groups with SS >= popsize and groups with SS < popsize
    Df2_i  = Df2[Df2$SS >= Df2$nofRows, ]
    Df2_ii = Df2[Df2$SS < Df2$nofRows, ]

    # SS >= popsize: draw nofRows rows, i.e. keep every row of the group
    Df3_i = Df2_i %>%
      mutate(samp = purrr::map2(data, nofRows, sample_n, replace = FALSE)) %>%
      select(-data) %>%
      tidyr::unnest(samp)

    # SS < popsize: draw SS rows without replacement
    Df3_ii = Df2_ii %>%
      mutate(samp = purrr::map2(data, SS, sample_n, replace = FALSE)) %>%
      select(-data) %>%
      tidyr::unnest(samp)

    # join the tables
    Df = rbind(Df3_i, Df3_ii)

Note that `SS` is the sample size and `nofRows` is the number of rows (population size) of each nested table. I hope this helps.
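
The same idea can probably be done in one pass by capping the sample size at the group size with `pmin(SS, nofRows)` instead of splitting into two tables. The following is only a sketch on made-up data: the toy `DuT`, the `sizes` lookup table, and the `take` column are assumptions for illustration and do not come from the original post.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical toy data: Town1 has 3 rows, Town2 has 20 rows
DuT <- data.frame(
  DSS.NAME = c(rep("Town1", 3), rep("Town2", 20)),
  id = 1:23,
  stringsAsFactors = FALSE
)

# Hypothetical target sample sizes, one row per group
sizes <- data.frame(
  DSS.NAME = c("Town1", "Town2"),
  SS = c(10, 12),
  stringsAsFactors = FALSE
)

set.seed(1)  # only to make the sketch reproducible
Df <- DuT %>%
  tidyr::nest(data = -DSS.NAME) %>%        # one nested table per name
  left_join(sizes, by = "DSS.NAME") %>%    # attach SS by name, not by position
  mutate(
    nofRows = map_dbl(data, nrow),
    take    = pmin(SS, nofRows),           # never ask for more rows than exist
    samp    = map2(data, take, ~ sample_n(.x, .y, replace = FALSE))
  ) %>%
  select(-data) %>%
  tidyr::unnest(samp)
```

Here Town1 returns all 3 of its rows and Town2 returns 12 distinct rows. Joining the sizes by `DSS.NAME` rather than assigning `Df1$SampleSize` positionally is a design choice in this sketch, so the result does not depend on both tables being in the same order.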

Niz