1

my data set has got 821049 variables and 18 columns. I would like to take 9 columns for the stratified sampling. These are "BASKETS_NZ", "PIS", "PIS_AP" "PIS_DV", "PIS_PL", "PIS_SDV", "PIS_SHOPS" "PIS_SR", "QUANTITY". My stratification variable is ID = 1:821049. How do I choose the intervals for my variables? How do I set the size of the sampling?

dpt(rbind(head(WKA_ohneJB, 10), tail(WKA_ohneJB, 10)))

structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 

821039L, 821040L, 821041L, 821042L, 821043L, 821044L, 821045L, 

821046L, 821047L, 821048L), BASKETS_NZ = c(1L, 1L, 1L, 1L, 1L, 

1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 

LOGONS = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 

1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), PIS = c(71L, 39L, 50L, 4L, 

13L, 4L, 30L, 65L, 13L, 31L, 111L, 33L, 3L, 46L, 11L, 8L, 

17L, 68L, 65L, 15L), PIS_AP = c(14L, 2L, 4L, 0L, 0L, 0L, 

1L, 0L, 2L, 1L, 13L, 0L, 0L, 2L, 1L, 0L, 3L, 8L, 0L, 1L), 

PIS_DV = c(3L, 19L, 4L, 1L, 0L, 0L, 6L, 2L, 2L, 3L, 38L, 

8L, 0L, 5L, 2L, 0L, 1L, 0L, 3L, 2L), PIS_PL = c(0L, 5L, 8L, 

2L, 0L, 0L, 0L, 24L, 0L, 6L, 32L, 8L, 0L, 0L, 4L, 0L, 0L, 

0L, 0L, 0L), PIS_SDV = c(18L, 0L, 11L, 0L, 0L, 0L, 0L, 0L, 

0L, 1L, 6L, 0L, 0L, 13L, 0L, 0L, 1L, 15L, 1L, 0L), PIS_SHOPS = c(3L, 

24L, 13L, 3L, 0L, 0L, 6L, 28L, 2L, 11L, 71L, 16L, 2L, 5L, 

6L, 0L, 1L, 0L, 3L, 2L), PIS_SR = c(19L, 0L, 14L, 0L, 0L, 

0L, 2L, 23L, 0L, 3L, 6L, 0L, 0L, 20L, 0L, 0L, 3L, 32L, 1L, 

0L), QUANTITY = c(13L, 2L, 18L, 1L, 14L, 1L, 4L, 2L, 5L, 

1L, 5L, 2L, 2L, 4L, 1L, 3L, 2L, 8L, 17L, 8L), WKA = c(1L, 

1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 

0L, 0L, 1L, 1L), NEW_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 

0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), EXIST_CUST = c(1L, 

1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 

1L, 1L, 1L, 1L), WEB_CUST = c(1L, 0L, 0L, 0L, 1L, 1L, 0L, 

1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), MOBILE_CUST = c(0L, 

1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 

1L, 0L, 1L, 0L), TABLET_CUST = c(0L, 0L, 0L, 0L, 0L, 0L, 

0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L), 

LOGON_CUST_STEP2 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 

0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(1L, 

2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 821039L, 821040L, 821041L, 

821042L, 821043L, 821044L, 821045L, 821046L, 821047L, 821048L

), class = "data.frame") 

enter image description here

enter image description here

Kitty123
  • 171
  • 2
  • 12
  • Does this answer your question: https://stackoverflow.com/questions/57924068/how-to-get-around-error-factor-has-new-levels-in-cross-validation-glm/57937180#57937180 – Dave2e Apr 03 '20 at 15:24
  • @ Dave2e I think it's going in the right direction. Where would I insert group by in the function? My task is to identify the online behaviour of users. The variables represent the number of page views per product page, number of shopping baskets. As can be seen from the descriptive statistics and graphs, the distribution of the variables is right skewed. How can I take into account uneven distributions of my variables in the intervals and how do I choose my intervals and sampling size? – Kitty123 Apr 03 '20 at 16:51

1 Answers1

2

Here is a solution to perform a stratified sampling based on multiple columns. Before implementing this, consider that your data is continuous and a sufficiently large that just a random sampling is adequate.

To solve this problem is to take a stratified sample from each group. The potential approaches to group the data together is by either pasting the 9 columns together or using dplyr's groupby function.

Using the solution is this question How to get around error "factor has new levels" in cross-validation glm? and updating with dplyr style.

This dplyr_stratified function will take the desired sampling ration and an arbitrary number of column and will return a data frame with the sampled rows. See the example below for taking 2 columns.

set.seed(1)
x <- rnorm(n = 100)
y <- rep(x = c("A","B"), times = c(50,50))
z <- rep(x = c("D","E","F"), times = c(33,33,34))
data <- data.frame(x, y=sample(y, replace = TRUE), z=sample(z, replace=TRUE))

library(dplyr)
#optional tag row for later identification: 
data$rowid<-1:nrow(data)
dplyr_stratified <- function(df, percent, ...){
  columns<-enquos(...)
   #group then sample each group
  out<-df %>% group_by(!!!columns)  %>% slice( sample(1:n(), percent*n())) 
}

testgroup<-dplyr_stratified(data, 0.8, z, y)
testgroup

Note: this is assuming each grouping will have a sufficient number of sample in order to select a representative sample. (If the groups are too small then this approach may not meet expectations)

Dave2e
  • 22,192
  • 18
  • 42
  • 50