I have a data set that I need to sample. Part of the data looks like this:

row.names  customer_ID
1           10000000
2           10000000
3           10000000    
4           10000000
5           10000005
6           10000005
7           10000008
8           10000008
9           10000008
10          10000008
11          10000008
12          10000008
...

I want to take the first 2 rows from each customer. Then, before including each further row, do a check: there is a 65% chance we take the next row and a 35% chance we quit and move to the next customer. If we take the row, we repeat the 65%/35% check until we run out of data for that customer or fail the check, in which case we move to the next customer. Repeat this for each customer.
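
Read literally, the rule above is a loop per customer. A minimal sketch, assuming the data frame is called `dat` (rebuilt here from the visible rows only, with integer IDs) and using a hypothetical helper name `sample_customer`:

```r
# Visible rows from the sample above (the elided rows are not reconstructed)
dat <- data.frame(
  customer_ID = c(rep(10000000L, 4), rep(10000005L, 2), rep(10000008L, 6))
)

# Hypothetical helper: keep the first 2 rows, then keep each further row
# with probability p_continue until a check fails or the rows run out
sample_customer <- function(x, p_continue = 0.65) {
  n_keep <- min(2, nrow(x))
  while (n_keep < nrow(x) && runif(1) < p_continue) {
    n_keep <- n_keep + 1
  }
  x[seq_len(n_keep), , drop = FALSE]
}

set.seed(1)  # arbitrary seed, only for repeatability
sampled <- do.call(rbind, lapply(split(dat, dat$customer_ID), sample_customer))
```

Each customer contributes at least 2 rows (or all of its rows if it has fewer) and never more rows than it has.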

sgibb
The classical question: what have you tried, and what were the problems you ran into? – Hidde Apr 12 '14 at 15:24

Welcome to SO. Please read: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example?rq=1 – sgibb Apr 12 '14 at 15:26

1 Answer

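The number of rows to take from each customer follows a negative binomial distribution. After the first two rows, the number of extra rows is the number of passed 65% checks before the first failed one, which rnbinom(1, 1, 0.35) draws directly (with size 1 this is the geometric distribution). The total is capped at the number of rows the customer actually has. Assuming your data is stored in dat: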

# Split your data by customer id
spl <- split(dat, dat$customer_ID)

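# Grab the correct number of rows from each customer:
# the first 2, plus a geometric number of extra rows, capped at nrow(x)
set.seed(144)
spl <- lapply(spl, function(x) x[seq_len(min(nrow(x), 2 + rnbinom(1, size = 1, prob = 0.35))), , drop = FALSE])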

# Combine into a final data frame
do.call(rbind, spl)
#            row.names customer_ID
# 10000000.1         1    10000000
# 10000000.2         2    10000000
# 10000000.3         3    10000000
# 10000000.4         4    10000000
# 10000005.5         5    10000005
# 10000005.6         6    10000005
# 10000008.7         7    10000008
# 10000008.8         8    10000008
# 10000008.9         9    10000008
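
One way to sanity-check the negative binomial shortcut is to compare it with an explicit 65%/35% loop. This is only a sketch: the helper name `extra_rows_loop` is hypothetical, the simulation size is arbitrary, and the cap at the customer's row count is deliberately left out.

```r
# Hypothetical helper: count how many extra rows an explicit loop would take
# before the first failed check (no cap on the number of available rows)
extra_rows_loop <- function(p_continue = 0.65) {
  k <- 0
  while (runif(1) < p_continue) k <- k + 1
  k
}

set.seed(144)
n_sim <- 1e5  # arbitrary number of simulated customers
loop_draws   <- replicate(n_sim, extra_rows_loop())
nbinom_draws <- rnbinom(n_sim, size = 1, prob = 0.35)

# Both means should be close to the theoretical value 0.65 / 0.35
c(loop = mean(loop_draws), nbinom = mean(nbinom_draws), theory = 0.65 / 0.35)

# The share of customers with no extra row should be close to 0.35 in both
c(loop = mean(loop_draws == 0), nbinom = mean(nbinom_draws == 0))
```

If the two sets of summaries agree up to simulation noise, drawing the count with rnbinom is a fair replacement for running the checks row by row.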
josliber