
Is there a better way to do the below? I am using R's data.table to carry out some sampling.

The goal is to sample rows from a table (samp.from.data) using the weights, drawing for each CP a number of rows equal to its count, so that the sampled UIDs can be added back to the original data...

library(data.table)
library(dplyr)   # for the %>% check below
set.seed(42)     # make the random sampling reproducible

count.data <- data.table(CP=LETTERS[1:10],
                         count=sample(10:60,10,replace=TRUE))

orig.data <- data.table(CP=rep(LETTERS[1:10],times=count.data$count),
                        vc=sample(letters[1:6],size=sum(count.data$count),replace=TRUE))

# check that orig.data's per-CP counts match count.data
orig.data %>% group_by(CP) %>% summarise(count=n())
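(As noted in the comments below, the same check can be done without dplyr using data.table's own grouping syntax; a minimal self-contained sketch:)

```r
library(data.table)
set.seed(42)

count.data <- data.table(CP = LETTERS[1:10],
                         count = sample(10:60, 10, replace = TRUE))
orig.data <- data.table(CP = rep(LETTERS[1:10], times = count.data$count),
                        vc = sample(letters[1:6], sum(count.data$count),
                                    replace = TRUE))

# per-CP row counts, equivalent to the group_by/summarise above
orig.data[, .N, by = CP]
```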


samp.from.data <- data.table(CP=rep(LETTERS[1:10],each=20),
                             UID=seq(200),
                             weight=runif(200,1,2))

setkey(count.data,'CP')
setkey(samp.from.data,'CP')
setkey(orig.data,'CP')

ll <- count.data[samp.from.data,]

ll1 <- ll[,.SD[sample(.N,head(count,1),replace=TRUE,prob=weight)],by=CP]
setkey(ll1,'CP')

# Add in the sampled values to the original data
# Is there a better way to do the sampling and add it back into the original data more directly?
orig.data$UID <- ll1[,UID]
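One possibly more direct route (a sketch, not necessarily faster) is to skip the intermediate join and sample inside a grouped `:=` assignment. Capturing `n <- .N` first is needed because inside the inner subset `.N` would refer to samp.from.data's group size rather than orig.data's:

```r
library(data.table)
set.seed(42)

count.data <- data.table(CP = LETTERS[1:10],
                         count = sample(10:60, 10, replace = TRUE))
orig.data <- data.table(CP = rep(LETTERS[1:10], times = count.data$count),
                        vc = sample(letters[1:6], sum(count.data$count),
                                    replace = TRUE))
samp.from.data <- data.table(CP = rep(LETTERS[1:10], each = 20),
                             UID = seq(200),
                             weight = runif(200, 1, 2))
setkey(samp.from.data, CP)

# For each CP group in orig.data, draw .N weighted samples of UID
# directly from the matching rows of samp.from.data, by reference.
orig.data[, UID := {
  n <- .N  # group size in orig.data, i.e. the count for this CP
  samp.from.data[.BY$CP, sample(UID, n, replace = TRUE, prob = weight)]
}, by = CP]
```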
h.l.m
    I don't get the "check that count.data is good" part. You are checking the `orig.data` there in fact. You could also do it with simple `data.table` syntax btw - `orig.data[, .N, by = CP]`. You also should probably add some `set.seed`s here as you're doing a lot of random sampling. – David Arenburg Jul 17 '15 at 13:18
  • Also, see [here](http://stackoverflow.com/questions/30358077/efficiently-merge-random-keyed-subset/), it seems related. – David Arenburg Jul 17 '15 at 13:53

0 Answers