I am trying to simulate mutation data with known parameters to use it further for testing regression functions. In this simulation I want mutation counts to be dependent on variables:
mutations ~ intercept + beta_cancer + beta_gene + beta_int + offset(log(ntAtRisk)))
where the offset parameter is the maximal number of counts that can theoretically happen.
Creating table with parameters
ncancers <- 20
ngenes <- 20
beta <- CJ(cancer = as.factor(0:ncancers), gene = as.factor(0:ngenes))
beta[, beta_cancer := rnorm(n = (ncancers+1), sd = 1)[cancer]]
beta[, beta_gene := rnorm(n = (ngenes+1), sd = 1)[gene]]
beta[, beta_int := rnorm(n = (ngenes+1)*(ncancers+1), sd = 1.5)]
beta[, ntAtRisk := abs(round(rnorm(n = (ngenes+1)*(ncancers+1), mean = 5000, sd = 2000), digits = 0))[gene]]
beta[, intercept := rnorm(n = (ngenes+1)*(ncancers+1), mean = 2, sd = 1)[gene]]
beta[cancer == "0", c("beta_cancer", "beta_int") := 0] # reference cancer type
beta[gene == "0", c("beta_gene", "beta_int") := 0] # reference gene
Simulating mutation counts
beta[, mu := exp(intercept + beta_cancer + beta_gene + beta_int + log(ntAtRisk))]
setkey(beta, cancer, gene)
dat <- beta
setkey(dat, cancer, gene)
dat[, mutations := rnbinom(n = nrow(dat), mu = mu, size = 1.5)]
dat[, mutations2 := MASS::rnegbin(n = nrow(dat),
mu = exp(intercept + beta_cancer + beta_gene +
beta_int + offset(log(ntAtRisk))),
theta = 1.5)]
mutations
and mutations2
are made using different functions where offset
variable either is included as a normal variable or, in second case, is specified as an offset.
However, the test I am doing doesn't pass any of them.
I need mutation counts to be no greater than ntAtRisk but it is not the case, unfortunately. I couldn't find on the internet how I can include the offset into the simulation. What are my options?
ggplot(dat, aes(ntAtRisk, mutations+0.5)) +
geom_point() +
xlim(0, max(dat$ntAtRisk)) +
ylim(0, max(dat$ntAtRisk)) +
geom_abline(color = "red")