I need to build a model that estimates the probability that a registered user will buy a paid plan (as opposed to staying on the free plan or doing nothing) and, if they do, after how much time. I have around 13 000 rows: around 12 000 are free users (never paid, value 0) and the other 1 000 paid after some time (from 1 to 690 days). I also have some count and categorical data: country, number of the user's clients, how many times the plan was used, and plan (premium, free, premium plus).

The mean time to payment is around 6.37 with variance 1801.17 when the zeros are included; without the zeros, the mean is around 100 and the variance 19012. This suggests to me that I should use a negative binomial model.
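For reference, this kind of mean/variance comparison can be reproduced on simulated data. The sketch below does not use the real data; the sample size, the share of payers, and the negative binomial parameters are all assumptions chosen only to give numbers of roughly the same magnitude.

```r
# Simulated stand-in for diff.time (NOT the real data; all parameters are assumptions)
set.seed(3)
n <- 13000
paid <- rbinom(n, 1, 1000 / 13000)                  # roughly 1 000 payers out of 13 000
time <- ifelse(paid == 1,
               1 + rnbinom(n, mu = 99, size = 0.6), # positive, overdispersed day counts
               0)                                   # non-payers are recorded as 0

# Mean and variance with the zeros included
c(mean = mean(time), var = var(time))

# Mean and variance of the positive values only
pos <- time[time > 0]
c(mean = mean(pos), var = var(pos))
```

In both cases the variance is far larger than the mean, which is the pattern that rules out a plain Poisson model.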

But I'm not sure which model fits best; I'm thinking about a zero-inflated negative binomial or hurdle model.

Here are histograms of diff.time with and without the zeros:

[Image: histogram of data without 0]

I tried these models with the pscl package:

m1 <- zeroinfl(
  diff.time3 ~ factor(Registration.country) + factor(Plan) +
    Campaigns.sent + Number.of.subscribers |
    factor(Registration.country) + factor(Plan) +
    Campaigns.sent + Number.of.subscribers,
  data = df, link = "logit", dist = "negbin"
)
summary(m1)

and the same with hurdle(), but both gave me errors. With zeroinfl():

Error in quantile.default(x$residuals): missing values and NaN's not allowed if 'na.rm' is FALSE In addition: Warning message: glm.fit: algorithm did not converge

with hurdle():

Error in solve.default(as.matrix(fit_count$hessian)) : Lapack routine dgesv: system is exactly singular: U[3,3] = 0

I have never tried these models before so I'm not sure how to fix these errors or if I chose the right models.

Unfortunately, I have no opportunity to share any part of my data, but I'll try to describe it:

1st column "plan": most of the data are "free" (around 12 000); the rest are "Earning more", "Premium" or "Premium trial", where "free" and "premium trial" are not paid.

2nd column "Plan used": around 8 000 rows are 0, 1 000 are 1, 3 000 are from 1 to 10, and another 1 000 are from 10 to 510.

3rd column "Clients" describes how many clients the user has: around 2 000 have 0, 4 000 have 1-10, 3 000 have 10-200, 2 000 have 200-1000, and 2 000 have 1000-340 000.

4th column "registration country": 36 different countries; over half of the data is United States, and the others have from 5 to a few hundred rows.

5th column is diff.time, which should be my dependent variable; as I said before, most of the data are 0 (12 000) and the others vary from 1 day to 690 days.

  • What you're doing seems reasonable, but it's hard to know exactly what's going on from what you've presented. Can you please include (minimal) data and/or code that will provide us with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) ? – Ben Bolker Mar 30 '16 at 15:25
  • Does `Plan` offer information as to the 0s? I.e. `Plan == "free"` would imply 0 time to pay, in which case you're going to run into problems in the binomial part of the model as you can completely predict the 0s - complete separation. – Gavin Simpson Mar 30 '16 at 15:27
  • Sadly, I am not able to give such an example; all I can do is describe the data. Plan "free" usually gives 0, but there are some exceptions (around 250 of 12 000 are not equal to 0, because they had this status when they first registered). – Greta Juknaitė Mar 30 '16 at 16:24
  • You can always simulate a little bit of fake data with a similar structure to your real data. [See here for tips on sharing data](http://stackoverflow.com/q/5963269/903061), see also the [wakefield package](https://github.com/trinker/wakefield) for creating random data. – Gregor Thomas Mar 30 '16 at 16:26
  • https://drive.google.com/file/d/0B8niceJono5lV082aVZoSWdIZXc/view?pref=2&pli=1 Here is an example that you can download and load into R. It is an Rdata file. – Greta Juknaitė Mar 30 '16 at 17:24

1 Answer

If your actual data is similarly structured to the data you posted then you will have problems estimating a model like the one you specified. Let's first have a look at the data you posted on the Google drive:

load("duom.Rdata")
table(a$diff.time3 > 0)
## FALSE  TRUE 
##   950    50 

Thus there is some variation in the response but not a lot. You have only 5% non-zeros, 50 observations overall. From this information alone it might seem more reasonable to fit a bias-reduced binary model (brglm) for the hurdle part (zero vs. non-zero).
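A minimal sketch of such a binary fit on simulated data is given here. The data frame `sim`, the variable `paid` and all coefficients are made up for illustration; only `brglm()` from the CRAN package brglm is the real function, and the sketch falls back to an ordinary `glm()` if that package is not installed.

```r
# Simulated data with roughly 5% non-zeros (all values are assumptions)
set.seed(1)
n <- 1000
sim <- data.frame(
  Campaigns.sent = rpois(n, 2),
  Plan = factor(sample(c("Free", "Premium"), n, replace = TRUE,
                       prob = c(0.95, 0.05)))
)
p <- plogis(-3.5 + 0.2 * sim$Campaigns.sent + 1.5 * (sim$Plan == "Premium"))
sim$paid <- rbinom(n, 1, p)   # 1 = non-zero diff.time, 0 = zero

# Hurdle part: zero vs. non-zero
if (requireNamespace("brglm", quietly = TRUE)) {
  # bias-reduced logistic regression
  m_bin <- brglm::brglm(paid ~ Campaigns.sent + Plan,
                        family = binomial, data = sim)
} else {
  # fallback: ordinary maximum likelihood logistic regression
  m_bin <- glm(paid ~ Campaigns.sent + Plan, family = binomial, data = sim)
}
summary(m_bin)
```

With the real data, `paid` would correspond to `diff.time3 > 0`, and the regressors should be limited to factors that have been cleaned up first.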

For the zero-truncated count part you can possibly fit a model, but you need to be careful about which effects you include because there are only 50 non-zero observations and hence at most 50 degrees of freedom. You can estimate the zero-truncated part of the hurdle model using the zerotrunc function in the countreg package, available from R-Forge.
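A rough sketch of the zero-truncated fit on simulated positive counts follows. The data frame `pos` and its parameters are assumptions, and the R-Forge repository URL in the comment is the usual one but should be checked; the model call is skipped if countreg is not installed.

```r
# install.packages("countreg", repos = "https://R-Forge.R-project.org")  # assumed repo URL

# Simulate 50 strictly positive counts (parameters are made up)
set.seed(2)
n_pos <- 50
pos <- data.frame(Campaigns.sent = rpois(n_pos, 2))
mu <- exp(3 + 0.1 * pos$Campaigns.sent)
y <- rnbinom(n_pos, mu = mu, size = 1)
while (any(y == 0)) {              # redraw zeros so the sample is zero-truncated
  idx <- y == 0
  y[idx] <- rnbinom(sum(idx), mu = mu[idx], size = 1)
}
pos$diff.time3 <- y

# Zero-truncated negative binomial with a deliberately small set of regressors
if (requireNamespace("countreg", quietly = TRUE)) {
  m_count <- countreg::zerotrunc(diff.time3 ~ Campaigns.sent,
                                 data = pos, dist = "negbin")
  print(summary(m_count))
}
```

Keeping the formula this small is intentional: with about 50 positive observations, a factor with dozens of levels would use up most of the available information.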

Also you should clean up your factors. By re-applying the factor function within the formula, levels with zero occurrences are excluded. But there are also levels with only one occurrence for which you will not get meaningful results.

table(factor(a$Plan))
## Earning much more              Free           Mailing           Premium 
##                 1               950                 1                24 
##     Premium trial 
##                24 
table(factor(a$Registration.country))
##  australia  Australia    Austria Bangladesh    Belgium     brasil     Brasil 
##          1        567          7          5         56          1         53 
##   Bulgaria     Canada 
##         10        300 

Also, you need to clean up the country levels with all lower-case letters.
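One possible way to do this clean-up in base R is sketched below on a toy version of `a`. The toy values, the helper `cap_first`, the threshold `min_n` and the choice to lump rare levels into "Other" are all assumptions for illustration; dropping those rows instead would be equally defensible.

```r
# Toy data mimicking the structure above (values are made up)
a <- data.frame(
  Registration.country = factor(c("australia", "Australia", "Australia",
                                  "brasil", "Brasil", "Canada", "Canada")),
  Plan = factor(c("Free", "Free", "Premium", "Free", "Mailing", "Free", "Premium"),
                levels = c("Earning much more", "Free", "Mailing", "Premium"))
)

# Capitalize the first letter so "australia" and "Australia" become one level
cap_first <- function(x) paste0(toupper(substring(x, 1, 1)), substring(x, 2))
a$Registration.country <- factor(cap_first(as.character(a$Registration.country)))

# Drop levels that never occur in the data
a$Plan <- droplevels(a$Plan)

# Lump levels with fewer than min_n observations into "Other"
min_n <- 2
rare <- names(which(table(a$Plan) < min_n))
plan_chr <- as.character(a$Plan)
plan_chr[plan_chr %in% rare] <- "Other"
a$Plan <- factor(plan_chr)

table(a$Registration.country)
table(a$Plan)
```

Once the factors are stored in cleaned form, they can be used directly in the model formula without wrapping them in factor().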

After that I would start out by building a binary GLM for zero vs. non-zero and, based on those results, continue with the zero-truncated count part.

Achim Zeileis