
I have a big dataset with over 10 million entries.

I'm supposed to do any analysis I want on it, so I decided to focus on a subset of the population: families in a certain country. That brings me down to about 150,000 entries. I have 26 variables and would like to run a logistic regression model on the data, but R says

Error: cannot allocate vector of size 130.3 Gb

I'm assuming it's because I just have too many variables. I tried searching for how to pick variables for a model, but functions like step require you to have the full model first, so I'm not sure how to proceed.

Am I supposed to eliminate variables I just don't think will have an effect on my response variable, or is my data set still too big?

Jonathan Hall
  • Can you please share some of your code and data? 150,000 rows with 26 columns should not necessarily result in a vector of size 130 GB. – Prometheus Jan 20 '18 at 21:08
  • What would cause that error? I can't reveal the data, but the code I ran was `model <- glm(is_booking ~ ., data = data, family = binomial())` – statsnewbie Jan 20 '18 at 21:25
  • @statsnewbie did you convert categorical variables to dummy variables? If so, please check/eliminate those with a large number of levels. – Prometheus Jan 20 '18 at 21:33
If you insist on fitting your logistic model in this way, then maybe `biglm` will do the trick. – J.R. Jan 20 '18 at 21:41
@Prometheus I guess that leads to my follow-up question: one of my variables is the country they want to book, and that has over 220 levels, but I would consider where they want to book an important variable. Should I include it or not? – statsnewbie Jan 21 '18 at 07:10
@statsnewbie Not sure that I get your question. Is country the target variable, i.e. the value you are trying to predict? If not, you can bin/group the levels with low frequency into one level. You should check: stats.stackexchange.com – Prometheus Jan 21 '18 at 21:54
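Putting the comments together, here is a minimal sketch of collapsing rare factor levels before fitting, with `biglm::bigglm` as a lower-memory fallback. The data frame `data`, the `country` factor, and `is_booking` come from the question; the top-20 cutoff and the extra predictor in the `bigglm` formula are hypothetical.

```r
library(biglm)

# Keep the most frequent countries and lump the rest into "other";
# the top-20 cutoff is arbitrary and worth tuning.
keep <- names(sort(table(data$country), decreasing = TRUE))[1:20]
data$country <- factor(ifelse(data$country %in% keep,
                              as.character(data$country), "other"))

# With far fewer dummy columns, a plain glm may now fit in memory:
model <- glm(is_booking ~ ., data = data, family = binomial())

# If it still doesn't, bigglm() fits the model in chunks; an explicit
# formula (stay_length is a hypothetical predictor) is safer here than "~ .":
model_big <- bigglm(is_booking ~ country + stay_length,
                    data = data, family = binomial(), chunksize = 10000)
```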

1 Answer


It would be nice if you provided a little bit more information. Nonetheless...

The first step, unless you are already quite familiar with the data, is to perform exploratory data analysis (EDA).

I assume you have a supervised learning problem, in which case you can plot the labeled outcome against the different variables. See the picture below.

[Image: distribution of family size by survival outcome in the Titanic data, from Kaggle]

What you see in the image is the distribution of one variable, family size, split by the outcome of survival in the Titanic disaster.
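If you want to reproduce that kind of plot yourself, here is a minimal sketch with ggplot2, assuming a data frame `titanic` with hypothetical `Survived` and `FamilySize` columns:

```r
library(ggplot2)

# Distribution of one predictor (family size), split by the labeled outcome.
ggplot(titanic, aes(x = FamilySize, fill = factor(Survived))) +
  geom_bar(position = "dodge") +
  labs(x = "Family size", fill = "Survived")
```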

As you iterate on this step, you'll get a better understanding of which variables contain relevant information for the prediction.

Soon after, you will also realize that you might need to build your own variables/columns from the original data. This process is called feature engineering.
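For instance, the family-size variable in the plot above is itself engineered from raw columns. A sketch using the usual Kaggle Titanic column names (`SibSp`, `Parch`, `Age` are assumptions, not part of your data):

```r
# Family size = siblings/spouses + parents/children + the passenger themselves.
titanic$FamilySize <- titanic$SibSp + titanic$Parch + 1

# Another common move: bucket a continuous variable into coarse groups.
titanic$AgeGroup <- cut(titanic$Age,
                        breaks = c(0, 12, 18, 60, Inf),
                        labels = c("child", "teen", "adult", "senior"))
```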

Only after that, I think, will you come across the problem of using more advanced statistical methods for feature selection. In that case, the caret package will come in quite handy.
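As a sketch of what that can look like, two of caret's filter-style helpers, assuming your predictors sit in a numeric data frame `x`:

```r
library(caret)

# Drop near-constant predictors, which carry almost no information.
nzv <- nearZeroVar(x)
if (length(nzv) > 0) x <- x[, -nzv]

# Drop one of each pair of highly correlated predictors.
high_cor <- findCorrelation(cor(x), cutoff = 0.9)
if (length(high_cor) > 0) x <- x[, -high_cor]
```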

For a more detailed introduction to machine learning, I would suggest you check out www.kaggle.com.

Hope this helps.

Prometheus