
I have a homework assignment where I have to find the best possible classification model for a dataset. My training set consists of 733 observations with 90,000 variables each.

My problem is the following: whenever I try to perform an operation on the dataset (mice, rpart, ...), I get the error "cannot allocate vector of size x Gb", with x being really large, like 30-60 Gb.

My question is: how can I deal with such a huge dataset?

Since there are not many observations but lots of feature variables, I believe a solution could be to derive new feature variables from the existing ones in order to reduce the number of variables, but I don't know whether this is possible in R or whether it would be statistically sound.
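For example, from my reading, principal component analysis (PCA) seems to be this kind of approach. Below is a minimal, untested sketch, assuming the predictors are in a numeric matrix X with one row per observation; the choice of 50 components and the use of the irlba package are just illustrative, the point being that a truncated PCA computes only the leading components instead of the full decomposition, which keeps memory usage low:

```r
# Hypothetical sketch: X is the 733 x 90000 numeric predictor matrix.
library(irlba)

# Compute only the first 50 principal components; a truncated PCA avoids
# allocating the enormous matrices a full decomposition would require.
pca <- prcomp_irlba(X, n = 50, center = TRUE)

# 733 x 50 matrix of derived features to model on instead of the raw variables
scores <- pca$x
```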

I did some research on the internet but found nothing that helps me. I would be very grateful if someone could help. It may be useful to mention that I have very little knowledge of R and statistics in general.

Thanks in advance for your response!

Julien Mertz
  • This problem turns up on this site at least once every few days, and it has already been well covered. You have a few choices, but the first thing to ask is whether you really need to be doing stats on a 30 GB data set. Could you take a 1 GB sample and still get meaningful results from that? Check the duplicate link for some options, including what you might be able to do if you really do need to work with such a large data set. – Tim Biegeleisen May 04 '19 at 16:04
  • Thanks for your answer! I already read the duplicate link but didn't find an answer, because I know I have to reduce the size of my dataset but can't figure out how to do it. Since there are only 733 observations, I guess I cannot take only 30-40 of them, since the model wouldn't be relevant. So this means I have to work with the feature variables, but I don't know what to do. All the topics I found involve datasets with lots of observations but only a few variables. My dataset is the opposite of those, so the topics don't really help. – Julien Mertz May 04 '19 at 16:14
  • OK? Use a dimension reduction approach ... – Roland May 04 '19 at 21:46 (a sketch of this idea follows below)
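Along the lines of Roland's comment, a simpler alternative to PCA is a univariate filter: score each variable, keep only the top few hundred, and fit the model on that subset. A rough, untested sketch, assuming X is the predictor matrix and y the class labels; the variance criterion and the cutoff of 1,000 columns are arbitrary illustrations:

```r
# Hypothetical sketch: X is the 733 x 90000 predictor matrix, y the class labels.
library(rpart)

# Keep the 1,000 columns with the highest variance (a crude unsupervised
# filter; both the criterion and the cutoff are arbitrary choices here).
v    <- apply(X, 2, var)
keep <- order(v, decreasing = TRUE)[1:1000]

# Fit a classification tree on the reduced data
d   <- data.frame(y = as.factor(y), X[, keep])
fit <- rpart(y ~ ., data = d)
```

Whatever filter is used, it should be applied inside cross-validation rather than once on the full training set, otherwise the selection step leaks information into the performance estimate.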

0 Answers