I have a CSV file which is more than 20 GB. I can read the first few lines using readlines and then figure out which columns I want to import. Is it possible to import only these columns using `h2o.importFile()`, or some other way in h2o, so that I am not loading unnecessary columns?

deepAgrawal
1 Answer
The `h2o.importFile()` function does not support loading only a subset of the columns. Here are some work-arounds:
- Load in the entire dataset and use the `x` argument in any modeling function to ignore certain columns.

      fit <- h2o.gbm(x = good_cols, y = y, training_frame = train)

- Load in the entire dataset and then create a new H2OFrame which only contains the columns you want.

      newdf <- df[, good_cols]

- Create a copy of your data on disk that contains only the columns you want. This is easy to do using the `cut` tool (example here).

      cut -d, -f2-4,6-10 train.csv > newtrain.csv
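If you know the column names rather than their positions, the field numbers for `cut` can be looked up from the header line first. The following is only a rough sketch: the file names `sample_train.csv` and `sample_newtrain.csv`, the sample data, and the column names in `wanted` are all made up for illustration, and it assumes a plain comma-separated file with a single header row.

```shell
#!/bin/sh
# Hypothetical small file standing in for the real 20 GB CSV.
printf 'id,age,income,city,label\n1,34,52000,Oslo,0\n2,51,61000,Lima,1\n' > sample_train.csv

# Names of the columns to keep (assumed for illustration).
wanted='age,income,label'

# Read only the header line and map the wanted names to 1-based field numbers.
fields=$(head -n 1 sample_train.csv | awk -F, -v wanted="$wanted" '
    BEGIN { n = split(wanted, w, ","); for (i = 1; i <= n; i++) keep[w[i]] = 1 }
    {
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i in keep) out = out (out == "" ? "" : ",") i
        print out
    }')

echo "Keeping fields: $fields"

# cut streams the file line by line, so the whole file is never held in memory.
cut -d, -f"$fields" sample_train.csv > sample_newtrain.csv
cat sample_newtrain.csv
```

Note that `cut` splits on every comma, so this only works if no field contains a quoted, embedded comma. For files like that, a CSV-aware tool would be needed instead.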

Erin LeDell
+1 for `cut` (I am amazed I never knew about that command!). I guess the point of the question was "20gb is too big to fit in memory", so the other solutions won't be usable. – Darren Cook May 31 '18 at 17:41